Keywords
R package, UK Biobank, COVID-19, GWAS, risk factors
This article is included in the RPackage gateway.
This article is included in the Emerging Diseases and Outbreaks gateway.
This article is included in the Coronavirus (COVID-19) collection.
COVID-19 caused by SARS-CoV-2 has resulted in a global pandemic with a rapidly developing global health and economic crisis. Variations in the disease have been observed and have been associated with the genomic sequence of either the human host or the pathogen. Worldwide scientists scrambled initially to recruit patient cohorts to try and identify risk factors. A resource that presented itself early on was the UK Biobank (UKBB), which is investigating the respective contributions of genetic predisposition and environmental exposure to the development of disease. To enable COVID-19 studies, UKBB is now receiving COVID-19 test data for their participants every two weeks. In addition, UKBB is delivering more frequent updates of death and hospital inpatient data (including critical care admissions) on the UKBB Data Portal. This frequently changing dataset requires a tool that can rapidly process and analyse up-to-date data. We developed an R package specifically for the UKBB COVID-19 data, which summarises COVID-19 test results, performs association tests between COVID-19 susceptibility/severity and potential risk factors such as age, sex, blood type, comorbidities and generates input files for genome-wide association studies (GWAS). By applying the R package to data released in April 2021, we found that age, body mass index, socioeconomic status and smoking are positively associated with COVID-19 susceptibility, severity, and mortality. Males are at a higher risk of COVID-19 infection than females. People staying in aged care homes have a higher chance of being exposed to SARS-CoV-2. By performing GWAS, we replicated the 3p21.31 genetic finding for COVID-19 susceptibility and severity. The ability to iteratively perform such analyses is highly relevant since the UKBB data is updated frequently. As a caveat, users must arrange their own access to the UKBB data to use the R package.
The newly revised article contains additional information as suggested by the reviewer, which includes 1) a discussion of long COVID and relevant functions in the UKB.COVID19 R package; 2) a "Statistical analysis" section in the methods section; 3) a vignette in the UKB.COVID19 R package.
The ongoing global pandemic of coronavirus disease 2019 (COVID-19), caused by a novel coronavirus (severe acute respiratory syndrome coronavirus 2, SARS-CoV-2), has resulted in a rapidly developing global health and economic crisis. Most people with COVID-19 never develop symptoms or suffer mild symptoms. However, about 5% of cases are critical (defined as respiratory failure, septic shock, and/or multiorgan dysfunction or failure) (Wu and McGoogan 2020), possibly leading to lethal lung damage and even death. These and other clinical observations led to the hypothesis that genetic factors in either or both the host and the pathogen could be responsible, at least in part, for this variation. Worldwide scientists scrambled initially to recruit patient cohorts to try and identify genetic risk factors.
UK Biobank (UKBB) (RRID: SCR_012815) is a long-term biobank study that recruited 500,000 volunteers aged 40–69 years in 2006–2010 from across the UK. UKBB’s large-scale database is a global research resource accessible to approved researchers who are undertaking health-related research. All participants provided detailed information about their lifestyle, underwent physical measurements and had blood, urine and saliva samples collected. The samples of all participants have undergone SNP array typing and are now also undergoing whole-exome and whole-genome sequencing. UKBB has become a major contributor to the advancement of modern medicine and treatment, enabling a better understanding of a wide range of serious and life-threatening diseases.
Researchers can apply for access to the data and worldwide hundreds of researchers are using the UKBB data to carry out research on many different diseases. The UKBB has facilitated first-time analyses on traits such as brain imaging phenotypes (Elliott et al., 2018).
The UK has been badly affected by COVID-19. As of 20 May 2021, there have been over 127,000 reported deaths in the UK, with an estimated 4.5 million infections. Worldwide there have now been more than 3 million reported deaths due to COVID-19, with continually increasing rates of infections in India and South America. The UKBB was an early, available population genetic resource that could be harnessed to better understand COVID-19 risk factors, and with its continuing evolution continues to serve as a powerful cohort to permit such studies.
UKBB has taken swift strides to help tackle the global pandemic by undertaking four major initiatives: serology study, COVID-19 repeat imaging study, coronavirus self-test antibody study and health data linkage. UKBB has been receiving COVID-19 test data for previous UKBB participants in England and has linked the test result data with health data. The test results data are being updated every two weeks. In addition, UKBB is making more frequent updates of death and hospital inpatient data (including critical care admissions) on the Data Portal. This rapidly changing dataset requires a tool that can process the up-to-date data as frequently as the data updates, in a standardised, reproducible, and somewhat automated manner to permit rapid re-analysis of the data and to also enable other researchers to use such a tool as a basis for their analyses.
Therefore, we developed an R package, UKB.COVID19 (built in R version 4.0.5), which summarises COVID-19 test results, combines test results data with hospitalisation data and death register data, performs association tests between COVID-19 susceptibility/severity and potential risk factors (age, sex, blood type, socioeconomic status, comorbidities etc.) and generates input files for genome-wide association studies (GWAS). Ethics approval was granted through WEHI project 17/09LR by the WEHI’s Human Research Ethics Committee (HREC).
UKB.COVID19 was built in R (version 4.0.5) and currently depends on the following R packages: questionr, data.table, tidyverse, magrittr, here, and dplyr. COVID-19 related data files from UKBB can be directly imported in the R package without any pre-processing.
UKB.COVID19 is distributed as part of the CRAN R package repository and is compatible with Mac OS X, Windows, and major Linux operating systems. UKB.COVID19 is maintained at GitHub (https://github.com/bahlolab/UKB.COVID19). The archived source code can be found in http://doi.org/10.5281/zenodo.5174381 (Wang et al., 2021). All analyses are performed using R (version 4.0.5). All functions and descriptions are listed in Table 1.
| Function | Description | 
|---|---|
| risk_factor | Selects several potential non-genetic risk factors from the linked health data provided by UKBB and generates an output file including the selected risk factors for the downstream analyses. Automatically returns sex, age at birthday in 2020, socioeconomic status, self-reported ethnicity, most recently reported body mass index, most recently reported pack-years of smoking, whether they reside in aged care (based on hospital admissions data, and COVID-19 test data) and blood type. Function also allows users to specify fields of interest (field codes, provided by UK Biobank), and allows the user to specify more intuitive names for selected fields. | 
| makePhenotypes | Summarises COVID-19 test results data, death register data and hospital inpatient data and returns data.frame and outputs a phenotype file with phenotypes for COVID-19 susceptibility, severity or mortality. | 
| comorbidity_summary | Summarises disease history records of each individual from the hospital inpatient diagnosis data and generates a file including all comorbidities based on ICD10 code, which can be used in the comorbidity association tests. | 
| comorbidity_asso | Performs association tests using logistic regression models, adjusts the tested phenotype with covariates and outputs a table comprised of odds ratios (ORs), 95% confidence intervals (CIs) of ORs, and p-values for all the comorbidity categories. | 
| sampleQC | Collates genetic QC data, as provided by UKBB, and outputs lists of samples for inclusion/exclusion, for use with PLINK (Purcell et al., 2007) and/or SAIGE (Zhou et al., 2018). Also outputs a csv file summarising sample-level QC metrics. | 
| variantQC | Collates genetic QC data, as provided by UKBB and outputs lists of variants for inclusion in downstream analyses, for use with PLINK and/or SAIGE. | 
| makeGWASFiles | Output phenotype files, formatted to be used as input for GWAS, or other genetic analyses, with PLINK and/or SAIGE. | 
| log_cov | Performs association tests using logistic regression models. | 
COVID-19 test results data are being provided to the UKBB by Public Health England (PHE), Public Health Scotland (PHS) and SAIL Databank for English, Scottish and Welsh data respectively. The data have been updated approximately once every two weeks since 16 March 2020. Most samples tested for the COVID-19 disease-causing virus, SARS-CoV-2, are from combined nose/throat swabs. In intensive care settings, lower respiratory tract samples may also have been taken and analysed. The data consists of the encoded participant ID, date the specimen was taken, specimen type (e.g. nasal, nose and throat, sputum), the laboratory that processed the sample, whether the sample was reported as positive or negative for SARS-CoV-2, the requesting organisation description, as well as other variables. The test result data used in the analyses of this report are up to 6 April 2021.
The death register data includes the date of death, the primary and contributory causes of death, coded using the ICD-10 system. The death register data have been updated every one or two months. The death register data used in the analyses of this report are up to 23 March 2021.
The hospital inpatient data consist of seven tables: 1) HESIN: the overall master table, providing information on admissions and discharges, the type of admission and other information related to the inpatient record as a whole. 2) HESIN_DIAG: diagnosis codes (ICD-9 or ICD-10) relating to inpatient records, including primary diagnoses and secondary diagnoses. The primary diagnosis is the main condition treated or investigated during the relevant episode. A secondary diagnosis is a clinically relevant contributory factor or issue that impacts the primary diagnosis (including chronic conditions). 3) HESIN_OPER: operations and procedures codes (OPCS-3 or OPCS-4) relating to inpatient episodes. 4) HESIN_CRITICAL: a child table of HESIN containing further information about those hospital episodes that required treatment in a critical care unit. 5) HESIN_PSYCH: a sibling table to HESIN containing fields relating to administrative aspects of psychiatric admissions. 6) HESIN_MATERNITY: a sibling table to HESIN containing fields relating specifically to maternity admissions. 7) HESIN_DELIVERY: Information regarding a child born as a result of a HESIN_MATERNITY record, where applicable. In this study, we use the HESIN, the HESIN_DIAG, the HESIN_OPER, and the HESIN_CRITICAL tables. The hospital inpatient data used in the analyses of this report are up to 5 February 2021.
The makePhenotypes function defines multiple COVID-19 traits, related to susceptibility, severity and mortality, which may be used for association testing and GWAS (Table 2).
The COVID-19 related phenotypes output from the makePhenotypes function in the UKB.COVID19 R package.
For susceptibility analysis, we generated a proxy variable, which includes all participants who have been tested for COVID-19 and define those who received at least one positive result as cases. By 6 April 2021, 77,222 individuals in the UKBB had received COVID-19 tests and 16,562 had tested positive for COVID-19 on at least one occasion. The pheno.type = “susceptibility” option summarises the COVID-19 test results data and generates a susceptibility phenotype for association tests and GWAS.
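The case definition above can be sketched in a few lines of base R. This is an illustration on toy data, not the package's implementation: the data frame `tests` and its column names (`eid`, `result`) are hypothetical stand-ins for the UKBB test result table.

```r
# Toy test-result records: one row per test (hypothetical column names).
tests <- data.frame(
  eid    = c(1, 1, 2, 3, 3, 3),
  result = c(0, 1, 0, 0, 0, 1)   # 1 = positive, 0 = negative
)

# One row per tested participant; a case has at least one positive result,
# so the per-participant maximum of the 0/1 results is the phenotype.
suscept <- aggregate(result ~ eid, data = tests, FUN = max)
names(suscept)[2] <- "pos.test"

print(suscept)
print(table(suscept$pos.test))   # counts of controls (0) and cases (1)
```

Taking the maximum per participant means a single positive test is enough to make someone a case, regardless of how many negative tests they also had.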
Based on the World Health Organization (WHO) ordinal scale for clinical improvement, we classify severity into four levels. These levels are defined as 1) hospitalisation: individuals admitted to hospital with their primary diagnosis recorded as COVID-19; 2) critical care level 2: individuals who required basic treatment in a critical care unit, such as non-invasive ventilation and continuous positive airway pressure, with their primary diagnosis recorded as COVID-19; 3) critical care level 3: individuals who required advanced treatment in a critical care unit, such as invasive ventilation and temporary tracheostomy, with their primary diagnosis recorded as COVID-19; 4) mortality: individuals who died due to COVID-19. The critical care information was summarised from the HESIN_CRITICAL table and the HESIN_OPER table. Critical care level 2 cases are COVID-19 patients who have at least one day recorded under “Critical care level 2 days” in the HESIN_CRITICAL table or who received basic respiratory support, such as E85.2 non-invasive ventilation NEC, in the HESIN_OPER table. Critical care level 3 cases are COVID-19 patients who have at least one day recorded under “Critical care level 3 days” in the HESIN_CRITICAL table or who received advanced respiratory support, such as E85.1 invasive ventilation, in the HESIN_OPER table. Commonly used GWAS tools, such as SAIGE and PLINK, do not support ordinal categorical phenotypes. Therefore, we converted this ordinal variable into four binary variables named “hospitalisation”, “critical care”, “advanced critical care” and “mortality” (Table 2). However, users can recover the ordinal variable by simply summing the four binary variables. We assume that participants who tested positive for COVID-19 but were not admitted to hospital had no or mild symptoms, and hence classify them as controls in the severity phenotypes. We compare the test results data and the hospital inpatient data and correct any inconsistency between the two tables.
As an example of data inconsistency, up to 5 February 2021, 130 individuals were admitted to the hospital due to COVID-19 but are not recorded in the test result data, while 33 individuals were admitted to the hospital due to COVID-19 but received negative COVID-19 test results. This inconsistency is resolved by retaining all 163 individuals and setting their COVID-19 test results as positive. The pheno.type = “severity” option combines COVID-19 test results data and hospital inpatient data and generates three phenotypes, one for each severity level.
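The relationship between the ordinal severity scale and the four binary variables can be illustrated as follows. The sketch assumes a cumulative coding, in which each binary variable flags everyone at or above the corresponding level; that assumption is what makes the sum equal the ordinal level, and the package's actual coding may differ. The variable names are illustrative.

```r
# Ordinal severity for five toy participants (assumed coding):
# 0 = positive, not hospitalised; 1 = hospitalised; 2 = critical care level 2;
# 3 = critical care level 3; 4 = died.
severity <- c(0, 1, 2, 3, 4)

# Cumulative binary coding: each variable flags "this level or worse".
binary <- data.frame(
  hospitalisation        = as.integer(severity >= 1),
  critical.care          = as.integer(severity >= 2),
  advanced.critical.care = as.integer(severity >= 3),
  mortality              = as.integer(severity >= 4)
)

# Summing the four binary variables recovers the ordinal level.
recovered <- rowSums(binary)
print(cbind(severity, binary, recovered))
```

Under a non-cumulative coding (for example, a death without a critical care stay flagged only as hospitalisation and mortality) the sum would no longer match the ordinal level, so the coding should be checked before relying on this shortcut.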
For mortality, we include all individuals who received at least one positive test result and define those whose primary cause of death is recorded as being due to COVID-19 as cases. We also compare the test results data and the death register data and correct any inconsistencies. As an example, up to 23 March 2021, 205 individuals died from COVID-19 as reported by the death register data but are not recorded as having positive COVID-19 tests in the test result data while 39 individuals died from COVID-19 but received negative COVID-19 test results. The inconsistency is resolved by retaining all 244 individuals and setting their test results as positive. Therefore, in total 1,042 UKBB participants had died from COVID-19 by 23 March 2021. The pheno.type = “mortality” option combines the COVID-19 test results data and death register data and generates a mortality phenotype.
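The reconciliation rule described above (a COVID-19 death overrides a missing or negative test record) can be sketched on toy data. The table and column names (`tests`, `deaths`, `pos.test`, `covid.death`) are hypothetical; the sketch shows only the logic, not the package's code.

```r
# Toy per-participant test summary and death register extract.
tests  <- data.frame(eid = c(1, 2, 3), pos.test = c(1, 0, 0))
deaths <- data.frame(eid = c(2, 4), covid.death = c(1, 1))

# Full outer join keeps participants who appear in only one table.
merged <- merge(tests, deaths, by = "eid", all = TRUE)
merged$covid.death[is.na(merged$covid.death)] <- 0

# Anyone who died from COVID-19 is treated as test positive,
# whether their test record was negative (eid 2) or missing (eid 4).
merged$pos.test[merged$covid.death == 1] <- 1

# Mortality phenotype: restricted to participants with a positive status.
mortality <- merged[!is.na(merged$pos.test) & merged$pos.test == 1, ]
print(mortality)
```

In the figures quoted above, the same rule accounts for the 205 unrecorded and 39 negative-test individuals, giving the 244 retained participants.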
The makePhenotypes function returns results in data.frame format and outputs files in text format for the downstream association tests and genome-wide association tests using PLINK (RRID:SCR_001757) (Purcell et al., 2007) and SAIGE (Scalable and Accurate Implementation of GEneralized mixed model) (Zhou et al., 2018).
The risk_factor function generates formatted variables for several non-genetic risk factors from the linked health data provided by UKBB. These variables are all established risk factors for SARS-CoV-2 exposure, and/or COVID-19 severity (Pijls et al., 2021; Wolff et al., 2021; Booth et al., 2021). The currently selected risk factors are listed in Table 3. The multi-category variables are converted into multiple dummy variables. For the blood type group factor, three dummy variables encoding the blood types A, AB, and O, are added to the data to compare with blood type B (baseline). For the ethnic background factor, Black, Asian, Mixed, and other ethnic backgrounds (BAME) are added to the data to permit comparison to white Europeans (baseline).
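The dummy coding with blood type B as the baseline can be reproduced with base R's `model.matrix`. This is a minimal sketch on a toy vector; the object names are illustrative and the package may construct its variables differently.

```r
# Toy blood types; listing "B" first makes it the reference level.
blood <- factor(c("A", "B", "AB", "O", "O", "A"),
                levels = c("B", "A", "AB", "O"))

# Treatment contrasts give one 0/1 column per non-baseline level;
# dropping the first column removes the intercept.
dummies <- model.matrix(~ blood)[, -1]
colnames(dummies) <- sub("^blood", "", colnames(dummies))

print(dummies)   # participants with type B have zeros in all three columns
```

Because B is the baseline, each fitted coefficient for A, AB or O is interpreted as a log odds ratio relative to type B.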
| Risk-factor variable | Description | 
|---|---|
| sex | Participant sex. Binary variable 1 = male 0 = female | 
| age | Age of participant (at 2020 birthday). Numeric | 
| bmi | Body mass index. Numeric Where multiple longitudinal BMI measurements are available, the most recently recorded value is used. | 
| ethnic | Self-reported “ethnic group”. Categorical 1 = White, 1001 = British, 1002 = Irish, 1003 = Any other white background. 2 = Mixed, 2001 = White and Black Caribbean, 2002 = White and Black African, 2003 = White and Asian, 2004 = Any other mixed background. 3 = Asian or Asian British, 3001 = Indian, 3002 = Pakistani, 3003 = Bangladeshi, 3004 = Any other Asian background. 5 = Chinese. 4 = Black or Black British, 4001 = Caribbean, 4002 = African, 4003 = Any other Black background. 6 = Other ethnic group. -1 = Do not know. -3 = Prefer not to answer. | 
| other.ppl | Participant self-reports as “Other ethnic group”. Binary variable 1 = Yes 0 = No | 
| black | Participant self-reports as “Black or Black British”. Binary variable 1 = Yes 0 = No | 
| asian | Participant self-reports as “Asian or Asian British”. Binary variable 1 = Yes 0 = No | 
| mixed | Participant self-reports as “Mixed”. Binary variable 1 = Yes 0 = No | 
| white | Participant self-reports as “White”. Binary variable 1 = Yes 0 = No | 
| SES | Socioeconomic status (SES) using a Townsend deprivation index (Black 1988). Numeric For the population of a given area, a Townsend deprivation score is the summation of Z scores of four variables: unemployment, non-car ownership, non-home ownership and household overcrowding. A greater Townsend index score implies a greater degree of deprivation. Z scores = (percentage – mean of all percentages)/SD of all percentages. | 
| smoke | Pack-years of smoking. Numeric Where multiple longitudinal pack-years measurements are available, the most recently recorded value is used. Number of cigarettes per day/20 * (Age stopped smoking - Age start smoking) Note: Individuals who started and gave up smoking before 16 years of age were coded as NA. For individuals who started smoking before 16 but gave up after 16, their age start was set as 16. Individuals who reported starting and stopping smoking at the same age and reported giving up smoking for more than 6 months had pack-years set at 0. | 
| blood group | Participant blood type. Categorical Participants' blood groups were extracted from imputed genotyped data (Field 23165), which was added in July 2020 as a result of the suggestion that blood group may affect COVID-19 outcomes. Blood groups: AA, AB, AO, BB, BO, OO. | 
| O | Participant has O-type blood. Binary variable 1 = Yes 0 = No | 
| AB | Participant has AB-type blood. Binary variable 1 = Yes 0 = No | 
| B | Participant has B-type blood. Binary variable 1 = Yes 0 = No | 
| A | Participant has A-type blood. Binary variable 1 = Yes 0 = No | 
| inAgedCare | Evidence that the participant resides in an Aged Care facility. Binary variable. 1 = Evidence of residing in aged care, based on HES data (admitted from, or discharged to, a nursing, residential care, group home), or from the COVID-19 test data (requesting organisation). 0 = Any individual not having evidence for residing in aged care, as defined above. | 
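The pack-years rule given for the smoke variable in Table 3 can be written as a small function. This is a sketch under stated assumptions: the function name `pack_years` and its argument names are hypothetical, and "gave up after 16" is interpreted here as a stop age of 16 or older.

```r
# Pack-years = cigarettes per day / 20 * (age stopped - age started),
# with the age-16 adjustments described in Table 3.
pack_years <- function(cigs.per.day, age.start, age.stop) {
  ifelse(
    age.start < 16 & age.stop < 16,
    NA_real_,                                        # started and stopped before 16
    cigs.per.day / 20 * (age.stop - pmax(age.start, 16))  # start age floored at 16
  )
}

# Three toy participants: a standard case, an early starter, an excluded case.
res <- pack_years(cigs.per.day = c(20, 10, 20),
                  age.start    = c(18, 14, 13),
                  age.stop     = c(38, 26, 15))
print(res)
```

The first participant smoked one pack a day for 20 years (20 pack-years); the second has their start age raised from 14 to 16, giving 0.5 packs a day over 10 years (5 pack-years); the third is coded as missing.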
Simple associations between COVID-19 phenotypes and these common risk factors may be examined using the log_cov function, which performs a logistic regression model and formats the results for quick interpretation.
The comorbidity_summary function summarises disease history records of each individual from the hospital inpatient diagnosis data. To meet different research aims, the function allows restriction to a period and filtering of annotations by only primary diagnoses or all diagnoses (using the "Date.start", "Date.end" and "primary" arguments, respectively). For illustration, if we are interested in the co-occurrences of COVID-19, we can set the episode start date as 16 March 2020 (“Date.start = 16/03/2020”), when the first COVID-19 test result was recorded, and choose to use all diagnoses (“primary = FALSE”). If we are interested in whether individuals with reported comorbidities are at higher risk from SARS-CoV-2, we can choose an episode end date before the COVID-19 outbreak in the UK, for example, “Date.end = 01/01/2020”, and only focus on the primary diagnoses (“primary = TRUE”). Comorbidity categories are generated using the block categories in the ICD10 code, which are shown in the second column of Table 4. We include ICD10 chapters 1-14 and 17 and exclude several chapters such as pregnancy, childbirth, and consequences of external causes etc. For instance, the first category is “A00-A09”, representing intestinal infectious diseases. During a period restricted by the start and end dates, cases are defined as any participants who were diagnosed with any subclass under the block A00‐A09 in the hospital inpatient diagnosis data. In this way, 164 binary variables are generated and each of them represents a comorbidity category. The R function generates a text file including all comorbidity categories, which can be used in the comorbidity association tests.
Comorbidity categories are generated using the block categories in the ICD10 code, as shown in the second column. We only included the blocks in chapter 1-14 and 17 and excluded several chapters such as pregnancy, childbirth and consequences of external causes etc.
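The construction of a block-level comorbidity variable can be sketched as below. The toy table `hesin.diag`, its column names and the helper `in_block` are hypothetical, and the sketch assumes ICD-10 codes are stored without a decimal point (for example "A049"); it illustrates the date window and block matching only.

```r
# Toy diagnosis records: one row per diagnosis (hypothetical columns).
hesin.diag <- data.frame(
  eid      = c(1, 1, 2, 3),
  icd10    = c("A049", "J181", "A090", "A021"),
  epistart = as.Date(c("2019-05-01", "2020-04-02", "2020-06-15", "2021-01-10"))
)

# TRUE when the three-character ICD-10 category lies within the block limits.
in_block <- function(code, from, to) {
  cat3 <- substr(code, 1, 3)
  cat3 >= from & cat3 <= to
}

# Restrict to episodes starting inside the chosen period.
in.window <- hesin.diag$epistart >= as.Date("2020-03-16") &
             hesin.diag$epistart <= as.Date("2020-12-31")

# Participants with at least one A00-A09 diagnosis in the window are cases.
hits    <- hesin.diag$eid[in.window & in_block(hesin.diag$icd10, "A00", "A09")]
A00_A09 <- as.integer(unique(hesin.diag$eid) %in% hits)
print(data.frame(eid = unique(hesin.diag$eid), A00_A09))
```

Participant 1 has an A04 diagnosis outside the window and participant 3 has an A02 diagnosis after it, so only participant 2 is flagged. Repeating the same step over every block yields one binary column per comorbidity category.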
The comorbidity_asso function performs association tests between each comorbidity category and the selected phenotype using logistic regression models and adjusts the tested phenotype with covariates, which can be set using the argument “cov.name”. By default, the covariates include sex, age, and BMI. Different ethnic backgrounds can be chosen for the test by setting the argument “population”. By default, all populations are included. It outputs a table comprised of odds ratios (ORs), confidence intervals (CIs) of ORs, and p-values for all the comorbidity categories.
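A loop of this kind can be sketched with `glm` from the stats package on simulated data. Everything here is illustrative: the data are random, the helper `assoc_one` is hypothetical, and Wald confidence intervals are assumed because the text does not state which interval the package reports.

```r
set.seed(1)
n <- 500
# Simulated phenotype, covariates and two toy comorbidity categories.
dat <- data.frame(
  pheno   = rbinom(n, 1, 0.3),
  sex     = rbinom(n, 1, 0.5),
  age     = round(runif(n, 50, 80)),
  bmi     = rnorm(n, 27, 4),
  A00_A09 = rbinom(n, 1, 0.2),
  J09_J18 = rbinom(n, 1, 0.2)
)

# Fit pheno ~ category + covariates and return OR, 95% Wald CI and p-value.
assoc_one <- function(category, data, covars = c("sex", "age", "bmi")) {
  fit <- glm(reformulate(c(category, covars), response = "pheno"),
             data = data, family = binomial)
  est <- coef(summary(fit))[category, ]
  z   <- qnorm(0.975)
  data.frame(
    category = category,
    OR       = exp(est[["Estimate"]]),
    CI.lower = exp(est[["Estimate"]] - z * est[["Std. Error"]]),
    CI.upper = exp(est[["Estimate"]] + z * est[["Std. Error"]]),
    p        = est[["Pr(>|z|)"]]
  )
}

results <- do.call(rbind, lapply(c("A00_A09", "J09_J18"), assoc_one, data = dat))
print(results)
```

Passing a different `covars` vector plays the role that the “cov.name” argument plays in the description above.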
The UKB.COVID19 package provides several functions to facilitate GWAS or other genetic analyses using the UKBB data. Two functions, sampleQC and variantQC, allow easy cleaning of the genetic data using quality control (QC) metrics supplied by UKBB (Bycroft et al., 2018). A third function, makeGWASFiles, outputs phenotype files, which may be used as input for the GWAS software packages PLINK (Purcell et al., 2007) and SAIGE (Zhou et al., 2018).
The sampleQC function outputs a csv file summarising sample-level QC metrics, as well as producing lists of IDs for inclusion and/or exclusion in downstream analyses. The function identifies individuals to be excluded from genetic analyses based on: 1) being excluded by UKBB, before imputation due to high heterozygosity or missingness (>5%), 2) sex mismatches between genetically predicted and recorded sex, 3) an apparent excess number of relatives in the UKBB cohort (≥ 10 relatives), 4) putative sex chromosome aneuploidy, 5) withdrawn consent. The user has the option of further restricting to individuals of “White British” ancestry (determined using genetic principal components), by using the ancestry argument. Finally, the user can specify whether they require inclusion/exclusion sample lists to be formatted for PLINK or SAIGE.
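The exclusion logic can be sketched on a toy QC table. The column names and the output file name are hypothetical and do not correspond to UKBB field names; the sketch only shows how the five criteria combine and how a PLINK-style inclusion list (family ID and individual ID per line) could be written.

```r
# Toy sample-level QC flags (1 = flagged), hypothetical column names.
qc <- data.frame(
  eid              = 1:6,
  het.missing.out  = c(0, 1, 0, 0, 0, 0),
  submitted.sex    = c("M", "F", "F", "M", "F", "M"),
  inferred.sex     = c("M", "F", "M", "M", "F", "M"),
  excess.relatives = c(0, 0, 0, 1, 0, 0),
  sex.aneuploidy   = c(0, 0, 0, 0, 1, 0),
  withdrawn        = c(0, 0, 0, 0, 0, 0)
)

# A sample is excluded if it fails any one of the five criteria.
exclude <- with(qc, het.missing.out == 1 | submitted.sex != inferred.sex |
                    excess.relatives == 1 | sex.aneuploidy == 1 | withdrawn == 1)

# PLINK --keep / --remove files take FID and IID in the first two columns.
keep <- data.frame(FID = qc$eid[!exclude], IID = qc$eid[!exclude])
out.file <- file.path(tempdir(), "sampleInclude_plink.txt")
write.table(keep, out.file, quote = FALSE, row.names = FALSE, col.names = FALSE)
print(keep)
```

Here samples 2 to 5 each fail a different criterion, leaving samples 1 and 6 in the inclusion list.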
The variantQC function identifies variants to be included in downstream analyses, based on minor allele frequency (MAF) and imputation quality (INFO score), with thresholds specified by the user (defaults to MAF ≥0.001 and INFO ≥0.5). The function outputs lists of variants passing these thresholds in two formats, given the two types of SNP IDs available in the UKBB imputed genetic data release: 1) snpIncludeSNPIDs_minMaf0.001_minInfo0.5.txt contains the unique SNP identifiers; 2) snpIncludeRSIDs_minMaf0.001_minInfo0.5.txt contains the rsid or the reference panel marker ID (note these IDs are not guaranteed to be unique). The function also outputs a file containing IDs of the subset of SNPs used by UKBB for calculating ancestry principal components (Bycroft et al., 2018). This subset of SNPs is suitable for analyses where a pruned set of independent SNPs is preferred, for example for calculation of a genetic relatedness matrix (GRM).
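The thresholding step amounts to a simple filter, sketched here on a toy variant table. The table `mfi`, its columns, the variant IDs and the helper `filter_variants` are all hypothetical; only the default thresholds and the output file naming pattern are taken from the text.

```r
# Toy variant-level table with made-up identifiers.
mfi <- data.frame(
  snpid = c("1:1000_A_G", "1:2000_C_T", "1:3000_G_A"),
  rsid  = c("rs1", "rs2", "rs3"),
  maf   = c(0.2, 0.0005, 0.05),
  info  = c(0.98, 0.9, 0.3)
)

# Keep variants meeting both thresholds (defaults as stated in the text).
filter_variants <- function(x, min.maf = 0.001, min.info = 0.5) {
  x[x$maf >= min.maf & x$info >= min.info, ]
}

pass <- filter_variants(mfi)

# Build an output name that records the thresholds used.
out.name <- sprintf("snpIncludeSNPIDs_minMaf%s_minInfo%s.txt", 0.001, 0.5)
print(pass)
print(out.name)
```

The second variant fails on MAF and the third on INFO, so only the first passes. Encoding the thresholds in the file name keeps outputs from different QC settings from overwriting each other.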
The makeGWASFiles function generates a phenotype file suitable for association analyses with either SAIGE or PLINK (Purcell et al., 2007), with the file format specified by the user. The function uses the phenotype data frame generated by the makePhenotypes function, and the user can select specific phenotypes. The output phenotype file also contains the first 20 ancestry principal components and the genotyping array, as these are likely to be required as covariates in any genetic analyses. The user can also specify additional covariates (e.g. those generated by the risk_factor function) to be written to the phenotype file. Finally, the user can choose to output phenotypes only for the individuals passing all QC (using the output file from the sampleQC function), or for all individuals.
We performed QC for the genotype data from UKBB using the sampleQC function, with the ancestry = “WhiteBritish” option, and the variantQC function, with thresholds MAF = 0.01 and INFO = 0.8. Phenotype files for SAIGE were generated using the makeGWASFiles function, containing all variables generated by the risk_factor function.
Using the output files from the sampleQC and variantQC functions, we filtered the directly genotyped data using PLINK (Purcell et al., 2007), and the imputed data using QCTool version 2. We then performed GWAS of all COVID-19 phenotypes using SAIGE (Zhou et al., 2018). Firstly, the null model was fitted for each phenotype with 20 ancestry principal components (PCs), genotyping array, and associated non-genetic risk factors as covariates, and we used the pruned subset of SNPs to construct the GRM. Subsequently, genome-wide association testing was undertaken, using the filtered imputed data.
To assess the associations between non-genetic risk factors and COVID-19 phenotypes (including susceptibility, severity, and mortality), we employed multivariable logistic regression models using the ‘glm’ function from the R package stats. Each model adjusted for covariates such as age, sex, and BMI. The tested risk factors included socioeconomic status (SES), smoking status, blood type, ethnic background, and residence in aged care facilities. The logistic regression model for each risk factor was specified as follows: logit (COVID-19 phenotype) ~ risk factor + age + sex + BMI. Comorbidity associations were analysed using similar multivariable logistic regression models, with COVID-19 phenotypes modelled as: logit (COVID-19 phenotype) ~ comorbidity category + age + sex + BMI + SES + smoking status + aged care status. Odds ratios (ORs) with 95% confidence intervals (CIs) were reported, and p-values were calculated to determine the significance of the associations.
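For reference, the link between the fitted model and the reported odds ratios can be written out as below. The notation is introduced here for illustration: x denotes the tested risk factor and Y the binary COVID-19 phenotype. The interval shown is the standard Wald interval, which is an assumption, since the text does not state which type of confidence interval was computed.

```latex
\begin{align}
\operatorname{logit} P(Y = 1)
  &= \log\frac{P(Y = 1)}{1 - P(Y = 1)}
   = \beta_0 + \beta_1 x + \beta_2\,\text{age} + \beta_3\,\text{sex} + \beta_4\,\text{BMI}, \\
\widehat{\mathrm{OR}} &= \exp\!\big(\hat\beta_1\big), \\
95\%\ \mathrm{CI} &= \Big[\exp\!\big(\hat\beta_1 - 1.96\,\mathrm{SE}(\hat\beta_1)\big),\;
                          \exp\!\big(\hat\beta_1 + 1.96\,\mathrm{SE}(\hat\beta_1)\big)\Big].
\end{align}
```

An OR above 1 therefore corresponds to a positive coefficient, and for a numeric factor such as age or BMI the OR is per one-unit increase in that factor.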
To identify genetic variants associated with COVID-19 phenotypes, we performed GWASs using the SAIGE software. Principal component analysis (PCA) was performed to account for population stratification, and the first 20 principal components (PCs) were included as covariates in the analysis. Additionally, we adjusted for age, sex, BMI, SES, smoking status, residence in aged care facilities and genotyping array in the regression models. The association between each SNP and the phenotypes was tested using a logistic regression model, as follows: logit (COVID-19 phenotype) ~ SNP + age + sex + BMI + SES + smoking status + aged care status + genotyping array + PC1-20. To account for multiple testing, the Bonferroni correction was applied. Loci reaching the genome-wide significance threshold (p < 5×10−8) were considered significant. Manhattan plots and quantile-quantile (QQ) plots were generated to visualise the results using R package ggplot2. All analyses were carried out using R (version 4.0.5).
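The conventional genome-wide threshold can be read as a Bonferroni correction of a 0.05 significance level for roughly one million independent tests. The sketch below makes that arithmetic explicit; the figure of one million is the usual convention rather than a number stated in this article, and `toy.snp` is a made-up entry added for contrast with the three lead SNP p-values reported in the Results.

```r
alpha               <- 0.05
n.independent.tests <- 1e6     # conventional assumption, not from the article
gw.threshold        <- alpha / n.independent.tests   # 5e-08

# Lead SNP p-values as reported in the Results, plus one hypothetical SNP.
p.values <- c(rs2771616  = 3.36e-9,
              rs73062389 = 5.16e-9,
              rs73062394 = 6.68e-9,
              toy.snp    = 2e-6)

significant <- names(p.values)[p.values < gw.threshold]
print(gw.threshold)
print(significant)
```

The hypothetical SNP at 2e-6 would pass the suggestive threshold of 1 × 10−5 used in the Manhattan plot but not the genome-wide threshold.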
We applied the R package UKB.COVID19 to the data released in April 2021. The last records in the COVID-19 test results data, the death register data and the hospital inpatient data were recorded on 6 April 2021, 23 March 2021, and 5 February 2021, respectively. By default, the dates for susceptibility, severity and mortality studies were chosen as 6 April 2021, 5 February 2021, and 23 March 2021, accordingly.
By 6 April 2021, 77,222 UKBB participants had been tested for COVID-19. Among these individuals, 16,562 received at least one positive test result and 60,660 received only negative results. First, we tested the associations between a positive test result (as a proxy for COVID-19 susceptibility) and age, sex, and BMI using multivariable logistic regression. The results (Table 5) show increased odds of a positive result in individuals of male sex (OR = 1.08, 95% CI = [1.04,1.11], p-value = 0.00007), with higher BMI (OR = 1.026, 95% CI = [1.0229,1.03], p-value <10−5) and of younger age (OR = 0.939, 95% CI = [0.937,0.941], p-value <10−5). A possible reason for the age result is that older participants are less active and thus had less chance of being exposed to SARS-CoV-2.
Cases are defined as participants who received at least one COVID-19 positive test result. Controls are those who received only negative results. We tested sex, age and body mass index (BMI) in a multivariable model first and then tested each other factor individually by adjusting sex, age and BMI. SES stands for socioeconomic status. Odds ratio (OR) and p-values (P) are provided.
Second, we tested each potential risk factor individually, adjusting for age, sex, and BMI. Several publications have already reported that blood groups are associated with COVID-19 susceptibility (Zhao et al., 2020; Zietz, Zucker, and Tatonetti, 2020), including genetic associations with the ABO blood group locus at 9q34.2 (The Severe Covid-19 GWAS Group, 2020). People with blood type A have been consistently reported as being at higher risk of SARS-CoV-2 infection and people with blood type O at lower risk (Zhao et al., 2020). Consistent with these results, we find that, compared with type B, individuals with blood type O are less susceptible to SARS-CoV-2 (OR = 0.91, 95% CI = [0.86,0.97], p-value = 0.005), but we were unable to replicate the type A findings (p-value = 0.7).
Compared with white individuals, those who self-identified as Black (OR =1.38, 95% CI = [1.24,1.55], p-value <10−5), Asian (OR =1.88, 95% CI = [1.71,2.07], p-value <10−5) and other ethnic backgrounds (OR =1.33, 95% CI = [1.14,1.55], p-value =0.0004) have higher odds of testing positive for COVID-19. Individuals with a lower socioeconomic status (SES) are also at a higher risk of COVID-19 (OR = 1.041, 95% CI = [1.036,1.047], p-value <10−5). Smoking also contributes to COVID-19 susceptibility (OR =1.003, 95% CI = [1.002,1.004], p-value <10−5). People who are staying at an aged care home are at a significantly higher risk of COVID-19 (OR = 2.13, 95% CI = [1.87,2.43], p-value <10−5), which is in line with the aged care home outbreaks in the UK.
We only apply GWAS to the white British participants in the UKBB. Therefore, we repeated the non-genetic risk factor association tests for self-reported “white” participants only. The results show that age, sex, BMI, SES, smoking, and residence in an aged care home are associated with COVID-19 susceptibility in white British participants. Together with the two array effects and the first 20 principal components (PCs), these risk factors were used as covariates to adjust the susceptibility phenotype in the GWAS. The genome-wide significant COVID-19 susceptibility locus identified in our GWAS is 3p21.31 (Figure 1 and Table 6). The most statistically significant SNP is rs2771616 within the glycine transporter gene SLC6A20 (3p21.31, p-value = 3.36 × 10⁻⁹), followed by SNPs rs73062389 (3p21.31; SLC6A20; p-value = 5.16 × 10⁻⁹) and rs73062394 (3p21.31; SLC6A20; p-value = 6.68 × 10⁻⁹) in strong linkage disequilibrium (LD) (r² = 1 for both) (Table 7). SLC6A20 encodes an amino acid transporter that interacts with ACE2, the main receptor that SARS-CoV-2 uses to gain entry into host cells (Elhabyan et al., 2020; Hoffmann et al., 2020). This locus has also been identified by previous studies (The Severe Covid-19 GWAS Group “Genomewide Association Study of Severe Covid-19 with Respiratory Failure”, 2020), including meta-analyses that made use of the UKBB COVID-19 data (Host Genetics Initiative, 2021). All genome-wide significant GWAS hits with gene annotations are available in Table 7.

Sample size is 61,823. In the Manhattan plot, each point denotes a SNP located on a particular chromosome (x-axis). The significance level is presented on the y-axis. The red line indicates the threshold for genome-wide significance (5 × 10⁻⁸), while the blue line indicates the threshold for suggestive genome-wide significance (1 × 10⁻⁵). The light green dots are the genes of interest, which have been reported in other publications (Pairo-Castineira et al., 2021; “Genomewide Association Study of Severe Covid-19 with Respiratory Failure”, 2020), including SLC6A20, LZTFL1, CCR9, FYCO1, CXCR6, XCR1, HLA-G, CCHCR1, NOTCH4, ABO, OAS1, OAS2, OAS3, APOE, DPP9, TYK2, IFNAR2, TMPRSS2, ACE2, and TLR7. The susceptibility phenotype is adjusted for age, sex, body mass index, socioeconomic status, smoking, aged care home residence, array, and PC1–20. The genome-wide significant COVID-19 susceptibility locus identified is 3p21.31. The most statistically significant SNP is rs2771616 within the glycine transporter gene SLC6A20 (3p21.31, p-value = 3.36 × 10⁻⁹), followed by SNPs rs73062389 (3p21.31; SLC6A20; p-value = 5.16 × 10⁻⁹) and rs73062394 (3p21.31; SLC6A20; p-value = 6.68 × 10⁻⁹) in strong linkage disequilibrium (LD) (r² = 1 for both).
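The two significance thresholds in the caption can be applied to GWAS summary statistics as sketched below. The data frame, SNP labels and p-values are invented for illustration, and the column names are assumptions rather than the output format of any particular GWAS tool.

```r
# Toy summary statistics; SNP IDs and p-values are made up for illustration.
sumstats <- data.frame(
  SNP = c("snpA", "snpB", "snpC", "snpD"),
  CHR = c(3, 3, 9, 19),
  P   = c(3e-9, 2e-6, 0.04, 7e-8),
  stringsAsFactors = FALSE
)

gw_threshold  <- 5e-8   # genome-wide significance (red line)
sug_threshold <- 1e-5   # suggestive significance (blue line)

# Manhattan plots conventionally show -log10(p), so the two lines sit at
# about 7.3 and 5 on that scale.
sumstats$neglog10P <- -log10(sumstats$P)
sumstats$level <- ifelse(sumstats$P < gw_threshold, "genome-wide",
                  ifelse(sumstats$P < sug_threshold, "suggestive", "none"))
print(sumstats)
```

Note that a p-value of 7 × 10⁻⁸ falls just short of the genome-wide threshold and is therefore classed as suggestive only.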
The most genome-wide significant hits of COVID-19 susceptibility, hospitalisation and critical care genome-wide association studies.
The genome-wide significant hits of COVID-19 susceptibility, hospitalisation and critical care genome-wide association studies.
By 5 February 2021, 15,666 UKBB participants had received positive COVID-19 test results. Of these, 2,104 individuals had been admitted to hospital due to COVID-19, 1,129 of whom received critical care treatments and 1,010 received advanced critical care treatments. The risk factor association test results are presented in Tables 8 and 9 for all populations and self-reported white individuals, respectively. Compared to white individuals, Black, Asian, and other minority ethnic groups are at a higher risk of severe COVID-19. Age, sex, BMI, SES, and smoking are also positively associated with COVID-19 severity.
Cases of hospitalisation include participants who were admitted to hospital with a primary diagnosis of COVID-19, received critical care treatments, or died from COVID-19. Controls are the remaining participants who received positive test results. Cases of the critical care phenotype include those who received critical care treatments due to COVID-19 or died from COVID-19. Cases of advanced critical care are defined as participants who received advanced critical care treatments or died from COVID-19. We first tested sex, age and body mass index (BMI) in a multivariable model and then tested each remaining factor individually, adjusting for sex, age and BMI. SES stands for socioeconomic status. Odds ratios (OR) and p-values (P) are provided.
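The three severity case definitions can be written out as simple logical rules. The sketch below uses a hypothetical data frame of participants with a positive test; the indicator column names are assumptions for illustration and are not the columns produced by the package.

```r
# Hypothetical indicator columns for participants with a positive test;
# these names are illustrative and are not the package's output columns.
pos <- data.frame(
  ID                 = 1:6,
  hosp_primary_covid = c(1, 0, 0, 1, 0, 0),
  critical_care      = c(0, 0, 1, 1, 0, 0),
  adv_critical_care  = c(0, 0, 0, 1, 0, 0),
  died_covid         = c(0, 0, 0, 0, 1, 0)
)

# Each phenotype is 1 for cases and 0 for the remaining positive participants.
pos$hospitalisation <- as.integer(
  pos$hosp_primary_covid == 1 | pos$critical_care == 1 | pos$died_covid == 1)
pos$critical <- as.integer(pos$critical_care == 1 | pos$died_covid == 1)
pos$adv_critical <- as.integer(pos$adv_critical_care == 1 | pos$died_covid == 1)

print(pos[, c("ID", "hospitalisation", "critical", "adv_critical")])
```

Because death from COVID-19 counts as a case under all three definitions, participant 5 is a case for every phenotype despite having no hospital record in this toy example.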
The results from the GWAS are shown in the quantile-quantile (Q-Q) plots and Manhattan plots in Figures 2–4. The tested phenotypes are adjusted for age, sex, BMI, SES, smoking, aged care home residence, array, and PC1–20. The results show that the locus at 3p21.31 is genome-wide significantly associated with COVID-19 hospitalisation and critical care (Tables 6 and 7). Specifically, the most significant SNP for both the COVID-19 hospitalisation and critical care GWASs is located in the gene LZTFL1 (rs35044562 in locus 3p21.31; p-value = 1.55 × 10⁻¹⁰ and p-value = 2.23 × 10⁻⁹, respectively). According to the Genotype-Tissue Expression (GTEx) project, LZTFL1 is widely expressed throughout the body and encodes a protein involved in protein trafficking to primary cilia, which are microtubule-based subcellular organelles acting as antennas for extracellular signals. In T lymphocytes, LZTFL1 participates in the immunologic synapse with antigen-presenting cells, such as dendritic cells (these cells prime T-lymphocyte responses) (Kaser 2020; Seo et al., 2011; Jiang et al., 2016).
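For readers less familiar with Q-Q plots, the sketch below shows how the plotted coordinates and the commonly reported genomic inflation factor can be computed from a vector of GWAS p-values. It uses simulated null p-values and is a generic illustration, not the plotting code used for Figures 2–4.

```r
set.seed(42)
# Simulated p-values under the null hypothesis (uniform on [0, 1]).
p <- runif(10000)

# Genomic inflation factor: median observed chi-square statistic (1 df)
# divided by the expected median under the null.
chisq_obs <- qchisq(p, df = 1, lower.tail = FALSE)
lambda_gc <- median(chisq_obs) / qchisq(0.5, df = 1)

# Coordinates of a Q-Q plot on the -log10 scale: smallest observed p-value
# is paired with the smallest expected quantile.
observed <- -log10(sort(p))
expected <- -log10(ppoints(length(p)))

print(lambda_gc)
```

Calling `plot(expected, observed)` followed by `abline(0, 1)` would draw the plot. Points following the diagonal, with an inflation factor near 1, suggest that test statistics are not systematically inflated.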

Sample size is 11,974. In the Manhattan plot, each point denotes a SNP located on a particular chromosome (x-axis). The significance level is presented on the y-axis. The red line indicates the threshold for genome-wide significance (5 × 10⁻⁸), while the blue line indicates the threshold for suggestive genome-wide significance (1 × 10⁻⁵). The light green dots are the genes of interest, including SLC6A20, LZTFL1, CCR9, FYCO1, CXCR6, XCR1, HLA-G, CCHCR1, NOTCH4, ABO, OAS1, OAS2, OAS3, APOE, DPP9, TYK2, IFNAR2, TMPRSS2, ACE2, and TLR7. The hospitalisation phenotype is adjusted for age, sex, body mass index, socioeconomic status, smoking, aged care home residence, array, and PC1–20. The result shows that the locus at 3p21.31 is genome-wide significantly associated with COVID-19 hospitalisation. The most significant SNP for the COVID-19 hospitalisation GWAS is located in the gene LZTFL1 (rs35044562 in locus 3p21.31; p-value = 1.55 × 10⁻¹⁰).

Sample size is 11,974. In the Manhattan plot, each point denotes a SNP located on a particular chromosome (x-axis). The significance level is presented on the y-axis. The red line indicates the threshold for genome-wide significance (5 × 10⁻⁸), while the blue line indicates the threshold for suggestive genome-wide significance (1 × 10⁻⁵). The light green dots are the genes of interest, including SLC6A20, LZTFL1, CCR9, FYCO1, CXCR6, XCR1, HLA-G, CCHCR1, NOTCH4, ABO, OAS1, OAS2, OAS3, APOE, DPP9, TYK2, IFNAR2, TMPRSS2, ACE2, and TLR7. The critical care phenotype is adjusted for age, sex, body mass index, socioeconomic status, smoking, aged care home residence, array, and PC1–20. The result shows that the locus at 3p21.31 is genome-wide significantly associated with COVID-19 critical care. The most significant SNP for the COVID-19 critical care GWAS is located in the gene LZTFL1 (rs35044562 in locus 3p21.31; p-value = 2.23 × 10⁻⁹).

Sample size is 11,974. In the Manhattan plot, each point denotes a SNP located on a particular chromosome (x-axis). The significance level is presented on the y-axis. The red line indicates the threshold for genome-wide significance (5 × 10⁻⁸), while the blue line indicates the threshold for suggestive genome-wide significance (1 × 10⁻⁵). The light green dots are the genes of interest, including SLC6A20, LZTFL1, CCR9, FYCO1, CXCR6, XCR1, HLA-G, CCHCR1, NOTCH4, ABO, OAS1, OAS2, OAS3, APOE, DPP9, TYK2, IFNAR2, TMPRSS2, ACE2, and TLR7. The advanced critical care phenotype is adjusted for age, sex, body mass index, socioeconomic status, smoking, aged care home residence, array, and PC1–20. No genome-wide significant signals were found.
By 23 March 2021, 16,465 UKBB participants had received positive COVID-19 test results. Among these, 1,042 individuals died from COVID-19. We performed the same association tests for COVID-19 mortality as for susceptibility and severity. The results (Table 10) show that males have a much higher chance of dying from COVID-19 than females (OR = 1.89, 95% CI = [1.63, 2.20], p-value < 10⁻⁵), consistent with previously published results from independent cohorts (Peckham et al., 2020). The Black ethnic group is at a much higher mortality risk from SARS-CoV-2 compared to white individuals (OR = 2.04, 95% CI = [1.38, 2.94], p-value = 0.0002). Age, BMI, SES, and smoking are positively associated with COVID-19 mortality. People living in aged care homes are at a much higher risk of dying from COVID-19. For self-reported white individuals, age, sex, BMI, SES, smoking, and being in an aged care home are positively associated with COVID-19 mortality. Therefore, all these covariates were used to adjust the mortality phenotype for GWAS. However, no genome-wide significant signal was detected for this GWAS (Figure 5).
Cases of mortality include participants whose primary cause of death is COVID-19. Controls are the remaining participants who received positive test results. We first tested sex, age and body mass index (BMI) in a multivariable model and then tested each remaining factor individually, adjusting for sex, age and BMI. SES stands for socioeconomic status. Odds ratios (OR) and p-values (P) are provided.

Sample size is 12,790. In the Manhattan plot, each point denotes a SNP located on a particular chromosome (x-axis). The significance level is presented on the y-axis. The red line indicates the threshold for genome-wide significance (5 × 10⁻⁸), while the blue line indicates the threshold for suggestive genome-wide significance (1 × 10⁻⁵). The light green dots are the genes of interest, including SLC6A20, LZTFL1, CCR9, FYCO1, CXCR6, XCR1, HLA-G, CCHCR1, NOTCH4, ABO, OAS1, OAS2, OAS3, APOE, DPP9, TYK2, IFNAR2, TMPRSS2, ACE2, and TLR7. The mortality phenotype is adjusted for age, sex, body mass index, socioeconomic status, smoking, aged care home residence, array, and PC1–20. No genome-wide significant signals were found.
We were interested in the co-occurrence of COVID-19 and comorbidities in individuals who had suffered from severe COVID-19. Therefore, we divided the hospital inpatient diagnosis records into before and after the COVID-19 pandemic using the date 16 March 2020, when COVID-19 testing commenced in the UK. We performed association testing for each comorbidity using logistic regression models and adjusted COVID-19 severity (whether the patient received critical care treatments) for sex, age, BMI, SES, smoking and aged care status. Tables 11 and 12 list the ten diseases most strongly associated with severe COVID-19 before and after 16 March 2020, respectively. From Table 12, we found that the conditions that commonly co-occur with COVID-19 are pneumonia, respiratory diseases, renal failure, metabolic disorders, hypertensive diseases, heart disease and other bacterial diseases. People who have ever had mental disorders, influenza and pneumonia, renal failure, respiratory diseases, bacterial, viral, or other infections, malignant neoplasms of lymphoid, haematopoietic and related tissue, or other blood diseases, tend to have severe symptoms after being infected by SARS-CoV-2.
We divided the hospital inpatient diagnosis records into before and after the COVID-19 pandemic using the date 16 March 2020, when COVID-19 testing commenced. We performed association testing for each comorbidity using logistic regression models and adjusted COVID-19 severity (if the patient received critical care treatments) by sex, age, body mass index, socioeconomic status, smoking and aged care status. To show the comorbidities in individuals who had suffered from severe COVID-19, we ranked the p-values before 16 March 2020 and listed the top 10 comorbidities.
We divided the hospital inpatient diagnosis records into before and after the COVID-19 pandemic using the date 16 March 2020, when COVID-19 testing commenced. We performed association testing for each comorbidity using logistic regression models and adjusted COVID-19 severity (whether the patient received critical care treatments) for sex, age, body mass index, socioeconomic status, smoking and aged care status. To show the conditions that most commonly co-occur with COVID-19, we ranked the p-values after 16 March 2020 and listed the top 10 comorbidities.
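The split of inpatient records around the cut-off date can be sketched as follows. The records, column names and date format here are hypothetical, and treating a record dated exactly 16 March 2020 as "after" is an assumption of this example.

```r
# Hypothetical inpatient diagnosis records; column names are illustrative.
records <- data.frame(
  ID       = c(1, 1, 2, 3, 3),
  icd10    = c("J18", "N17", "I10", "J18", "E11"),
  epistart = c("02/01/2019", "20/04/2020", "16/03/2020",
               "15/03/2020", "01/12/2020"),
  stringsAsFactors = FALSE
)

# Parse day/month/year strings and compare against the cut-off date.
cutoff <- as.Date("16/03/2020", format = "%d/%m/%Y")
records$date <- as.Date(records$epistart, format = "%d/%m/%Y")
records$period <- ifelse(records$date < cutoff, "before", "after")

# One row per participant and period, listing the recorded diagnoses.
by_period <- aggregate(icd10 ~ ID + period, data = records,
                       FUN = function(x) paste(sort(unique(x)), collapse = ";"))
print(by_period)
```

Parsing the dates explicitly with `format = "%d/%m/%Y"` avoids the day and month being swapped, which matters for a cut-off in mid-March.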
Several publications have reported that the APOE e4 genotype is associated with COVID-19 susceptibility and severity (Numbers and Brodaty 2021; Kuo et al., 2020a, 2020b). APOE e4 is a known risk factor for dementia, which has been replicated many times (Liu et al., 2013; Safieh, Korczyn, and Michaelson 2019; Emrani et al., 2020). One explanation for people with APOE e4 being at higher risk of COVID-19 could be a higher risk of exposure, as these individuals are more likely to reside in care homes, which have suffered from high rates of infection. This is particularly likely to be the case in UKBB, where 47% of participants are older than 70 years. To test this hypothesis, we performed GWAS tests with and without adjustment for aged care status. The APOE e4 signal was genome-wide significant without adjustment for aged care status but disappeared after this adjustment (Figure 6), suggesting that this finding is not robust and may be due to ascertainment bias.

a. COVID-19 susceptibility GWAS without care home status covariate adjustment. The model we used is: susceptibility ~ age + sex + BMI + PC1-20 + array + SNP. b. COVID-19 susceptibility GWAS with care home status covariate adjustment. The model we used is: susceptibility ~ age + sex + BMI + PC1-20 + array + inAgedCare + SNP. The APOE e4 signal was genome-wide significant without adjustment for aged care status but disappeared after this adjustment, suggesting that this finding is not robust and may be due to ascertainment bias.
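The effect of adding the care home covariate can be illustrated for a single variant with ordinary logistic regression. The sketch below simulates a variant that influences the outcome only through care home residence; the effect sizes are invented, and BMI, array and PC1-20 are omitted for brevity, so this is a simplified stand-in for models a and b rather than the SAIGE analysis itself.

```r
set.seed(2021)
n <- 20000
# Simulated data: the variant raises the chance of living in aged care,
# and aged care raises the chance of a positive test. The variant has no
# direct effect on the outcome.
snp <- rbinom(n, 2, 0.15)                          # allele count 0/1/2
inAgedCare <- rbinom(n, 1, plogis(-3 + 1.5 * snp))
susceptibility <- rbinom(n, 1, plogis(-2 + 2 * inAgedCare))
age <- round(runif(n, 50, 85))
sex <- rbinom(n, 1, 0.45)

# Model a: no care home covariate.
fit_a <- glm(susceptibility ~ age + sex + snp, family = binomial)
# Model b: same model with the care home covariate added.
fit_b <- glm(susceptibility ~ age + sex + inAgedCare + snp, family = binomial)

snp_a <- coef(summary(fit_a))["snp", c("Estimate", "Pr(>|z|)")]
snp_b <- coef(summary(fit_b))["snp", c("Estimate", "Pr(>|z|)")]
print(rbind(without_care_home = snp_a, with_care_home = snp_b))
```

In this simulation the variant's estimated effect shrinks towards zero once care home status is in the model, which is the pattern reported for the APOE e4 signal.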
To demonstrate the functionality and utility of UKB.COVID19, we present a basic tutorial. Because use of UKBB data is restricted, we illustrate the use cases with simulated data. An example SAIGE GWAS script can be found on GitHub: https://github.com/bahlolab/UKB.COVID19/tree/main/inst/GWAS.
Generating a covariate file. The risk_factor function in UKB.COVID19 can be used to generate a covariate file with established risk factors and risk factors of interest by specifying the field code in UKBB main data.
library(UKB.COVID19)
covar <- risk_factor(ukb.data=covid_example("sim_ukb.tab.gz"),
                     ABO.data=covid_example("sim_covid19_misc.txt.gz"),
                     hesin.file=covid_example("sim_hesin.txt.gz"),
                     res.eng=covid_example("sim_result_england.txt.gz"))
head(covar)
#> ID sex age bmi ethnic other.ppl black asian mixed white SES smoke blood_group O AB B A inAgedCare
#> 1 1 1 74 39.0947 1001 0 0 0 0 1 5.43719 0.000 AO 0 0 0 1 0
#> 2 2 1 58 25.3177 1001 0 0 0 0 1 2.10787 0.000 AO 0 0 0 1 0
#> 3 3 0 51 32.2349 1002 0 0 0 0 1 7.36321 25.625 AO 0 0 0 1 0
#> 4 4 0 56 21.7955 1001 0 0 0 0 1 5.62047 0.000 AO 0 0 0 1 0
#> 6 6 1 67 25.9823 1001 0 0 0 0 1 3.90245 0.000 OO 1 0 0 0 0
Generating a COVID-19 susceptibility phenotype file with risk factors. In the output file, columns “pos.neg” and “pos.ppl” are the susceptibility phenotypes, which denote 1) UKBB participants with positive versus negative COVID-19 test results and 2) participants with positive results versus all other participants.
phe <- makePhenotypes(ukb.data=covid_example("sim_ukb.tab.gz"),
                      res.eng=covid_example("sim_result_england.txt.gz"),
                      death.file=covid_example("sim_death.txt.gz"),
                      death.cause.file=covid_example("sim_death_cause.txt.gz"),
                      hesin.file=covid_example("sim_hesin.txt.gz"),
                      hesin_diag.file=covid_example("sim_hesin_diag.txt.gz"),
                      hesin_oper.file=covid_example("sim_hesin_oper.txt.gz"),
                      hesin_critical.file=covid_example("sim_hesin_critical.txt.gz"),
                      code.file=covid_example("coding240.txt.gz"),
                      pheno.type = "susceptibility")
#> [1] "965 participants got tested until 2021-04-05."
#> [1] "218 participants got positive test results until 2021-04-05."
#> [1] "There are 21 deaths with COVID-19. 20 of them primary death cause is COVID-19."
#> [1] "50 patients admitted to hospital were diagnosed as COVID-19 until 2021-04-05."
#> [1] "32 patients' primary diagnosis is COVID-19."
#> [1] "1 patients in hospitalisation with COVID-19 diagnosis but show negative in the result file. Modified their test results."
#> [1] "There are 219 COVID-19 patients identified. 32 individuals are admitted to hospital. 3 had been in ICU. 1 had been in advanced ICU."
#> [1] "Outputting file: ~/UKB.COVID19/extdata/results/phenotype.txt"
head (phe)
#> ID pos.neg pos.ppl
#> 1 1 1 1
#> 2 2 0 0
#> 3 3 0 0
#> 4 4 0 0
#> 5 5 0 0
#> 6 6 0 0
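The difference between the two susceptibility phenotypes can be made concrete with a small base R sketch. The test records below are invented, and the handling of untested participants (missing in pos.neg, counted as 0 in pos.ppl) is our reading of the definitions above rather than a description of the package internals.

```r
# Hypothetical test records (one row per test) and participant list.
tests <- data.frame(ID = c(1, 1, 2, 3, 3), result = c(0, 1, 0, 0, 0))
participants <- data.frame(ID = 1:5)

# Highest result per tested participant: 1 if any test was positive.
any_pos <- aggregate(result ~ ID, data = tests, FUN = max)

phe_sketch <- merge(participants, any_pos, by = "ID", all.x = TRUE)
# pos.neg: positive vs tested-negative; untested participants stay NA.
phe_sketch$pos.neg <- phe_sketch$result
# pos.ppl: positive vs everyone else, so untested participants count as 0.
phe_sketch$pos.ppl <- ifelse(is.na(phe_sketch$result), 0, phe_sketch$result)
phe_sketch$result <- NULL
print(phe_sketch)
```

Participants 4 and 5 were never tested in this toy example, so they are excluded from a pos.neg analysis but serve as controls in a pos.ppl analysis.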
Performing association tests. The log_cov function performs association tests using logistic regressions. This is an example of association tests between COVID-19 susceptibility and three risk factors: sex, age and BMI.
log_cov(pheno=phe, covariates=covar, phe.name="pos.neg", cov.name=c("sex", "age", "bmi"))
#> Estimate OR 2.5 % 97.5 % p
#> (Intercept) -0.16475743 0.8480994 0.1954585 3.6381032 0.824991899
#> sex1 0.04207813 1.0429760 0.7644672 1.4215535 0.790121307
#> age -0.03080456 0.9696651 0.9519878 0.9876397 0.001009957
#> bmi 0.03625193 1.0369170 1.0076088 1.0667564 0.012568486
Generating a comorbidity summary file. The comorbidity_summary function scans all the hospitalisation records within a given time period and generates a text file. The following example generates a comorbidity summary file that includes all the primary and secondary diagnoses in the hospital inpatient data after 16 March 2020.
comorb <- comorbidity_summary(ukb.data=covid_example("sim_ukb.tab.gz"),
                              hesin.file=covid_example("sim_hesin.txt.gz"),
                              hesin_diag.file=covid_example("sim_hesin_diag.txt.gz"),
                              ICD10.file=covid_example("ICD10.coding19.txt.gz"),
                              primary = FALSE,
                              Date.start = "16/03/2020")
comorb[1:6,1:10]
#> ID A00-A09 A15-A19 A20-A28 A30-A49 A50-A64 A65-A69 A70-A74 A75-A79 A80-A89
#> 1 1 1 0 0 1 0 0 0 0 0
#> 2 10 0 0 0 0 0 0 0 0 0
#> 3 100 0 0 0 0 0 0 0 0 0
#> 4 1000 0 0 0 0 0 0 0 0 0
#> 5 101 0 0 0 0 0 0 0 0 0
#> 6 102 0 0 0 0 0 0 0 0 0
Performing association tests between a COVID-19 phenotype and comorbidities. This is an example of association tests between COVID-19 susceptibility and all comorbidities. The output contains NAs for comorbidities where fitted probabilities numerically equal to 0 or 1 occurred in the logistic regression model.
comorb.asso <- comorbidity_asso(pheno=phe,
                                covariates=covar,
                                cormorbidity=comorb,
                                population="white",
                                cov.name=c("sex","age","bmi","SES","smoke","inAgedCare"),
                                phe.name="pos.neg",
                                ICD10.file=covid_example("ICD10.coding19.txt.gz"))
head(comorb.asso, 4)
#> ICD10 Estimate OR 2.5% 97.5% p
#> A00-A09 A00-A09 Intestinal infectious diseases 0.4722864 1.603657 0.756784 3.240022 0.199664372
#> A15-A19 A15-A19 Tuberculosis NA NA NA NA NA
#> A20-A28 A20-A28 Certain zoonotic bacterial diseases NA NA NA NA NA
#> A30-A49 A30-A49 Other bacterial diseases 1.2246077 3.402831 1.633209 6.978689 0.000873076
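One way such NA rows can arise is sketched below in base R: a model is fitted per comorbidity, and NA is reported whenever the fit warns, fails, or cannot estimate the comorbidity coefficient. This is a generic pattern on simulated data with hypothetical column names; it is not taken from the comorbidity_asso source code.

```r
set.seed(7)
n <- 400
# Simulated phenotype, covariates and two comorbidity indicators.
d <- data.frame(
  pheno     = rbinom(n, 1, 0.3),
  age       = round(runif(n, 50, 85)),
  sex       = rbinom(n, 1, 0.5),
  common_dx = rbinom(n, 1, 0.2),   # enough carriers to estimate an effect
  rare_dx   = 0                    # no carriers at all
)

# Fit one model per comorbidity; return NA if glm warns or fails, or if the
# comorbidity coefficient cannot be estimated.
test_one <- function(dx) {
  f <- reformulate(c("age", "sex", dx), response = "pheno")
  tryCatch({
    fit <- glm(f, data = d, family = binomial)
    est <- coef(summary(fit))
    if (dx %in% rownames(est)) {
      c(Estimate = est[dx, "Estimate"], p = est[dx, "Pr(>|z|)"])
    } else {
      c(Estimate = NA, p = NA)
    }
  }, warning = function(w) c(Estimate = NA, p = NA),
     error   = function(e) c(Estimate = NA, p = NA))
}

res <- t(sapply(c("common_dx", "rare_dx"), test_one))
print(res)
```

Rare diagnosis groups, such as those with no or very few carriers in the analysed subset, are the ones most likely to end up as NA rows.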
We developed an R package that can reproducibly analyse COVID-19 traits and produce input files for GWAS, using the UKBB resource.
The R package can be easily applied to the frequently updated UKBB COVID-19 datasets, facilitating rapid analyses. By applying the R package to data released in April 2021, we found that age, BMI, SES and smoking are positively associated with COVID-19 susceptibility, severity and mortality. Males are at a higher risk of COVID-19 infection than females. People residing in aged care homes were also at higher risk, potentially because they have other pre-existing conditions, and may also have a higher chance of exposure to SARS-CoV-2. By performing GWAS, we replicated previous findings (Pairo-Castineira et al., 2021; Zeberg and Pääbo, 2020; “Genomewide Association Study of Severe Covid-19 with Respiratory Failure”, 2020; Host Genetics Initiative, 2021) that the locus 3p21.31 is associated with COVID-19 susceptibility and severity.
The COVID-19 Host Genetics Initiative brings together the human genetics community to generate, share, and analyse data to learn the genetic determinants of COVID-19 susceptibility, severity, and related outcomes. They have been performing large-scale meta-analyses using existing biobanks, including UKBB, and periodically provide updated releases of their results, making available genome-wide summary statistics, and providing an online browser for exploring the latest results (https://app.covid19hg.org/). We primarily advocate the use of these resources for exploring genetic associations with COVID-19 susceptibility and severity. However, we anticipate our R package will enable researchers to undertake more bespoke genetic analyses, using the most up to date UKBB COVID-19 data, to meet the aim of their studies. Such analyses may include adjusting for non-genetic risk factors or comorbidities, to explore mediators, polygenic risk score analyses, or Mendelian Randomisation studies.
Long COVID, also known as post-acute sequelae of SARS-CoV-2 infection, refers to a range of symptoms that persist for weeks or months after the acute phase of COVID-19 has resolved. These symptoms can include fatigue, shortness of breath, cognitive dysfunction, and various other systemic issues, significantly impacting the quality of life of affected individuals. The UKB.COVID19 package provides multiple functions to facilitate long COVID analysis. For instance, the ‘comorbidity_summary’ and ‘comorbidity_asso’ functions can be used to summarise potential long COVID symptoms and assess their associations with risk factors, such as age, sex and certain pre-existing conditions. Furthermore, researchers can focus on subsets of participants reporting persistent symptoms consistent with long COVID to investigate genetic risk factors using GWAS. These analyses hold promise for uncovering the biological underpinnings of long COVID and identifying potential therapeutic targets to alleviate its impact.
There are several limitations of UKBB COVID-19 data. First, UKBB is not a nationally or worldwide representative sample. The majority of participants are of white British ethnicity. UKBB participants were more likely to be older, to be female, and to live in less socioeconomically deprived areas than nonparticipants. Compared with the general population, participants were less likely to be obese, to smoke, and to drink alcohol daily and had fewer self-reported health conditions (Fry et al., 2017). Initiatives such as OpenSafely (Williamson et al., 2020), have aimed to examine risk factors for COVID-19 disease in an unascertained UK population, via electronic health records. These data, however, are not presently available for use by the wider research community, due to the possibility of re-identification of individuals. The recent OpenSafely flagship paper examined health records of over 17 million individuals in England, of whom 10,926 had a COVID-19 related death, and found that male sex, greater age and deprivation, and non-white ethnicities were major clinical risk factors for mortality. Despite the ascertainment of the UKBB, it is reassuring that these established risk factors are also associated with COVID-19 outcomes in this cohort.
Second, the UKBB COVID-19 dataset evolved as testing scaled up in line with the national testing strategy and thus COVID-19 data is also subject to ascertainment bias. UK testing was initially largely restricted to healthcare workers and individuals with symptoms in hospitals. A positive result in an individual not recorded as a healthcare worker was therefore a reasonable proxy for severe disease early on in the pandemic. Testing capacity subsequently increased to include more community testing under pillar 2 of the national strategy, and as of 27 April 2020, NHS England directed hospitals to test all non-elective patients admitted overnight, including asymptomatic patients. To maximise ascertainment of cases and to evaluate disease severity, SARS-CoV-2 testing data should be used in combination with linked medical records (i.e. hospital inpatient records and death records), as we have implemented in this package. More recently, UKBB has made primary care records available for COVID-19 research. These data, not yet utilised by the UKB.COVID19 package, will further improve case identification. Nonetheless, there are likely to be many individuals in the UKBB who contracted COVID-19, in particular those with milder disease, who will not be captured by the available data.
Ideally, COVID-19 susceptibility would be defined by whether or not a person becomes infected after exposure to SARS-CoV-2. However, exposure to SARS-CoV-2 is not easy to determine. Furthermore, not everyone has an equal chance of being exposed to SARS-CoV-2 (for example, exposure will vary by occupation), nor does everyone have the same likelihood of being tested, due to testing strategies, as noted above. Such data idiosyncrasies have the potential to distort associations in observational studies, and also in genetic analyses through population stratification. This issue of ascertainment, or collider bias, in the context of COVID-19, is discussed at length by Griffith et al. (2020). Analyses using the UKBB data should therefore be undertaken and interpreted within the context of changing testing capacity, and other limitations regarding phenotype definitions.
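The collider bias described above can be demonstrated with a short simulation. In the sketch below, a risk factor and infection are independent in the population, but both make testing more likely; all variable names and effect sizes are invented for illustration.

```r
set.seed(123)
n <- 50000
# Simulated population: the risk factor and infection are independent.
factor_x <- rbinom(n, 1, 0.5)
infected <- rbinom(n, 1, 0.2)
# Both increase the chance of being tested, so testing is a collider.
tested <- rbinom(n, 1, plogis(-3 + 2 * factor_x + 2 * infected))

# Log odds ratio in the whole population (close to zero by construction).
full <- glm(infected ~ factor_x, family = binomial)
# Log odds ratio among tested participants only (biased away from zero).
sel <- glm(infected ~ factor_x, family = binomial, subset = tested == 1)

print(c(full = unname(coef(full)["factor_x"]),
        tested_only = unname(coef(sel)["factor_x"])))
```

Restricting the analysis to tested participants induces a spurious negative association here, even though the risk factor has no effect on infection in the simulated population.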
We welcome further suggestions and improvements for this R package, which we hope will reduce the barrier to utilising the UKBB data for COVID-19 research.
All the datasets were obtained from UKBB.
To access the UKBB datasets, you need to register as a UKBB researcher (https://www.ukbiobank.ac.uk/enable-your-research/register). If you are already an approved UKBB researcher with a project underway and wish to receive these datasets for COVID-19 research purposes, you can register to receive these data by logging into the Access Management System (AMS) (https://bbams.ndph.ox.ac.uk/ams/resApplications).
How to apply for access to UKBB data: https://www.ukbiobank.ac.uk/enable-your-research/apply-for-access. See COVID-19 data (https://biobank.ndph.ox.ac.uk/showcase/exinfo.cgi?src=COVID19) for registration and access details and Resource 1758 (https://biobank.ndph.ox.ac.uk/showcase/refer.cgi?id=1758) for further information.
All genome wide significant GWAS hits with gene annotations are shown in Table 7.
UKB.COVID19 can be installed via CRAN using install.packages("UKB.COVID19").
UKB.COVID19 is maintained at https://github.com/bahlolab/UKB.COVID19.
Latest UKB.COVID19 source code is available from: https://github.com/bahlolab/UKB.COVID19.
Archived source code at the time of publication: http://doi.org/10.5281/zenodo.5174381 (Wang et al., 2021).
License: MIT (https://opensource.org/licenses/MIT).
This research was conducted using data from UK Biobank (www.ukbiobank.ac.uk), a major biomedical database.
Version history: Version 1, 19 Aug 21; Version 2 (revision), 18 May 22; Version 3 (revision), 26 Jul 24.