Keywords
PostGWAS, pGWAS, GWAS, Meta-analysis
PostGWAS, pGWAS, GWAS, Meta-analysis
Genome-wide association studies (GWAS) have been used to identify genetic variants associated with specific traits or diseases of interest. One of the advantages of performing GWAS is that they do not require prior knowledge of the biological hypothesis underpinning the genetic machinery of the investigated trait. Many GWAS have revealed hundreds of common variants that are associated with various phenotypes, including common diseases.1 Edwards et al reported the first GWAS for age-related macular degeneration in 2005.2,3 In the last decade, many GWAS have been reported in scientific databases3 and these studies have revealed the presence of various single nucleotide polymorphisms (SNPs). For instance, GWAS Catalog reported 4865 publications and 247051 associations in 2021. These reported SNPs could be used to better understand the molecular mechanisms of common diseases and biological pathways of interesting traits.
Over the last decade, most GWAS have been known to detect the genetic signals of a single gene or a single genetic marker, i.e., interpreting genes based on detected signals (positions) on genomic coordinates using a specified Genome Assembly.4 However, most complex diseases that have been targeted by GWAS are known to be caused by multiple genes that could be influenced by many other factors. It is known that GWAS typically report SNPs as statistically significant when their associated p-values are less than 5e-08. Accordingly, GWAS might not detect genetic variants with low or moderate risk.5,6 Thus, the ability of GWAS to detect a significant genomic position depends on heritability on phenotype, minor allele frequency (MAF), and sample size. Traditional GWAS could therefore be faced with the challenge of not being able to detect variants that are associated with low disease risk, implying that traditional GWAS results are prone to unreliable findings due to their false negative results.7 Furthermore, GWAS might fail to detect a significant signal if the effect of a variant in another gene is not taken into consideration.5 It is also limited in identifying changes in genotype as a response to environmental changes, i.e., detecting the impact of genotype interaction with the environment.8 These limitations of GWAS occur primarily because of the undetected effect of gene polymorphism. More specific challenges of GWAS have been reported in many scientific papers.1,5,7,9-12 Given all of the reported limitations of GWAS, it is crucial to perform post-GWAS (pGWAS) analysis. The overall goal of pGWAS analysis is to use the result of the association between the genotype and phenotype (summary statistics), with the following objectives:
• Transferability of previous result,
• Identification of new significant functional variants, i.e. lead SNPs,
• Identification of novel disease susceptibility genes, genotype-phenotype associations, and biological pathway network, and
• Building a polygenic risk score using summary statistics.
So far, the most common approaches to perform pGWAS analysis include the following three approaches: (i) Single-variant approach; (ii) Gene-scoring approach; and (iii) Pathway-sub-network-based approach. However, pGWAS approaches could be categorized into many classes based on their usage (see section below). Also, there are several methods/tools for pGWAS reported in different articles.
In this article, we provide an overview to performing pGWAS analysis and discuss the challenges behind each method. Furthermore, we direct readers to key articles for each pGWAS method and present the overall issues in pGWAS analysis. Finally, we include a custom pGWAS pipeline to guide new users when performing their research.
GWAS have become an indispensable approach in providing insight into understanding genotype-phenotype associations of complex disease. While the design for GWAS experiments are well established as bench-work, the computational methods for the analysis of GWAS data are still evolving. The typical computational pipeline to analyse any GWAS data consists of two essential tasks: a) upstream GWAS analysis (classic GWAS methods), and b) downstream GWAS analysis. The latter step is known as pGWAS analysis.
The upstream GWAS analysis is a multi-step task resulting in a list of statistically significant SNPs. This step often starts by checking the quality control of raw GWAS data which is a crucial step for the task of pGWAS analysis. It is obvious that unclean data lead to unreliable results. Therefore, many parameters such as quality control per sample, relatedness, replicate discordance, SNP quality control, sex inconsistencies, and chromosomal anomalies should be checked.13,14 After obtaining raw data-sets with high quality scores, the next step is to report the statistically significant SNPs. Many statistical tests are available for this, including Chi-square test, Fisher’s exact test, Cochran-Armitage trend test, Odds ratio, Logistic regression, ANOVA, Transmission Disequilibrium test, Bonferroni correction, and many other methods.
Many free tools are available to perform GWAS upstream data analysis such as: Plink,15 PLATO,16 EIGENSOFT,17 and STRUCTURE.18 Several other tools are also available as packages within R software, the most popular open-source software for statistical computing.19
Genome wide significant threshold. This approach provides the statistics at a SNP level. This approach aims to score the association between SNPs and the target trait. Usually, the score defined by a value or Odd Ratio (OR), and its standard error (SE) gives an idea of the genotype’s effect on the phenotype. Effect strength is computed using OR. Also, scientists use p-value as a measure of how likely this effect is to occur by chance. Statistically, many researchers use p-value to evaluate the hypothesis that there is no statistical evidence for SNP-trait associations. However, this hypothesis will be rejected when p-value is less than a predetermined threshold that is adjusted for a multi-test. Many methods are used for multiple test p-value correction. These methods include Bonferroni correction and false discovery rate (FDR).20 However, some of these methods do not consider linkage disequilibrium (LD) and they may be too stringent. In general, a threshold of is acceptable for human association studies.21,22 This value has been computed using independent signals in genomes. However, this value is not absolute and it depends on many factors, including LD (and diversity), array type, whole genome sequencing or whole exome sequencing, position number in array or sequencing, imputation panels, and positions finally analyzed.23 For trans-ethnic populations, it is recommended to estimate the threshold based on population diversity and LD. For instance, analyzing 1000 Genomes data, the suggested significance thresholds were for Africa, for Europe, for Mixed America, for East Asia and for South Asia.24 Although, a recent study using an African population proposed a threshold of .25
Suggestive threshold. Some authors consider GWA threshold very stringent, so studies often include a “suggestive threshold”. This is superior to a genome-wide significant threshold, and values found in the literature are , or lower. Studies estimated on “independant LD block” from 1000 Genomes used for the Affymetrix and Illumina GWAS SNP panels, and for HapMap CEPH Utah and Yoruba populations.26 The code below contains an R command to select significant SNPs using a cutoff value of .
## library need : data.table #open files with fread function data.gwas<-fread("result.gwas") # head to obtain the header of gwas file head(data.gwas) # selected lines using threshold value of 5E-8 data.gwas[data.gwas$p.value<5*10**-8,] # obtained information about min pvalue data.gwas[which.min (data.gwas$p.value),]
Inflation factors. Population and cryptic relatedness can cause spurious associations in GWAS with p-values higher than random leading to false positive signals. Genomic control (GC) approach is extensively used to effectively control false positive signals. Genomic inflation factors () can be computed as the median of the resulting chi-squared () test statistics divided by the expected median of the chi-squared distribution.27 Refer to equation 1 below
Z (), and p-value can be used to compute inflation factor, using for , quantile of 1 - p-value at 1 degrees of freedom. The code below demonstrates how to compute inflation factors using the build-in R function that can be used to calculate value of quantile for a distribution.
# compute inflation factors ## use quantile function of chisq to p.value data.gwas$p.value.qchisq <- qchisq(data.gwas$p.value, 1, lower.tail=FALSE) ## computed lambda median(data, na.rm=TRUE)/qchisq(0.5, df)
Global visualisation of results. A common way to visualize GWAS results is the Manhattan plot. The Manhattan plot demonstrates the physical location of SNPs distributed by chromosome in the x-axis with the degree to which a SNP is associated with the target trait, i.e. p-value scaled by log10 in the y-axis. A Manhattan plot can be done in R software using the qqman package, which includes functions for creating Manhattan plots and q-q plots from GWAS results.28 The code below demonstrates how to create Manhattan plots, and q-q plots the qqman R package.
## do a qqplot and Manhattan plot using library qqman library('qqman') ## used function qqplot from qqman qqplot(data.gwas$p.value) ## used function manhattan from qqman manhattan(data.gwas, chr="chr", bp="bp", p="p.value",snp="rsid")
Local visualisation of results. Local visualisation can be done using the method described by Pruim et al.29 In this method, information of LD, genes, and previous results of GWAS can be added. Zoom version 2 offers a virtual analysis of local GWAS results. On the other hand, LocusTrack from the UCSC genome-browser adds more annotation compared to LocusZoom.30 Furthermore, BigTop31 is capable of providing three-dimensional visualisation of data using allele frequency as a third dimension. The code below demonstrates how to use LocusZoom to visualize SNPs considering LD information.
## use R to reformat your file to be used by locuszoom R -e "library(data.table); data.gwas<-fread('result.gwas'); data.gwas<-data.gwas[,c('chr','bp','bp','rsid','af','p.value')]; names(data.gwas)<-c('#CHROM', 'BEGIN', 'END', 'MARKER_ID', 'MAF', 'PVALUE'); write.table (data.gwas, file='gwas.epacts',row.names=F, col.names=T, sep='\t', quote=F)" ## use locus zoom to do a plot, locuszoom/bin/locuszoom --epacts gwas.epacts \ --delim tab --refsnp rssnp \ --flank 10000 --pop EUR \ --build hg19 --source 1000G_Nov2014\ -p rs7412 --no-date \
This approach considers the association between a trait and all SNPs within a predefined window around genes rather than each marker individually. In many cases, this approach is more powerful than traditional individual-SNP-based GWAS.
A gene score (GS) is a value given to a gene representing some measures related to a genetic trait. Thus, all the statistical summary values such as p-values and fold changes could be considered gene scoring values. For pGWAS analysis, GS is defined as the sum of all statistically significant alleles (i.e. the risk alleles) of the selected SNPs present in each individual under investigation.32 Various algorithms have been developed to calculate GCs based on GWAS summary statistics.33-35 It is a common approach to encode and adjust SNPs values prior to calculating GC during analysis.32 The process of SNP encoding aims to ensure that all SNPs are positively correlated with the outcome.32 One more essential aspect to consider while calculating GS is the effect size. Variation in the effect size reflects reduction of predictive power of GS.32 The GS method mostly used in pGWAS analysis is the gene-based p-value.
Gene level p-value. In many pGWAS pipelines, the first step in downstream GWAS analysis is to assign the SNPs to functional genomic features. The latter includes: coding genes, non-coding RNAs, 5’UTR, 3’UTR, proximal promoters, regulatory element, and enhancer elements.
Two common methods for assigning SNPs to their corresponding genes are GLOSSI36 and VEGAS.34,37 GLOSSI is available as an R package, while VEGAS is available as an online tool as well as a stand-alone tool to be run on a local machine. Besides VEGAS and GLOSSI, many pGWAS tools provide procedures to calculate genes’ p-values. For instance, ancGWAS provides four different methods to calculate the genes p-values: Simes, Smallest, Fisher, and Gwbon. Similar to ancGWAS, MAGMA provides three methods to calculate genes’ p-values. However, one of the MAGMA methods is similar to the VEGAS method. The other two methods are by considering either the smallest SNP p-value, or the highest SNP p-values.
This approach considers the fact that complex biological phenomena addressed by GWAS, including the molecular basis of the rare disease, often arise due to gene interaction rather than single gene effect. Therefore, scientists analyse GWAS based on the biological network theory to understand the disease-causing genes and mechanisms involved in traits and complex diseases, such as in rare diseases. This approach is based on the results obtained from the gene-based association test and provides a higher level of complexity by considering biological networks and/or genes ontology. Analysing GWAS based on biological networks allows us to capture biological interactions between various molecules such as proteins, functional DNA motifs, coding and non-coding RNA, as well as disease mechanisms and to consider epigenetic changes, including methylation states or other modifications (phosphorylation, acetylation, etc). Also, this approach aims to map genes that are associated with significant SNPs into known pathways/Gene Ontology terms. The result of this approach provides information about the over-represented pathways in a given set of genes/SNPs.
GWAS will often identify a number of SNPs in LD with each other as being associated with the phenotype. The lead SNPs are those with the most significant p-value – they may be causal or not. The other SNPs in this region may only have an association because they are in LD with the causal SNP or they may be independently associated. This GWAS limitation may be observed when integrating information of variability in allele frequency, SNPs in LD block and imputation.38,39 For instance, when simulating GWAS results using OR value of and allele frequency of , only % of the simulations demonstrated that the identified causal variants were not the most strongly associated variants.40 On the other hand, changing the previous parameters OR value of and allele frequency of %, resulted in % concordance between causal variants and lead SNPs.40 Therefore, changing the previous parameters OR value of and allele frequency of %, resulted in % concordance between causal variants and lead SNPs.40
Therefore, fine-mapping methods are used to identify causal variants that are associated with the target trait and the number of putative causal variants from GWAS data. The fine-mapping approach integrates summary statistics from GWAS data, LD and functional annotations. There are two fine-mapping methods: (i) heuristic method that penalizes regression models, and (ii) Bayesian fine-mapping methods.38,39
Lead SNPs. Lead SNPs are defined as independent SNPs that have reached a minimum p-value threshold, i.e. they are independent of each other at the LD threshold. It is common to measure LD as where the square of the correlation coefficient between any two indicator variables is .41 PLINK has implemented a function called’ld-clump’ that clumps independent SNPs.15 Moreover, FUMA tool42 identifies lead SNPs by double clumping method. The first clumping is used for clumping SNPs with p-value at genome-wide significant p-value, i.e. p-value and independent at . This first clumping function reports significant independent SNPs. The second clumping is of significant independent SNPs at 0.1 and reports lead SNPs. The code below demonstrates how to use the Plink tool to perform clumping to get lead SNPs.
## define lead snps using plink # --clump-p1 Significance threshold for index SNPs # --clump-p2 Secondary significance threshold for clumped SNPs # --clump-r2 LD threshold for clumping # --clump-kb Physical distance threshold for clumping # --clump-snp-field your snp field must be same than in your assoc file # --clump-field: p value header plink --bfile plinkbase --clump result.gwas --clump-snp-field rsid\ --clump-field p.value --clump-p1 0.001 --clump-p2 0.1 \ --clump-r2 0.1 --clump-kb 250 --out assoc
Heuristic fine-mapping approaches. These are used to identify potential causal SNPs.38 They are proposed to filter SNPs around lead SNPs considering the value of their pairwise correlation . They consider a hierarchical clustering technique to cluster all SNPs in a given region based on their pairwise . Another way is to use LD block by direct extraction of block and the selection of the same position on the block or by the visualisation representation of LD using LocusZoom29 or Haploview.43 Nonetheless, the described method is not a statistical approach to define putative causal variants, such as Bayesian methods or regression models. The code below demonstrates how to combine Plink tool and Haploview to perform heuristic fine-mapping.
### use haploview to plot your data #### haploview has also an web interface plink --keep list.ind -bfile $baseplink --recode tab --out fileres \ --chr chr --from-kb begin --to-kb end --maf 0.01 \ # execute Haploview in the previous range java -jar Haploview.jar -n -minMAF $cut_maf -missingCutoff 0.01 \ -pedfile fileres.ped -map fileres.map -png -blockoutput
Bayesian methods: framework. The Bayesian method is commonly used to identify causal variants in a predefined SNPs window containing number of SNPs. Knowing data () the Bayesian method aimed to maximize statistical model () using the following conditional probability
Users first define an initial number of causal variant (between SNPs). This number is defined using genome-wide significant SNPs. To model Bayesian statistics, software will define a model that contains SNPs. This model often will restrain choice to the significant SNPs or suggest significant SNPs () in the windows. Considering the rule of combination , the probability model is relatively easier to compute. The following Bayesian rules can be used
Most of the tools are aimed to find a maximum probability to have the best combination of the causal variants. To model Bayesian statistics, many tools integrate various data such as LD and GWAS summary statistics.
Bayesian methods: posterior inclusion probability. This approach is used to compute post inclusion probability (PIP) at each SNP . The PIP is computed by the sum of the posteriors over all models that include SNP as a causal variant.38
Using the rank of PIP is an excellent way to select putative causal variants.38 The PIP approach should be used with caution to identify causal variants in the high LD regions. Therefore, it is recommended to estimate the posterior expected number of causal SNPs by summing the estimated PIPs for all SNPs in the region.
Bayesian methods: credible sets. This approach is used to define a set of variants that could have a good candidate. One way to estimate a credible set is to use PIP by ranking the values and doing the cumulative sum of PIP from the largest to the smallest. Then select all variants where the sum is less than the predefined cutoff value. In general, researchers use 99% or 95% as a recommended value for . Studies have shown that credible set and coverage probabilities are over-conservative in most fine-mapping situations as data sets are not randomly selected from among all causal variants. Therefore, an adjusted coverage is proposed to reduce such over-conservation in this approach.44 The code below demonstrates how to combine Plink tool and FINEMAP to perform Bayesian-based fine-mapping analysis.
# extract a range of interest plink -bfile $plk --keep-allele-order --chr chr --from-kb begin --to-kb end --make-bed -out plink_range #compute LD plink --r2 square0 yes-really -bfile plink_range -out ld_range # format ld sed 's/\t/ /g' tmp.ld > $outld ## Finemaping #create a config file echo "z;ld;snp;config;cred;log;n_samples" \ > fileconfig.finemap echo "$filez;$ld;${out}.snp;${out}.config;${out}.cred;${out}.log ;${params.n_pop}" >> $fileconfig FINEMAP --cond --in-files fileconfig --log \ --cond-pvalue 0.0000001 --n-causal-snps $ncausalsnp ### caviarbf caviarbf -z ${filez} -r $ld -t 0 -a ${params.caviarbf_avalue} \ -c $ncausalsnp -o ${output} -n ${params.n_pop} nb=`cat ${filez}|wc -l ` modelsearch -i $output -p 0 -o $output -m \$nb
Bayesian methods: trans ethnic fine-mapping. This is used to perform trans-ethnic fine-mapping studies using simulation results in similar fine-mapping resolution among the European and Asian ancestries. However, the inclusion of samples with African ancestry in meta-analysis leads to a significant improvement in fine-mapping resolution due to the lowest LD in the African ancestry population. The probability that the lead GWAS variants were also the causal variants increased using trans-ethnic GWAS data.40,45 It is a process that relies on disparate LD patterns in populations of diverse genetic ancestries to localize the causal variants. This approach has been successfully implemented to fine-map and leads to several common GWAS findings.46
Integrating annotation into fine-mapping. Functional annotations have been shown to improve the discovery power and fine-mapping accuracy. Therefore, some tools such as CausalDB,47 PAINTOR48 and BIMBAM49 have integrated expression quantitative trait locus (eQTL) information in fine-mapping approach.
Bayesian methods: software. Several Bayesian-based fine-mapping tools have been developed using summary statistics, LD and eQTL (Table 1). Also, some pipelines have integrated different Bayesian-based software in order to compare their results (fine-mapping of h3agwas, fine-mapping in FinnGen, and FM pipeline).
More descriptives can be found in.38
Software | Input | Output | Reference |
---|---|---|---|
FINEMAP | sumstat: beta, se, LD, n causal | PIP and Bayesian | Web Page50 |
CaviarBF | sumstat: value, LD, eQTL, fixed causal | PIP and Bayesian | Web Page47 |
PAINTOR | sumstat: value, LD, eQTL, fixed causal, multi LD | PIP and Bayesian | Web Page48 |
CAVIAR - eCAVIAR | sumstat, LD, eQTL fixed causal | probability and confidence set | Web Page51,52 |
Other fine-mapping approaches. Other approaches, such as regression models, are used with all SNPs in the lead SNPs region to analyze SNPs jointly. The comparison of various approaches including elastic net, ridge, Lasso, MCP, and the normal-exponential shrinkage prior, have shown that penalized methods outperform single marker analysis.53 Furthermore, a forward/stepwise regression can be used to test the independence of multi SNPs using the following algorithm:
• Order the list of SNPs by their p-values
• Remove highest p-value
• Apply models with with as a co-variate and check if any is significant
• Repeat process.
The R library SusieR has implemented a method of regression fine-mapping analyses.
Other resources for fine-mapping analyses. Several review articles have been published for fine-mapping analyses. Nevertheless, we recommend the following scientific articles as a good source for beginners: “From genome-wide associations to candidate causal variants by statistical fine-mapping”,38 “A practical view of fine-mapping and gene prioritization in the post-genome-wide association era”,39 and “Fine-mapping genetic associations”.54
The aim of performing conditional association and imputation using summary statistics is to evaluate the association between SNPs and biological trait by combining various GWAS summaries from different studies. This method requires a reference population to estimate LD information. The imputation method performs meta-analysis to infer the missing genotypes among the studies before evaluating the association between the SNPs and the biological trait. This method estimates the effects of many variants that are not directly genotyped.
This method is used to predict disease risk using GWAS summaries.55 Polygenic risk score (PRS) could be used to predict an individual’s likelihood to develop a specific trait or to estimate the level of predictive power that the trait is associated with a particular set of variants.56 Although PRS methods are classified into Bayesian-based methods and non-Bayesian methods, there are more classifications of underlying PRS methods.57 PRS is calculated by aggregating effects from a large set of causal SNPs. Several tools were developed to calculate PRS. For performing PRS studies, we highly recommend this recent review paper.58 PRS is usually computed after the challenges associated with GWAS are carefully addressed. Here, we demonstrate key quality control (QC) measures and include sample bash scripts. Table 2 summarizes the seven QC measures and contains the guidelines on specific thresholds. Thresholds can differ depending on the study’s unique features.
GWAS effect allele. As datasets for PRS come from different GWAS experiments, it is critical to ensure consistency. Knowing which allele is considered the effect allele, it is vital to get an accurate PRS score. However, the effect allele is not labeled clearly in many datasets.59 Different allele coding schemes exist, including Illumina’s TOP/BOTTOM coding concept, ALT/REF, effect/other, HapMap’s forward allele coding, Illumina’s A/B allele coding, Affymetrix’s A/B allele coding, REF (reference)/ALT (alternative), PLINK’s 1/2 allele coding, A1 (allele1)/A2 (allele2), A0 (allele 0)/A1 (allele1), effect allele/non-effect allele, effect allele/other allele, and many others.58,59 To avoid allele inconsistency, researchers should carefully read the documentation of GWAS datasets.
SNPs level errors. The QC assessment at the SNP level is crucial to avoid misleading PRS. SNPs level errors include (i) mismatching SNPs, i.e., inconsistent SNPs due to position difference in genomic position or nucleotide type, (ii) existence of duplicate SNP, (iii) ambiguous SNPs, i.e., researchers have no idea about SNP strand (ambiguous SNPs usually are C/G or A/T SNPs), and (iv) missing alleles.
Chip heritability. The chip heritability ()is also known as SNP-based heritability, which is defined as the portion of the phenotypic variation that the genotyped genetic marker can explain.60,61 Higher values of heritability indicate that the phenotype is explained best by the genotype, i.e. set of SNPs. Choi et al.58 recommend to perform PRS analysis. To estimate , researchers should use LD score regression that could be used to distinguish polygenicity (SNPs effects) and confounding biases, including cryptic relatedness and population stratification.62
The code below demonstrates how to use Plink to perform quality control checks and calculating PRS.
# $bfile: Target data set in Plink binary format. # $QC: Base data set i.e., base GWAS summary statistic. This file contains P-value information # $p1: P-value threshold for a SNP to be included as an index SNP. Choose value of 1 if you want to ##include all SNPs are include for clumping. # $r2: Cutoff for r2 value i.e., SNPs having value higher than given r2 will be removed # $kb: cutoff value window size in kilobase, i.e., SNPs within $kb of the index SNP are considered for clumping. # $SNP: Name the column SNP that containing the SNP IDs # $P: Name of the column that containing the P-value information # $Output: Output file ################### Quality Control of Target Samples plink\ --bfile $bfile \ --maf 0.05 \ --mind 0.1 \ --geno 0.1 \ --hwe 1e-6 \ --make-just-bim \ --make-just-fam \ --out $Output.qc ################### Clumping plink \ --bfile $bfile \ --clump-p1 $p1 \ --clump-r2 $r2 \ --clump-kb $kb \ --clump $QC \ --clump-snp-field $SNP \ --clump-field $P \ --out $Output ############### calculating PRS # $listpvalue: Threshold values # $valid.snp: valid SNPs # Here we are assuming that 3 column for SNP ID; 4 for effective allele information; ### the 12 for effect size estimate. #### Moreover, this file has a header. plink \ --bfile $bfile \ --score $QC 3 4 12 header \ --q-score-range $listpvalue \ --extract $valid.snp \ --out $Output
The meta-analysis approach can be used to evaluate the association between SNPs and biological traits by combining various GWAS summaries from different studies. The following paragraphs will provide the key concepts for performing a meta-analysis. To have concrete information about the meta-analysis approach, we recommend this review article.63
Heterogeneity of source. Heterogeneity in data could be derived using GWAS summary statistics. The standard variables to estimate heterogeneity in data include odds ratios, standardized effect sizes, other metrics along with their uncertainty (e.g. variance or 95% confidence interval) and the accompanying p-values. However, there might be many other variables for each dataset that are important to deal with in order to estimate heterogeneity in data.
Standard meta-analysis. This is used to perform the meta-analysis approach which is to sum the Z-scores across all studies and weigh them appropriately using the sample sizes. See equation (3) below.
where is -score from study, and weight of studies relative to population size.
Independence of the samples. Conventional meta-analysis has an assumption that assumes that effect sizes are independent. Simulation studies demonstrate that failure to account for overlapping samples could greatly inflate type I error.64 If accounting for the overlap is unavoidable, the overlap/covariance can be estimated using Z’s covariance between summary statistics. Some tools can account for overlapping samples, such as METAL software65 and ASSET.66
Correcting for population structure with genomics control. The presence of population structure in the GWAS study can impact an over-dispersion of the corresponding association test statistics. One approach to limit this problem is to correct statistics of each summary using genomic control. This correction factor is given as the inflation factor () which is the test statistics’ median divided by its expectation under the null hypothesis.67
p-values versus Z scores. Meta-analysis methods based on p-values were widely used in different scientific fields until the . Then became unpopular and almost abandoned in biomedical sciences. Nowadays, the meta-analysis approach is performed using Z-score. There are two methods to estimate Z-score for GWAS data. The first method is demonstrated in equation (4).
In the second method (equation (5)), Z-score is estimated using p-value and the effect of allele.
where is a sign of relation.
Random effects versus fixed effects. In the presence of variability of allelic effect as in trans-ethnic studies, it is common to perform a random-effect meta-analysis to correct variability of effect between different studies. For instance, GWAMA tool68 computes a random-effects variance component using Cochran’s statistic (Q-value) to balance weight used in meta-analysis. On the other hand, Metasoft69 proposed two other different methods to take into account the heterogeneity, which are Random Effects model70 and binary effects model with m-value, i.e. weight of each summary study in summary statistics.71
Heterogeneity test. Heterogeneity at a locus can be reflected in the variability in population or environment. It can be relevant to gene-environment interaction and the reason behind the variability in GWAS approach of each data-set, i.e. covariable, model and approximation of GWAS between summary statistics. The heterogeneity is computed between two sets of summary statistics rather than one locus.
Cochran’s statistic provides a test of heterogeneity of allelic effects at SNPs using equation (6) below.
where denotes study number.
Alternatively, we use Q statistic to quantify the extent of heterogeneity in allelic effects across studies.72 See equation (7) below
Other meta-analysis approaches. Traditionally, meta-analyses of GWAS have focused on combining results of multiple studies for similar traits. The Bayesian framework has been tested to estimate on different phenotypes.73 MetABF tool74 has implemented a method to perform meta-analysis across genome-wide association studies of diverse phenotypes. It is important to note that a recent review on cancer suggests that it is possible to obtain “noteworthy” Bayesian results at higher p-values that are not considered statistically significant in GWAS.75
Study alignment and error trapping. Meta-analysis aggregates various summary statistics. Therefore, any error to designate the effect allele and other allele or strand issue can cause an error in estimate . Such errors might lead to misleading meta-analysis results as it increases of Cohran and heterogeneity between studies.
Effect size. By conducting a meta-analysis, researchers often neglect the sample size variation among different studies “true effect sizes are the same across studies”. However, in some cases, researchers introduce the correct effect size by considering the posterior probability for each study. Some software for estimating effect size are given in Table 3.
Software | specificity | remarks | link and publication |
---|---|---|---|
GWAMA | FE, RE, GC | short manual | Web Page50 |
Meta-Soft | FE, RE, RE2, FE2 and BE, Q, GC | R script to plot effect of study | Web Page70 |
MR-MEGA | FE, RE, Q, | manual limited | Web Page76 |
METAL | FE, RE, Q, , SOC, p-value, GC | Web Page65 | |
PLINK (1.9) | FE, RE | Few options described | Web Page15 |
Meta-analysis output. Meta-analysis provides a new set of summary statistics. For each position that has not been discarded, new statistics will be calculated. These statistics include new values for , and p-value. Users should be aware of the LD and the reference population used.
More resources for meta-analysis. We recommend h3bionet/h3agwas for meta-analysis pipeline. In addition, those interested in meta-analysis are directed to a review published by Zeggini et al.77 and another review article by Evangelou and Ioannidis.63 The code below demonstrates how to use the Metasoft tool to perform meta analysis.
# example with metasoft ## reformatting input files. ### File 1 R -e "library(data.table);data.gwas<-fread('$File 2'); data.gwas<-data.gwas[,c('CHR','SNP','BP','ALLELE1', 'ALLELE0','BETA','SE','P_BOLT_LMM')]; names(data.gwas)<-c('CHR', 'SNP','BP','A1', 'A2','BETA', 'SE','P'); data.gwas<-data.gwas[data.gwas[['P']]<1]; write.table(data.gwas, file='gwas_file2.qassoc', row.names=F, col.names=T, sep='\t', quote=F)" ### File 2 R -e "library(data.table);data.gwas<-fread('$File 1'); data.gwas<-data.gwas[,c('CHR','SNP','BP','ALLELE1', 'ALLELE0','BETA','SE','P_BOLT_LMM')]; names(data.gwas)<-c('CHR', 'SNP','BP','A1', 'A2','BETA', 'SE','P'); data.gwas<-data.gwas[data.gwas[['P']]<1]; write.table(data.gwas, file='gwas_file1.qassoc', row.names=F, col.names=T, sep='\t', quote=F)" ## metasoft ### Metasoft with binary effect java -jar Metasof/Metasoft.jar \ -input file_merge_all.meta\ -output meta.meta \ -pvalue_table Metasof/HanEskinPvalueTable.txt\ -binary_effects\
Colocalization. Colocalization is an approach used to integrate annotations with GWAS results. The annotations resources include gene expression (eQTLs), protein expression (pQTLs), exon splicing (sQTLs), DNA methylation (mQTLs), and chromatin acetylation and chromatin accessibility (caQTLs).78
Statistics for colocalization studies. Several parametric and non-parametric statistics can be done for colocalization studies.79
Resources for colocalization studies. We recommend the following scientific resources for those who are beginners in this field:
• From GWAS to function: using functional genomics to identify the mechanisms underlying complex diseases.78
• Colocalization analyses of genomic elements: approaches, recommendations and challenges.79
• Bayesian test for colocalisation between pairs of genetic association studies using summary statistics.80
• LocusFocus: web-based colocalization for the annotation and functional follow-up of GWAS colocalization of GWAS and eQTL signals detects target genes.81
• A powerful and versatile colocalization test.82
Using summary statistics from multiple phenotypes and traits. The methods for GWAS are mostly focused on single variant analysis with a single phenotype or trait. Increasing evidence shows that pleiotropy, one gene’s effect on multiple phenotypes, plays a pivotal role in many complex traits. Therefore, associating different GWAS results for multiple phenotypes can provide an extensive power by aggregating multiple weak signals.83 Different approaches have been developed to integrate dependent p-values to assess the association between a gene and multiple correlated phenotypes.83 Several tools exist to perform multi traits analysis, including Multi-Trait Analysis of GWAS (MTAG)84 and CPASSOC package.85
Mendelian randomisation (MR) is a statistical approach that can be defined as “the use of genetic variants as instrumental variables to investigate the effects of modifiable risk factors for disease”.86 For instance, one trait (phenotype or disease) might be affected by confounding or reverse causation rather than a conventional observational variable. Therefore, MR aims to provide a statistical frame to verify the causality between locus and phenotype and exclude pleiotropy. Such methods will provide a reliable explanation of the results.
Assumption of MR. There are three main assumptions for MR.87-90 These are (i) the genetic variant that is associated with the exposure (significant association), (ii) the genetic variant that is independent of the outcome given to the exposure and all confounders (measured and unmeasured) of the exposure-outcome association, (iii) the genetic variant that is independent of factors (measured and unmeasured) that confound the exposure-outcome relationship.
Statistical methods for MR. MR general strategy is to compare values or p-value of the same position of two or more different GWAS using related/confounding phenotype. Several methodologies have been developed for MR analysis, an example is the ratio of coefficients estimator, which can be modeled using equation (8):
where represents the change in exposure per variant allele, and represents the change in outcome per variant allele. Another model is the two-stage least squares. It employs a two-stage regression approach with two regression models where the first stage regression’s output is used as the input of the second stage regression. More methods for MR exist, including control function estimator, limited information maximum likelihood method, verse variance weighted method, and MR-Egger method.91
MR software. Several tools have been developed to compute MR, e.g GSMR (Generalised Summary-data-based Mendelian Randomisation) from GCTA, TwoSampleMR (version 0.4.20), MR-PRESSO, and MR-LDP which is integrated LD information. We recommend the following tutorial: https://bioconductor.org/packages/release/bioc/vignettes/GMRP/inst/doc/GMRP.pdf.92
The code below demonstrates how to use gcta64 tool to perform Mendelian randomisation analysis.
#reformat file 1 R -e "library(data.table); data.gwas<-fread('$File'); data.gwas<-data.gwas[,c('SNP','ALLELE1','ALLELE0','A1FREQ','BETA','SE','P_BOLT_LMM')]; names(data.gwas)<-c('SNP','A1', 'A2','freq','b', 'se','p'); data.gwas[['N']]<-10000; data.gwas<-data.gwas [data.gwas[['p']]<1]; SNPtmp<-table(data.gwas[['SNP']]); UniqSNP<-names(SNPtmp)[SNPtmp==1]; write.table(data.gwas[data.gwas[['SNP']] %in% UniqSNP],file='FilePheno1',row.names=F, col.names=T, sep='\t', quote=F)" #reformat file 2 R -e "library(data.table); data.gwas<-fread('$File'); data.gwas<-data.gwas[,c('SNP','ALLELE1','ALLELE0','A1FREQ','BETA','SE','P_BOLT_LMM')]; names(data.gwas)<-c('SNP','A1', 'A2','freq','b', 'se','p'); data.gwas[['N']]<-10000; data.gwas<-data.gwas[data.gwas[['p']]<1]; SNPtmp<-table(data.gwas[['SNP']]); UniqSNP<-names(SNPtmp)[SNPtmp==1]; write.table(data.gwas [data.gwas[['SNP']] \%in% UniqSNP], file='FilePheno2',row.names=F, col.names=T, sep='\t', quote=F) # create echo "P1 FilePheno1" > exposure echo "Exp FilePheno2" > outcome ## gcta need a plink file or bgen file ## GSMR analyses, ###forward-GSMR analysis (coded as 0), ###reverse-GSMR analysis (coded as 1) ###and bi-GSMR analysis (both forward- and reverse-GSMR analyses, coded as 2). gcta64 --bfile plkfile --gsmr-file exposure outcome \ --gsmr-direction 0 --out test_gsmr_result
MR and gene expression. Recently, some methods have integrated MR into GWAS and eQTL to test if the effect of gene expression is zero on the trait.93-95 These methods provide a promising way to combine GWAS summary statistics and expression data.
Our proposed pGWAS pipeline consists of three main steps: preprocessing, visualization, and the downstream pGWAS analysis (refer to Figure 1).
Step 1, the preprocessing step, aims to control checks and ensure a correct input file format for the downstream pGWAS analysis. The main purpose of this step is to ensure that SNPs’ positions in the GWAS summary file accurately match the genomic coordinates in the downstream reference panel if any is available. The UCSC LiftOver tool (http://genome.ucsc.edu/cgi-bin/hgLiftOver) is widely used to correct genomic position mismatches between the GWAS summary file and the reference panel. However, other options exist, including: Bioconductor rtracklayer package,96 Assembly Converter,97 NCBI Remap,98 and the CrossMap tool.99
Step 2, the visualization step, aims to visualize the raw input GWAS summary data pictorially, primarily through two scatter plots: Manhattan plot and quantile-quantile (Q-Q) plot. The Manhattan plot is widely used in genomics to visualize the results of GWAS studies. In the Manhattan plot, the X-axis represents the positions on chromosomes, while the Y-axis reflects genomic association strength with a given trait. The Q-Q plot is used to check the normality of data -mainly the normality of p-values distribution. Step 2 can be completed using the qqman R package.100
Step 3, of downstream pGWAS analysis, can be divided into three approaches based on their underlining data heterogeneity. First, if there is homogenous data, researchers can perform a single variant pGWAS analysis such as to test the association between a particular variant and trait. Furthermore, researchers can perform gene set analysis or network/pathway analysis to understand the biological function underlying a list of statistically significant variants. For instance, the MAGMA tool can be used to conduct gene set-based pGWAS research,101 while pathway analysis could be done using the PASCAL (Pathway scoring algorithm) tool.102 Researchers can also undertake fine-mapping analysis and MR for homogenous GWAS summary files. Second, researchers can perform PRS analysis and genetic risk score if individual-level data is available. Third, if there is heterogeneous data from numerous independent studies, a meta-analysis can be performed.
This articles demonstrates various pGWAS methods. The advancement in these pGWAS techniques solves a significant problem in our efforts to understand the vast amount of data generated and explore fundamental biology. However, several issues should be taken into consideration when performing pGWAS analysis across trans-ethnic GWAS studies. Some of these issues include: (i) heritability of the trait, (ii) GWAS sample size, (iii) polygenicity of the traits, (iv) genetic architecture of the trait, and (v) genotype-environments interactions. Furthermore, we expect that in the future many pGWAS methods will be developed to address these limitations either for a particular ethnic group or for multi-ethnic groups.
Views | Downloads | |
---|---|---|
F1000Research | - | - |
PubMed Central
Data from PMC are received and updated monthly.
|
- | - |
Is the rationale for developing the new method (or application) clearly explained?
Yes
Is the description of the method technically sound?
Yes
Are sufficient details provided to allow replication of the method development and its use by others?
Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?
No source data required
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
Yes
References
1. Seren Ü: GWA-Portal: Genome-Wide Association Studies Made Easy.Methods Mol Biol. 2018; 1761: 303-319 PubMed Abstract | Publisher Full TextCompeting Interests: No competing interests were disclosed.
Reviewer Expertise: plant ecology and genetics
Is the rationale for developing the new method (or application) clearly explained?
Yes
Is the description of the method technically sound?
Yes
Are sufficient details provided to allow replication of the method development and its use by others?
Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?
No source data required
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: statistical genomics, animal breeding
Alongside their report, reviewers assign a status to the article:
Invited Reviewers | ||
---|---|---|
1 | 2 | |
Version 1 04 Oct 21 |
read | read |
Provide sufficient details of any financial or non-financial competing interests to enable users to assess whether your comments might lead a reasonable person to question your impartiality. Consider the following examples, but note that this is not an exhaustive list:
Sign up for content alerts and receive a weekly or monthly email with all newly published articles
Already registered? Sign in
The email address should be the one you originally registered with F1000.
You registered with F1000 via Google, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Google account password, please click here.
You registered with F1000 via Facebook, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Facebook account password, please click here.
If your email address is registered with us, we will email you instructions to reset your password.
If you think you should have received this email but it has not arrived, please check your spam filters and/or contact for further assistance.
Comments on this article Comments (0)