Performing post-genome-wide association study analysis: overview, challenges and recommendations

Genome-wide association studies (GWAS) provide huge information on statistically significant single-nucleotide polymorphisms (SNPs) associated with various human complex traits and diseases. By performing GWAS studies, scientists have successfully identified the association of hundreds of thousands to millions of SNPs to a single phenotype. Moreover, the association of some SNPs with rare diseases has been intensively tested. However, classic GWAS studies have not yet provided solid, knowledgeable insight into functional and biological mechanisms underlying phenotypes or mechanisms of diseases. Therefore, several post-GWAS (pGWAS) methods have been recommended. Currently, there is no simple scientific document to provide a quick guide for performing pGWAS analysis. pGWAS is a crucial step for a better understanding of the biological machinery beyond the SNPs. Here, we provide an overview to performing pGWAS analysis and demonstrate the challenges behind each method. Furthermore, we direct readers to key articles for each pGWAS method and present the overall issues in pGWAS analysis. Finally, we include a custom pGWAS pipeline to guide new users when performing their research.


Introduction
Genome-wide association studies (GWAS) have been used to identify genetic variants associated with specific traits or diseases of interest. One of the advantages of performing GWAS is that they do not require prior knowledge of the biological hypothesis underpinning the genetic machinery of the investigated trait. Many GWAS have revealed hundreds of common variants that are associated with various phenotypes, including common diseases. 1 Edwards et al reported the first GWAS for age-related macular degeneration in 2005. 2, 3 In the last decade, many GWAS have been reported in scientific databases 3 and these studies have revealed the presence of various single nucleotide polymorphisms (SNPs). For instance, GWAS Catalog reported 4865 publications and 247051 associations in 2021. These reported SNPs could be used to better understand the molecular mechanisms of common diseases and biological pathways of interesting traits.
Over the last decade, most GWAS have been known to detect the genetic signals of a single gene or a single genetic marker, i.e., interpreting genes based on detected signals (positions) on genomic coordinates using a specified Genome Assembly. 4 However, most complex diseases that have been targeted by GWAS are known to be caused by multiple genes that could be influenced by many other factors. It is known that GWAS typically report SNPs as statistically significant when their associated p-values are less than 5e-08. Accordingly, GWAS might not detect genetic variants with low or moderate risk. 5,6 Thus, the ability of GWAS to detect a significant genomic position depends on heritability on phenotype, minor allele frequency (MAF), and sample size. Traditional GWAS could therefore be faced with the challenge of not being able to detect variants that are associated with low disease risk, implying that traditional GWAS results are prone to unreliable findings due to their false negative results. 7 Furthermore, GWAS might fail to detect a significant signal if the effect of a variant in another gene is not taken into consideration. 5 It is also limited in identifying changes in genotype as a response to environmental changes, i.e., detecting the impact of genotype interaction with the environment. 8 These limitations of GWAS occur primarily because of the undetected effect of gene polymorphism. More specific challenges of GWAS have been reported in many scientific papers. 1,5,7,[9][10][11][12] Given all of the reported limitations of GWAS, it is crucial to perform post-GWAS (pGWAS) analysis. The overall goal of pGWAS analysis is to use the result of the association between the genotype and phenotype (summary statistics), with the following objectives: • Transferability of previous result, • Identification of new significant functional variants, i.e. lead SNPs, • Identification of novel disease susceptibility genes, genotype-phenotype associations, and biological pathway network, and • Building a polygenic risk score using summary statistics.
So far, the most common approaches to perform pGWAS analysis include the following three approaches: (i) Singlevariant approach; (ii) Gene-scoring approach; and (iii) Pathway-sub-network-based approach. However, pGWAS approaches could be categorized into many classes based on their usage (see section below). Also, there are several methods/tools for pGWAS reported in different articles.
In this article, we provide an overview to performing pGWAS analysis and discuss the challenges behind each method. Furthermore, we direct readers to key articles for each pGWAS method and present the overall issues in pGWAS analysis. Finally, we include a custom pGWAS pipeline to guide new users when performing their research.

GWAS analysis
GWAS have become an indispensable approach in providing insight into understanding genotype-phenotype associations of complex disease. While the design for GWAS experiments are well established as bench-work, the computational methods for the analysis of GWAS data are still evolving. The typical computational pipeline to analyse any GWAS data consists of two essential tasks: a) upstream GWAS analysis (classic GWAS methods), and b) downstream GWAS analysis. The latter step is known as pGWAS analysis.
The upstream GWAS analysis is a multi-step task resulting in a list of statistically significant SNPs. This step often starts by checking the quality control of raw GWAS data which is a crucial step for the task of pGWAS analysis. It is obvious that unclean data lead to unreliable results. Therefore, many parameters such as quality control per sample, relatedness, replicate discordance, SNP quality control, sex inconsistencies, and chromosomal anomalies should be checked. 13,14 After obtaining raw data-sets with high quality scores, the next step is to report the statistically significant SNPs. Many statistical tests are available for this, including Chi-square test, Fisher's exact test, Cochran-Armitage trend test, Odds ratio, Logistic regression, ANOVA, Transmission Disequilibrium test, Bonferroni correction, and many other methods.
Many free tools are available to perform GWAS upstream data analysis such as: Plink, 15 PLATO, 16 EIGENSOFT, 17 and STRUCTURE. 18 Several other tools are also available as packages within R software, the most popular open-source software for statistical computing. 19 pGWAS approaches Single-variant approach Genome wide significant threshold. This approach provides the statistics at a SNP level. This approach aims to score the association between SNPs and the target trait. Usually, the score defined by a β value or Odd Ratio (OR), and its standard error (SE) gives an idea of the genotype's effect on the phenotype. Effect strength is computed using OR. Also, scientists use p-value as a measure of how likely this effect is to occur by chance. Statistically, many researchers use p-value to evaluate the hypothesis that there is no statistical evidence for SNP-trait associations. However, this hypothesis will be rejected when p-value is less than a predetermined threshold that is adjusted for a multi-test. Many methods are used for multiple test p-value correction. These methods include Bonferroni correction and false discovery rate (FDR). 20 However, some of these methods do not consider linkage disequilibrium (LD) and they may be too stringent. In general, a threshold of 5e-08 is acceptable for human association studies. 21,22 This value has been computed using independent signals in genomes. However, this value is not absolute and it depends on many factors, including LD (and diversity), array type, whole genome sequencing or whole exome sequencing, position number in array or sequencing, imputation panels, and positions finally analyzed. 23 For trans-ethnic populations, it is recommended to estimate the threshold based on population diversity and LD. For instance, analyzing 1000 Genomes data, the suggested significance thresholds were 3:24e-08 for Africa, 9:26e-08 for Europe, 1:83e-07 for Mixed America, 1:61e-07 for East Asia and 9:46e-08 for South Asia. 24 Although, a recent study using an African population proposed a threshold of 5e-09. 25 Suggestive threshold. Some authors consider GWA threshold very stringent, so studies often include a "suggestive threshold". This is superior to a genome-wide significant threshold, and values found in the literature are 1e-05, 1e-06 or lower. Studies estimated on "independant LD block" from 1000 Genomes used 1e-05 for the Affymetrix 500K and Illumina 317K GWAS SNP panels, and 1e-06 for HapMap CEPH Utah and Yoruba populations. 26 The code below contains an R command to select significant SNPs using a cutoff value of 5e-08.
## library need : data. Inflation factors. Population and cryptic relatedness can cause spurious associations in GWAS with p-values higher than random leading to false positive signals. Genomic control (GC) approach is extensively used to effectively control false positive signals. Genomic inflation factors (λ) can be computed as the median of the resulting chi-squared (χ 2 ) test statistics divided by the expected median of the chi-squared distribution. 27 Refer to equation 1 below Z (Beta=Se), χ 2 and p-value can be used to compute inflation factor, using Z 2 for Z, quantile of 1 -p-value at 1 degrees of freedom. The code below demonstrates how to compute inflation factors using the build-in R function qchisq that can be used to calculate value of quantile for a χ 2 distribution.
# compute inflation factors ## use quantile function of chisq to p.value data.gwas$p.value.qchisq <-qchisq(data.gwas$p.value, 1, lower.tail=FALSE) ## computed lambda median(data, na.rm=TRUE)/qchisq(0.5, df) Global visualisation of results. A common way to visualize GWAS results is the Manhattan plot. The Manhattan plot demonstrates the physical location of SNPs distributed by chromosome in the x-axis with the degree to which a SNP is associated with the target trait, i.e. p-value scaled by log10 in the y-axis. A Manhattan plot can be done in R software using the qqman package, which includes functions for creating Manhattan plots and q-q plots from GWAS results. 28 The code below demonstrates how to create Manhattan plots, and q-q plots the qqman R package.
## do a qqplot and Manhattan plot using library qqman library('qqman') ## used function qqplot from qqman qqplot(data.gwas$p.value) ## used function manhattan from qqman manhattan(data.gwas, chr="chr", bp="bp", p="p.value",snp="rsid") Local visualisation of results. Local visualisation can be done using the method described by Pruim et al. 29 In this method, information of LD, genes, and previous results of GWAS can be added. Zoom version 2 offers a virtual analysis of local GWAS results. On the other hand, LocusTrack from the UCSC genome-browser adds more annotation compared to LocusZoom. 30 Furthermore, BigTop 31 is capable of providing three-dimensional visualisation of data using allele frequency as a third dimension. The code below demonstrates how to use LocusZoom to visualize SNPs considering LD information.
## use R to reformat your file to be used by locuszoom R -e "library(data.

Gene-scoring approach
This approach considers the association between a trait and all SNPs within a predefined window around genes rather than each marker individually. In many cases, this approach is more powerful than traditional individual-SNP-based GWAS.
A gene score (GS) is a value given to a gene representing some measures related to a genetic trait. Thus, all the statistical summary values such as p-values and fold changes could be considered gene scoring values. For pGWAS analysis, GS is defined as the sum of all statistically significant alleles (i.e. the risk alleles) of the selected SNPs present in each individual under investigation. 32 Various algorithms have been developed to calculate GCs based on GWAS summary statistics. [33][34][35] It is a common approach to encode and adjust SNPs values prior to calculating GC during analysis. 32 The process of SNP encoding aims to ensure that all SNPs are positively correlated with the outcome. 32 One more essential aspect to consider while calculating GS is the effect size. Variation in the effect size reflects reduction of predictive power of GS. 32 The GS method mostly used in pGWAS analysis is the gene-based p-value.
Gene level p-value. In many pGWAS pipelines, the first step in downstream GWAS analysis is to assign the SNPs to functional genomic features. The latter includes: coding genes, non-coding RNAs, 5'UTR, 3'UTR, proximal promoters, regulatory element, and enhancer elements.
Two common methods for assigning SNPs to their corresponding genes are GLOSSI 36 and VEGAS. 34,37 GLOSSI is available as an R package, while VEGAS is available as an online tool as well as a stand-alone tool to be run on a local machine. Besides VEGAS and GLOSSI, many pGWAS tools provide procedures to calculate genes' p-values. For instance, ancGWAS provides four different methods to calculate the genes p-values: Simes, Smallest, Fisher, and Gwbon. Similar to ancGWAS, MAGMA provides three methods to calculate genes' p-values. However, one of the MAGMA methods is similar to the VEGAS method. The other two methods are by considering either the smallest SNP p-value, or the highest SNP p-values.

Pathway-sub-network-based approaches
This approach considers the fact that complex biological phenomena addressed by GWAS, including the molecular basis of the rare disease, often arise due to gene interaction rather than single gene effect. Therefore, scientists analyse GWAS based on the biological network theory to understand the disease-causing genes and mechanisms involved in traits and complex diseases, such as in rare diseases. This approach is based on the results obtained from the gene-based association test and provides a higher level of complexity by considering biological networks and/or genes ontology. Analysing GWAS based on biological networks allows us to capture biological interactions between various molecules such as proteins, functional DNA motifs, coding and non-coding RNA, as well as disease mechanisms and to consider epigenetic changes, including methylation states or other modifications (phosphorylation, acetylation, etc). Also, this approach aims to map genes that are associated with significant SNPs into known pathways/Gene Ontology terms. The result of this approach provides information about the over-represented pathways in a given set of genes/SNPs.
Fine-mapping and lead SNPs GWAS will often identify a number of SNPs in LD with each other as being associated with the phenotype. The lead SNPs are those with the most significant p-valuethey may be causal or not. The other SNPs in this region may only have an association because they are in LD with the causal SNP or they may be independently associated. This GWAS limitation may be observed when integrating information of variability in allele frequency, SNPs in LD block and imputation. 38,39 For instance, when simulating GWAS results using OR value of 1:5 and allele frequency of 0:5, only 21% of the simulations demonstrated that the identified causal variants were not the most strongly associated variants. 40 On the other hand, changing the previous parameters OR value of 1:1 and allele frequency of 5%, resulted in 2:4% concordance between causal variants and lead SNPs. 40 Therefore, changing the previous parameters OR value of 1:1 and allele frequency of 5%, resulted in 2:4% concordance between causal variants and lead SNPs. 40 Therefore, fine-mapping methods are used to identify causal variants that are associated with the target trait and the number of putative causal variants from GWAS data. The fine-mapping approach integrates summary statistics from GWAS data, LD and functional annotations. There are two fine-mapping methods: (i) heuristic method that penalizes regression models, and (ii) Bayesian fine-mapping methods. 38,39 Lead SNPs. Lead SNPs are defined as independent SNPs that have reached a minimum p-value threshold, i.e. they are independent of each other at the LD threshold. It is common to measure LD as r 2 where the square of the correlation coefficient between any two indicator variables is r 2 . 41 PLINK has implemented a function called'ld-clump' that clumps independent SNPs. 15 Moreover, FUMA tool 42 identifies lead SNPs by double clumping method. The first clumping is used for clumping SNPs with p-value < 0:05 at genome-wide significant p-value, i.e. p-value < 5e-08 and independent at r 2 < 0:6. This first clumping function reports significant independent SNPs. The second clumping is of significant independent SNPs at r 2 < 0.1 and reports lead SNPs. The code below demonstrates how to use the Plink tool to perform clumping to get lead SNPs.
## define lead snps using plink # --clump-p1 Significance threshold for index SNPs # --clump-p2 Secondary significance threshold for clumped SNPs # --clump-r2 LD threshold for clumping # --clump-kb Physical distance threshold for clumping # --clump-snp-field your snp field must be same than in your assoc file # --clump-field: p value header plink --bfile plinkbase --clump result.gwas --clump-snp-field rsid\ --clump-field p.value --clump-p1 0.001 --clump-p2 0.1 \ --clump-r2 0.1 --clump-kb 250 --out assoc Heuristic fine-mapping approaches. These are used to identify potential causal SNPs. 38 They are proposed to filter SNPs around lead SNPs considering the value of their pairwise correlation r 2 . They consider a hierarchical clustering technique to cluster all SNPs in a given region based on their pairwise r 2 . Another way is to use LD block by direct extraction of block and the selection of the same position on the block or by the visualisation representation of LD using LocusZoom 29 or Haploview. 43 Nonetheless, the described method is not a statistical approach to define putative causal variants, such as Bayesian methods or regression models. The code below demonstrates how to combine Plink tool and Haploview to perform heuristic fine-mapping.

P MjD ð Þ
Users first define an initial number of causal variant c (between 1, m SNPs). This number c is defined using genome-wide significant SNPs. To model Bayesian statistics, software will define a model M that contains c SNPs. This model often will restrain choice to the significant SNPs or suggest significant SNPs (m) in the windows. Considering the rule of combination C m,c ¼ m! c! nÀc ð Þ! , the probability model P DjM ð Þis relatively easier to compute. The following Bayesian rules can be used Most of the tools are aimed to find a maximum probability to have the best combination of the causal variants. To model Bayesian statistics, many tools integrate various data such as LD and GWAS summary statistics.
Bayesian methods: posterior inclusion probability. This approach is used to compute post inclusion probability (PIP) at each SNP i. The PIP is computed by the sum of the posteriors over all models that include SNP i as a causal variant. 38 Using the rank of PIP is an excellent way to select putative causal variants. 38 The PIP approach should be used with caution to identify causal variants in the high LD regions. Therefore, it is recommended to estimate the posterior expected number of causal SNPs by summing the estimated PIPs for all SNPs in the region.
Bayesian methods: credible sets. This approach is used to define a set of variants that could have a good candidate. One way to estimate a credible set is to use PIP by ranking the values and doing the cumulative sum of PIP from the largest to the smallest. Then select all variants where the sum is less than the predefined α cutoff value. In general, researchers use 99% or 95% as a recommended value for α. Studies have shown that credible set and coverage probabilities are overconservative in most fine-mapping situations as data sets are not randomly selected from among all causal variants. Therefore, an adjusted coverage is proposed to reduce such over-conservation in this approach. 44 The code below demonstrates how to combine Plink tool and FINEMAP to perform Bayesian-based fine-mapping analysis.
# extract a range of interest plink -bfile $plk --keep-allele-order --chr chr --from-kb begin --to-kb end --make-bed -out plink_range #compute LD plink --r2 square0 yes-really -bfile plink_range -out ld_range # format ld sed 's/\t/ /g' tmp. Bayesian methods: trans ethnic fine-mapping. This is used to perform trans-ethnic fine-mapping studies using simulation results in similar fine-mapping resolution among the European and Asian ancestries. However, the inclusion of samples with African ancestry in meta-analysis leads to a significant improvement in fine-mapping resolution due to the lowest LD in the African ancestry population. The probability that the lead GWAS variants were also the causal variants increased using trans-ethnic GWAS data. 40, 45 It is a process that relies on disparate LD patterns in populations of diverse genetic ancestries to localize the causal variants. This approach has been successfully implemented to fine-map and leads to several common GWAS findings. 46 Integrating annotation into fine-mapping. Functional annotations have been shown to improve the discovery power and fine-mapping accuracy. Therefore, some tools such as CausalDB, 47 PAINTOR 48 and BIMBAM 49 have integrated expression quantitative trait locus (eQTL) information in fine-mapping approach.
Bayesian methods: software. Several Bayesian-based fine-mapping tools have been developed using summary statistics, LD and eQTL (Table 1). Also, some pipelines have integrated different Bayesian-based software in order to compare their results (fine-mapping of h3agwas, fine-mapping in FinnGen, and FM pipeline).
Other fine-mapping approaches. Other approaches, such as regression models, are used with all SNPs in the lead SNPs region to analyze SNPs jointly. The comparison of various approaches including elastic net, ridge, Lasso, MCP, and the normal-exponential shrinkage prior, have shown that penalized methods outperform single marker analysis. 53 Furthermore, a forward/stepwise regression can be used to test the independence of multi SNPs using the following algorithm: • Order the list of SNPs by their p-values p 0 ,p 1 ,…, p nÀ1 ½ • Remove highest p-value • Apply models with p 1 ,…, p nÀ1 ½ with p 0 as a co-variate and check if any is significant • Repeat process.
The R library SusieR has implemented a method of regression fine-mapping analyses.
Other resources for fine-mapping analyses. Several review articles have been published for fine-mapping analyses. Nevertheless, we recommend the following scientific articles as a good source for beginners: "From genome-wide associations to candidate causal variants by statistical fine-mapping", 38 "A practical view of fine-mapping and gene prioritization in the post-genome-wide association era", 39 and "Fine-mapping genetic associations". 54 Conditional association and imputation using summary statistics The aim of performing conditional association and imputation using summary statistics is to evaluate the association between SNPs and biological trait by combining various GWAS summaries from different studies. This method requires a reference population to estimate LD information. The imputation method performs meta-analysis to infer the missing genotypes among the studies before evaluating the association between the SNPs and the biological trait. This method estimates the effects of many variants that are not directly genotyped.

Polygenic predictions of disease risk
This method is used to predict disease risk using GWAS summaries. 55 Polygenic risk score (PRS) could be used to predict an individual's likelihood to develop a specific trait or to estimate the level of predictive power that the trait is associated with a particular set of variants. 56 Although PRS methods are classified into Bayesian-based methods and non-Bayesian methods, there are more classifications of underlying PRS methods. 57 PRS is calculated by aggregating effects from a large set of causal SNPs. Several tools were developed to calculate PRS. For performing PRS studies, we highly recommend this recent review paper. 58 PRS is usually computed after the challenges associated with GWAS are carefully addressed. Here, we demonstrate  K is the number of dimensions, which needs to be defined (typically 10). This is an important step of the QC that consists of multiple proceedings.
Chip heritability. The chip heritability (h 2 SNP )is also known as SNP-based heritability, which is defined as the portion of the phenotypic variation that the genotyped genetic marker can explain. 60,61 Higher values of heritability indicate that the phenotype is explained best by the genotype, i.e. set of SNPs. Choi et al. 58 recommend h 2 SNP > 0:05 to perform PRS analysis. To estimate h 2 SNP , researchers should use LD score regression that could be used to distinguish polygenicity (SNPs effects) and confounding biases, including cryptic relatedness and population stratification. 62 The code below demonstrates how to use Plink to perform quality control checks and calculating PRS. # $bfile: Target data set in Plink binary format. # $QC: Base data set i.e., base GWAS summary statistic. This file contains P-value information # $p1: P-value threshold for a SNP to be included as an index SNP. Choose value of 1 if you want to ##include all SNPs are include for clumping. # $r2: Cutoff for r2 value i.e., SNPs having value higher than given r2 will be removed # $kb: cutoff value window size in kilobase, i.e., SNPs within $kb of the index SNP are considered for clumping. # $SNP: Name the column SNP that containing the SNP IDs # $P: Name of the column that containing the P-value information # $Output: Output file ################### Quality Control of Target Samples plink\ --bfile $bfile \ --maf 0.05 \ --mind 0.1 \ --geno 0.1 \ --hwe 1e-6 \ --make-just-bim \ --make-just-fam \ --out $Output.qc plink \ --bfile $bfile \ --score $QC 3 4 12 header \ --q-score-range $listpvalue \ --extract $valid.snp \ --out $Output

Meta-analysis
The meta-analysis approach can be used to evaluate the association between SNPs and biological traits by combining various GWAS summaries from different studies. The following paragraphs will provide the key concepts for performing a meta-analysis. To have concrete information about the meta-analysis approach, we recommend this review article. 63 Heterogeneity of source. Heterogeneity in data could be derived using GWAS summary statistics. The standard variables to estimate heterogeneity in data include odds ratios, standardized effect sizes, other metrics along with their uncertainty (e.g. variance or 95% confidence interval) and the accompanying p-values. However, there might be many other variables for each dataset that are important to deal with in order to estimate heterogeneity in data.
Standard meta-analysis. This is used to perform the meta-analysis approach which is to sum the Z-scores across all studies and weigh them appropriately using the sample sizes. See equation (3) below.
where Z k is Z-score from K t h study, and w k weight of studies relative to population size.
Independence of the samples. Conventional meta-analysis has an assumption that assumes that effect sizes are independent. Simulation studies demonstrate that failure to account for overlapping samples could greatly inflate type I error. 64 If accounting for the overlap is unavoidable, the overlap/covariance can be estimated using Z's covariance between summary statistics. Some tools can account for overlapping samples, such as METAL software 65 and ASSET. 66 Correcting for population structure with genomics control. The presence of population structure in the GWAS study can impact an over-dispersion of the corresponding association test statistics. One approach to limit this problem is to correct statistics of each summary using genomic control. This correction factor is given as the inflation factor (λ) which is the test statistics' median divided by its expectation under the null hypothesis. 67 p-values versus Z scores. Meta-analysis methods based on p-values were widely used in different scientific fields until the 1980s. Then became unpopular and almost abandoned in biomedical sciences. Nowadays, the meta-analysis approach is performed using Z-score. There are two methods to estimate Z-score for GWAS data. The first method is demonstrated in equation (4).
In the second method (equation (5)), Z-score is estimated using p-value and the effect of allele.
where sign Δ i ð Þ is a sign of relation.

Random effects versus fixed effects.
In the presence of variability of allelic effect as in trans-ethnic studies, it is common to perform a random-effect meta-analysis to correct variability of β effect between different studies. For instance, GWAMA tool 68 computes a random-effects variance component using Cochran's statistic (Q-value) to balance weight used in meta-analysis. On the other hand, Metasoft 69 proposed two other different methods to take into account the heterogeneity, which are Random Effects model 70 and binary effects model with m-value, i.e. weight of each summary study in summary statistics. 71 Heterogeneity test. Heterogeneity at a locus can be reflected in the variability in population or environment. It can be relevant to gene-environment interaction and the reason behind the variability in GWAS approach of each data-set, i.e. covariable, model and approximation of GWAS between summary statistics. The heterogeneity is computed between two sets of summary statistics rather than one locus.
Cochran's statistic provides a test of heterogeneity of allelic effects at SNPs j using equation (6) below.
where N denotes study number.
Alternatively, we use Q statistic to quantify the extent of heterogeneity in allelic effects across studies. 72 See equation (7) below Other meta-analysis approaches. Traditionally, meta-analyses of GWAS have focused on combining results of multiple studies for similar traits. The Bayesian framework has been tested to estimate β on different phenotypes. 73 MetABF tool 74 has implemented a method to perform meta-analysis across genome-wide association studies of diverse phenotypes. It is important to note that a recent review on cancer suggests that it is possible to obtain "noteworthy" Bayesian results at higher p-values that are not considered statistically significant in GWAS. 75 Study alignment and error trapping. Meta-analysis aggregates various summary statistics. Therefore, any error to designate the effect allele and other allele or strand issue can cause an error in estimate β. Such errors might lead to misleading meta-analysis results as it increases Q of Cohran and heterogeneity between studies.
Effect size. By conducting a meta-analysis, researchers often neglect the sample size variation among different studies "true effect sizes are the same across studies". However, in some cases, researchers introduce the correct effect size by considering the posterior probability for each study. Some software for estimating effect size are given in Table 3.
Meta-analysis output. Meta-analysis provides a new set of summary statistics. For each position that has not been discarded, new statistics will be calculated. These statistics include new values for β, σ and p-value. Users should be aware of the LD and the reference population used.
More resources for meta-analysis. We recommend h3bionet/h3agwas for meta-analysis pipeline. In addition, those interested in meta-analysis are directed to a review published by Zeggini et al. 77 and another review article by Evangelou and Ioannidis. 63 The code below demonstrates how to use the Metasoft tool to perform meta analysis.  Statistics for colocalization studies. Several parametric and non-parametric statistics can be done for colocalization studies. 79 Resources for colocalization studies. We recommend the following scientific resources for those who are beginners in this field: • From GWAS to function: using functional genomics to identify the mechanisms underlying complex diseases. 78 • Colocalization analyses of genomic elements: approaches, recommendations and challenges. 79 • Bayesian test for colocalisation between pairs of genetic association studies using summary statistics. 80 • LocusFocus: web-based colocalization for the annotation and functional follow-up of GWAS colocalization of GWAS and eQTL signals detects target genes. 81 • A powerful and versatile colocalization test. 82 Using summary statistics from multiple phenotypes and traits. The methods for GWAS are mostly focused on single variant analysis with a single phenotype or trait. Increasing evidence shows that pleiotropy, one gene's effect on multiple phenotypes, plays a pivotal role in many complex traits. Therefore, associating different GWAS results for multiple phenotypes can provide an extensive power by aggregating multiple weak signals. 83 Different approaches have been developed to integrate dependent p-values to assess the association between a gene and multiple correlated phenotypes. 83 Several tools exist to perform multi traits analysis, including Multi-Trait Analysis of GWAS (MTAG) 84 and CPASSOC package. 85

Mendelian randomisation
Mendelian randomisation (MR) is a statistical approach that can be defined as "the use of genetic variants as instrumental variables to investigate the effects of modifiable risk factors for disease". 86 For instance, one trait (phenotype or disease) might be affected by confounding or reverse causation rather than a conventional observational variable. Therefore, MR aims to provide a statistical frame to verify the causality between locus and phenotype and exclude pleiotropy. Such methods will provide a reliable explanation of the results.
Assumption of MR. There are three main assumptions for MR. [87][88][89][90] These are (i) the genetic variant that is associated with the exposure (significant association), (ii) the genetic variant that is independent of the outcome given to the exposure and all confounders (measured and unmeasured) of the exposure-outcome association, (iii) the genetic variant that is independent of factors (measured and unmeasured) that confound the exposure-outcome relationship.
Statistical methods for MR. MR general strategy is to compare beta values or p-value of the same position of two or more different GWAS using related/confounding phenotype. Several methodologies have been developed for MR analysis, an example is the ratio of coefficients estimator, which can be modeled using equation (8): where β e represents the change in exposure per variant allele, and β o represents the change in outcome per variant allele. Another model is the two-stage least squares. It employs a two-stage regression approach with two regression models where the first stage regression's output is used as the input of the second stage regression. More methods for MR exist, including control function estimator, limited information maximum likelihood method, verse variance weighted method, and MR-Egger method. 91 MR software. Several tools have been developed to compute MR, e.g GSMR (Generalised Summary-data-based Mendelian Randomisation) from GCTA, TwoSampleMR (version 0.4.20), MR-PRESSO, and MR-LDP which is integrated LD information. We recommend the following tutorial: https://bioconductor.org/packages/release/bioc/ vignettes/GMRP/inst/doc/GMRP.pdf. 92 The code below demonstrates how to use gcta64 tool to perform Mendelian randomisation analysis.

Our recommended pGWAS pipeline
Our proposed pGWAS pipeline consists of three main steps: preprocessing, visualization, and the downstream pGWAS analysis (refer to Figure 1).
Step 1, the preprocessing step, aims to control checks and ensure a correct input file format for the downstream pGWAS analysis. The main purpose of this step is to ensure that SNPs' positions in the GWAS summary file accurately match the genomic coordinates in the downstream reference panel if any is available. The UCSC LiftOver tool (http://genome.ucsc. edu/cgi-bin/hgLiftOver) is widely used to correct genomic position mismatches between the GWAS summary file and the reference panel. However, other options exist, including: Bioconductor rtracklayer package, 96 Assembly Converter, 97 NCBI Remap, 98 and the CrossMap tool. 99 Step 2, the visualization step, aims to visualize the raw input GWAS summary data pictorially, primarily through two scatter plots: Manhattan plot and quantile-quantile (Q-Q) plot. The Manhattan plot is widely used in genomics to visualize the results of GWAS studies. In the Manhattan plot, the X-axis represents the positions on chromosomes, while the Y-axis reflects genomic association strength with a given trait. The Q-Q plot is used to check the normality of data -mainly the normality of p-values distribution.
Step 2 can be completed using the qqman R package. 100 Step 3, of downstream pGWAS analysis, can be divided into three approaches based on their underlining data heterogeneity. First, if there is homogenous data, researchers can perform a single variant pGWAS analysis such as χ 2 to test the association between a particular variant and trait. Furthermore, researchers can perform gene set analysis or network/pathway analysis to understand the biological function underlying a list of statistically significant variants. For instance, the MAGMA tool can be used to conduct gene set-based pGWAS research, 101 while pathway analysis could be done using the PASCAL (Pathway scoring algorithm) tool. 102 Researchers can also undertake fine-mapping analysis and MR for homogenous GWAS summary files. Second, researchers can perform PRS analysis and genetic risk score if individual-level data is available. Third, if there is heterogeneous data from numerous independent studies, a metaanalysis can be performed.

Final remarks
This articles demonstrates various pGWAS methods. The advancement in these pGWAS techniques solves a significant problem in our efforts to understand the vast amount of data generated and explore fundamental biology. However, several issues should be taken into consideration when performing pGWAS analysis across trans-ethnic GWAS studies. Some of these issues include: (i) heritability of the trait, (ii) GWAS sample size, (iii) polygenicity of the traits, (iv) genetic architecture of the trait, and (v) genotype-environments interactions. Furthermore, we expect that in the future many pGWAS methods will be developed to address these limitations either for a particular ethnic group or for multi-ethnic groups. Step 1 aims to perform data preprocessing, Step 2 for data visualization, and Step 3 for downstream pGWAS analysis.

Data availability
No data is associated with this article. findings presented in the article? Yes