Automated quality control for genome wide association studies

This paper provides details on the necessary steps to assess and control data in genome wide association studies (GWAS) using genotype information on a large number of genetic markers for large number of individuals. Due to varied study designs and genotyping platforms between multiple sites/projects as well as potential genotyping errors, it is important to ensure high quality data. Scripts and directions are provided to facilitate others in this process.


Introduction
Biases and errors can lead to erroneous associations in casecontrol association tests. Quality control (QC) that removes markers and individuals from a study that may introduce these biases can greatly increase the accuracy of findings. There are many examples of best practices for GWAS QC 1,2 . This paper describes some standard QC steps and also provides links to automated scripts to perform QC making the process easier and easily reproducible. Standard tools such as PLINK 3 and SMARTPCA 4,5 are called by the scripts.
Due to the need for reproducibility in science, automated pipelines that can be used to repeat computational experiments and save relevant parameters is extremely important. Done step-by-step, the QC process can be quite lengthy (about 8 hours for an expert and almost certainly longer for a novice and/or someone with limited computational resources according to Anderson et al. 1 ) and difficult to repeat exactly. Here we present scripts that perform automated GWAS QC using a parameter file that can be saved to redo the process and save human time. A log file is produced that summarizes the process to easily compare different QC parameters and their effects on the data.

Methods
Implementation QC steps implemented in this pipeline. The steps automated here mostly follow the notes on QC 6 developed by MikeWeale and also calls some R 7 scripts described in his notes during the QC pipeline. It is assumed that input files are already in PLINK format. Figure 1 shows a complete QC pipeline that includes combining data from multiple chromosomes and studies and two portions of the QC pipeline. There are two scripts, QC.py which takes advantage of PLINK calls and also PCA.py that does principal component analysis to investigate population stratification.

Gender mismatches
The optional first step in the automated pipeline is a check for gender mismatch using the PLINK '--check-sex' command. This command compares the sex reported in the .fam file and the sex imputed from the X chromosome inbreeding coefficients. This step automatically removes individuals where problems are identified. The step was made optional because our dataset of interest is matched with phenotype/clinical data of higher accuracy. This step can be turned off using the parameter file as described below in the Operation section.

Thresholds
The next steps in this pipeline include checking and applying thresholds for minor allele frequency (MAF), missingness for each individual, and missingness of markers. Minor allele frequency filtering is important because rare genotypes will not show up as often and thus will have less evidence in a GWAS and the calls will be less certain and it is also difficult to detect associations with them. Missingness can lead to false associations if it is non-random with respect to phenotypes or genotypes. Single nucleotide polymorphism (SNP) missingness is the complement to individual missingness and is correlated with SNP quality from the original genotyping assay. Missingness is investigated using PLINK '--missing' and plots are generated as described by Weale 6 . All the plots that are generated during the process are compressed at the end in order to facilitate downloading them when the process is performed remotely on a cluster.
In order to attempt to retain the largest number of markers and individuals that pass QC there is an option to do a two-tiered missingness by individuals filtering. We noticed during testing that this could sometimes lead to final datasets with higher numbers of both. If a value is supplied to the #MIND1 parameter (described below in Operations), then this (expectedly non-stringent) threshold for PLINK '--mind' is used first and the more stringent #MIND is applied in the same step as the PLINK '--geno' and '--maf' thresholds for missingness of markers and minor allele frequency, respectively. If a major reduction in the number of markers or individuals is found during these steps, investigation of the generated graphs can help adjust these thresholds. See notes 6 for more information.
Some reasoning 8 suggests that a minor allele frequency threshold should be set to 10/n where n is the number of markers. The #MAF parameter can be set to 'na' which will use 10/n as a threshold or a threshold value can be explicitly given, such as '.01'.

Hardy-Weinberg equilibrium (HWE)
Markers out of HWE can indicate that there were genotyping errors. However, a strong association signal can also result in deviations from HWE. So here only variants from control samples are checked for deviations from HWE. PLINK '--hardy' is used to generate HWE p-values and a Q-Q plot of the log-P-values of the markers for the controls is generated for manual investigation. A p-value threshold is supplied in the parameter file to remove markers with a pvalue lower than expected.

Cryptic relatedness
Cryptic relatedness (CR) is when pairs of individuals are closely related and can lead to false positive or negative correlations when subjects are treated as independent. The PLINK '--genome' command can estimate relatedness, but is quite slow when there are a large number of markers in a dataset. Therefore, markers in high linkage disequilibrium (LD) are removed first to thin the data. This is done using PLINK '--indep-pairwise' with parameters suggested by Weale 6 . This creates a pruned data set that contains markers with a minimal LD (which is caused by limited recombination occurring between two or more loci and results in a non-random association between the loci). Furthermore, only assayed markers are used in this step (i.e. not imputed markers). Using a pruned data set is advantages because CR methods work best when no LD is assumed between markers and it also reduces the input size and in turn greatly reduces the computation time.
PLINK '--genome' estimates relatedness of all pairs of samples and reports identify by decent (IBD, a measure of whether identical regions of two genomes were inherited from the same ancestry) in the PI_HAT (actually, proportional IBD, i.e. P(IBD=2) + 0.5*P(IBD=1)) column of the result file. A PI_HAT value close to 1 would indicate a duplicate sample. The threshold 0.1875 represents the half-way point between 2nd and 3rd degree relatives and is a common cut-off to use. Of each pair of related individuals, the one with the greater proportion of missing SNPs is dropped from the final dataset.
6. Principal component analysis (PCA) Generally, PCA transforms a data matrix (such as a GWAS n x m matrix where n in is the number of individuals and m is the number of markers and each element in the matrix represents the scaled genotype for the particular individual at that particular marker) so that the successive principal components are not correlated. The number of PCs is less than or at most equal to the original number of columns and the first PC explains the largest variance in the genotype data. Traditionally, PCA is used to (1) screen the study population for heterogeneous ethnic backgrounds and (2) to correct for potential population stratification (the difference of allele frequencies in ancestral subpopulations). It can be seen in Figure 2 where HapMap 9,10 data with individuals with known ancestry are included in the PCA, when plotting the first two PCs subpopulations cluster together. HapMap is an international project that aims to identify genetic similarities and differences between populations.
As with the cryptic relatedness step, a thinned dataset created with PLINK and starting from assayed markers only is used to calculate PCs. The SMARTPCA tool is used to calculate PCs from this thinned dataset and identify outliers for removal. The PCs can then be used for further corrections in analysis models.
Data Formats. The input GWAS data are expected to be in PLINK bfile format. The input data will have three files associated to it with .bed, .bim, and .fam file extensions. The .bed file is a binary file that contains the genotype information for all individuals (https://www. cog-genomics.org/plink2/formats#bed). The .bim file is a mapping file giving information on each marker (https://www.cog-genomics. org/plink2/formats#bim). The .fam file gives information on each individual (https://www.cog-genomics.org/plink2/formats#fam).
This pipeline utilizes information that was not provided in the original PLINK files and therefore the phenotype is always provided in an alternate phenotype file. PLINK '-pheno' is used to provide the phenotype file and PLINK '-pheno-name' is used to provide the phenotype name which also corresponds to the header of the column in the phenotype file. The first two columns in the phenotype file must have the column headers 'FID' and 'IID' respectively. 'FID' is the family ID or '0' if not used and 'IID' is the individual ID that corresponds to the 'IID' values in the .fam file (https://www.cog-genomics.org/plink2/input#pheno).

Operation
System. The pipeline was tested on STATGEN, a Dell PowerEdge R520 server with two Intel Xeon E5-2470 CPUs (32 cores at 2.3GHz), 24TB of storage in a RAID6 array with two drive fault tolerance, and 128GB of RAM. The operating system is Ubuntu Server 14.04 LTS 64-bit edition. 16. PHENO -the name of the phenotype that will be investigated during analysis. Some QC steps vary based on whether or not an individual is case, control, or undefined.
17. PFILE -the files in which the phenotype is given formatted in the proper PLINK format with FID, IID, pheno1 column headers where FID is the family ID (commonly 0 if relations not known), the individual's ID, and the phenotype value encoded as 1=unaffected (control) and 2=affected (case).

QC.py
Once the parameters and values are correctly written to the parameter file, the script is executed by calling the parameter file as a command line argument as follows, python QC.py qc_params.txt The QC steps are performed as described in the Implementation section and after each step a count is retrieved for the number of individuals or markers removed in each step. A log file is written (final_QC.log). All of the parameters from 2. PCA.py Once the parameters and values are correctly written to the parameter file, the scripts is executed by calling the parameter file as a command line argument as follows, python PCA.py pca_params.txt The data thinning is carried out in PLINK, the input file for SMARTPCA is automatically generated, and then SMART-PCA is called to calculate the PCs. If the analysis is being done in R, then the 'smartpca.evec' file can be read in and merged to exclude individuals in which PCs were not calculated (i.e. outliers). PCA.py also generates a file called 'remove.txt' in PLINK format with an FID (all marked '0') and IID column so that the outliers can be removed during PLINK analysis with the '-remove' command.

Conclusions
While the QC steps given here are not novel, this paper provides access to an automated process that both reduces human work time and chances for error and provides tools to make the computational experiment reproducible. It also gives recommended values for parameters but facilitates changing parameters and the comparison of effects.

Competing interests
No competing interests were disclosed.

Grant information
This work was supported by the National Institutes of Health (NIH) National Center for Advancing Translational Science grant KL2TR000116 and the University of Kentucky Center for Computational Sciences.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. 1.

Title and abstract:
The title and abstract are appropriate and suitably summarize the work.
The article describes a series of scripts that the authors have developed to provide Article content: quality control measures for genomewide association studies The scripts address critical questions for genomewide association study datasets such as checking the extent of heterozygosity to determine whether it is as expected, checking whether there are markers that do not follow Hardy-Weinberg equilibrium predictions, and checking whether there are pairs of closely related individuals. The authors provide a helpful series of scripts that will allow users to perform these tasks in a reproducible and automated fashion. The scripts make it easy to determine the number of samples removed for different parameter settings. These scripts are likely to be valuable for researchers in the field.
The conclusions of the manuscript are appropriate, that use of these scripts will allow for Conclusions: consistent quality control of genomewide association study datasets, which will lead to better interpretation of the available data.

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
No competing interests were disclosed. The current manuscript by Ellingson is regarding the automation of the quality control (QC) of et al., genome wide association studies (GWAS). Automation is important as it reduces human errors and increases work efficiency. In this manuscript authors have written a python script that automates the process of QC of GWAS data. The QC steps in this pipeline follow the notes developed by MikeWeale et al.
However, there are some issues with the current version of the manuscript.
There is/are already software like QCGWAS , which are automated for QC of GWAS. So how is the current version of the software different or better from the already existing softwares. So authors should try to summarize by comparing in terms of efficiency, accuracy in performance of their work with the already present softwares in this field.
Authors should provide a link to a test dataset that can be used to check if the required softwares are installed and running properly.