snpQT: flexible, reproducible, and comprehensive quality control and imputation of genomic data

Quality control of genomic data is an essential but complicated multi-step procedure, often requiring separate installation of, and expert familiarity with, a combination of different bioinformatics tools. Software incompatibilities and inconsistencies across computing environments are recurrent challenges, leading to poor reproducibility. Existing semi-automated or automated solutions lack comprehensive quality checks, flexible workflow architecture, and user control. To address these challenges, we have developed snpQT: a scalable, stand-alone software pipeline using nextflow and BioContainers, for comprehensive, reproducible and interactive quality control of human genomic data. snpQT offers some 36 discrete quality filters or correction steps in a complete standardised pipeline, producing graphical reports to demonstrate the state of data before and after each quality control procedure. This includes human genome build conversion, population stratification against data from the 1000 Genomes Project, automated population outlier removal, and built-in imputation with its own pre- and post-imputation quality controls. Common input formats are used, and a synthetic dataset and comprehensive online tutorial are provided for testing, educational purposes, and demonstration. The snpQT pipeline is designed to run with minimal user input and coding experience; quality control steps are implemented with numerous user-modifiable thresholds, and workflows can be flexibly combined. snpQT is open source and freely available at https://github.com/nebfield/snpQT. A comprehensive online tutorial and installation guide covering every step through to GWAS is provided (https://snpqt.readthedocs.io/en/latest/), introducing snpQT using a synthetic demonstration dataset and a real-world Amyotrophic Lateral Sclerosis SNP-array dataset.


Introduction
Assuring high quality of genomic data is necessarily a complex multi-step procedure, but it is critical for generating reproducible and reliable results in genome-wide association studies (GWAS). Multiple challenges are encountered in carrying out quality control (QC) [1]. Although there are well-established steps and good practices [2,3], there is no standardised and universally followed workflow, which contributes to low reproducibility of results. Existing approaches, including semi-automated tools [4], can involve time-consuming trial and error, requiring the analyst to check the distributions of parameters in plots produced over many rounds of adjustments, and to enter commands manually for a long list of QC steps, either one by one or in a series of shell scripts. The analyst may also encounter incompatibility problems and installation difficulties. Software architecture tools such as nextflow and BioContainers can address these issues and have been proposed as automated solutions [5]. However, the existing solutions offer a narrow and relatively rigid QC analysis: they lack steps such as imputation, provide little choice of thresholds and plot outputs, and require users to have extensive knowledge of the software in order to tailor their analysis.

Methods
snpQT was developed as a set of nine core workflow components implemented with the nextflow workflow management system [6]. Each workflow component consists of independent containerised modules, using BioContainers curated by the bioinformatics community wherever possible [7]. Nextflow allows snpQT to be easily scaled from a laptop to a high-performance computing (HPC) or cloud environment, and enables caching at continuous checkpoints, so users can alter thresholds without needing to rerun earlier parts of the analysis.
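The benefit of checkpoint caching can be illustrated with a minimal Python sketch. It is not how nextflow or snpQT is implemented; it only shows the general idea of keying each step on its parameters and on the key of the step before it, so that changing a later threshold leaves earlier results reusable. All names here (`task_key`, `run_pipeline`, `drop_above`, the step labels and the threshold values) are hypothetical.

```python
import hashlib
import json


def task_key(name, params, upstream_key):
    """Derive a cache key from a step's name, its parameters and the upstream key."""
    payload = json.dumps(
        {"name": name, "params": params, "upstream": upstream_key},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


def run_pipeline(initial, steps, cache):
    """Run a linear chain of steps, reusing cached results where the key matches.

    steps is a list of (name, params, func) tuples; func(value, **params)
    returns the value passed to the next step.
    Returns (final_value, names_of_steps_actually_executed).
    """
    executed = []
    value = initial
    key = hashlib.sha256(json.dumps(initial).encode()).hexdigest()
    for name, params, func in steps:
        key = task_key(name, params, key)
        if key not in cache:
            # Cache miss: the step or something upstream of it has changed.
            cache[key] = func(value, **params)
            executed.append(name)
        value = cache[key]
    return value, executed


def drop_above(values, threshold):
    """Toy filter standing in for a QC step: keep values at or below threshold."""
    return [v for v in values if v <= threshold]


if __name__ == "__main__":
    cache = {}
    data = [0.001, 0.03, 0.015, 0.2]
    first = [("step_1", {"threshold": 0.1}, drop_above),
             ("step_2", {"threshold": 0.02}, drop_above)]
    print(run_pipeline(data, first, cache))
    # Only the second threshold changes, so only step_2 is executed again.
    second = [first[0], ("step_2", {"threshold": 0.01}, drop_above)]
    print(run_pipeline(data, second, cache))
```

Because each key depends on the upstream key, a change to an early step would invalidate every step after it, whereas a change to a late step leaves the earlier cached results untouched.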
All nine workflows are illustrated in Figure 1. Workflow A runs only once, performing a local database setup, downloading and preparing reference files [8,9], and setting up specific versions of tools using conda or docker. snpQT processes data in human genome build 37, but Workflow B allows the user to convert from build 38 to 37 or vice versa. Workflow C performs sample QC, including checks for missing call rate, sex discrepancies, heterozygosity, cryptic relatedness, and missing phenotypes. Workflow D performs population stratification for the automatic removal of samples that are predicted to be ethnic outliers (using EIGENSOFT [10]). Workflow E performs the main variant QC, checking missing call rate, Hardy-Weinberg equilibrium deviation, minor allele frequency, and missingness in case/control status. It also generates covariates for GWAS, based on a user-modifiable number of principal components (alternatively, users may provide a covariates file). Workflow F performs pre-imputation quality control, Workflow G performs local phasing and imputation using shapeit4 [11] and impute5 [12], and Workflow H performs post-imputation QC. The workflow structure also allows users to upload their data to an external imputation server, or to use a different reference panel. Workflow I performs GWAS, outputting summary statistics along with a Manhattan plot and a QQ-plot. Detailed summary logs and graphs are provided throughout, depicting the total number of samples and variants at each step, and directing users to the locations of intermediate files and logs.
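To make the variant-level checks in Workflow E concrete, the following Python sketch computes the call rate, minor allele frequency and a Hardy-Weinberg equilibrium p-value for a single biallelic variant. It is an illustration only and not snpQT's implementation: the function names and the default thresholds are hypothetical, and the Hardy-Weinberg test is a simple one-degree-of-freedom chi-square goodness-of-fit test, chosen for brevity rather than because the pipeline uses it.

```python
import math


def variant_stats(genotypes):
    """Summarise one biallelic variant.

    genotypes holds one entry per sample: 0, 1 or 2 copies of the alternate
    allele, or None when the genotype call is missing.
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    call_rate = n / len(genotypes) if genotypes else 0.0
    if n == 0:
        return {"call_rate": call_rate, "maf": 0.0, "hwe_p": 1.0}

    n_ref_hom = called.count(0)
    n_het = called.count(1)
    n_alt_hom = called.count(2)

    p = (2 * n_ref_hom + n_het) / (2 * n)  # reference allele frequency
    q = 1.0 - p                            # alternate allele frequency
    maf = min(p, q)

    # Chi-square goodness of fit against Hardy-Weinberg expected counts.
    observed = [n_ref_hom, n_het, n_alt_hom]
    expected = [p * p * n, 2 * p * q * n, q * q * n]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    hwe_p = math.erfc(math.sqrt(chi2 / 2))

    return {"call_rate": call_rate, "maf": maf, "hwe_p": hwe_p}


def passes_variant_qc(genotypes, min_call_rate=0.98, min_maf=0.05, min_hwe_p=1e-7):
    """Apply three illustrative filters; the default thresholds are assumptions."""
    stats = variant_stats(genotypes)
    return (
        stats["call_rate"] >= min_call_rate
        and stats["maf"] >= min_maf
        and stats["hwe_p"] >= min_hwe_p
    )


if __name__ == "__main__":
    balanced = [0] * 25 + [1] * 50 + [2] * 25
    print(variant_stats(balanced), passes_variant_qc(balanced))
```

Exposing the thresholds as keyword arguments mirrors the idea of user-modifiable thresholds described above: the same variant can pass or fail depending on the values the analyst chooses.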
We demonstrate snpQT using a synthetic dataset, which is available with the tool, and an Amyotrophic Lateral Sclerosis SNP-array dataset of 2,000 samples (1,000 cases and 1,000 controls) taken from a restricted-access dbGaP project [16]. Both demonstrations are documented at https://snpqt.readthedocs.io/en/latest/.

Conclusion
snpQT offers robust QC combined with scalability, reproducibility, flexibility and a user-friendly design that can appeal to a broad spectrum of users. It is stand-alone software that requires neither additional coding nor the manual installation or download of any data or program other than nextflow and either conda or docker. The input is a VCF file and/or binary PLINK files, both of which are widely used formats. For users who have limited experience with QC analysis, a thorough "how-to" guide and step-by-step tutorials are provided, using the demonstration dataset that is available with the tool.

Figure 1: Outline of the snpQT architecture, which includes nine core workflows (A-I) that are implemented using nextflow. Each workflow expects specific inputs, either from the user or from the outputs generated by other workflows. Main tools and key processes (modules) are highlighted in green. Examples of different task combinations are shown in the upper-right corner, illustrating the flexibility and interactivity among the implemented workflows. VCF: Variant Call Format; QC: Quality Control; PCA: Principal Component Analysis; bfiles: binary PLINK files.