ALL Metrics
-
Views
-
Downloads
Get PDF
Get XML
Cite
Export
Track
Software Tool Article
Revised

CompoundHetVIP: Compound Heterozygous Variant Identification Pipeline

[version 2; peer review: 2 approved]
PUBLISHED 10 Feb 2021
Author details Author details
OPEN PEER REVIEW
REVIEWER STATUS

This article is included in the Software and Hardware Engineering gateway.

This article is included in the Bioinformatics gateway.

Abstract

Compound Heterozygous (CH) variant identification requires distinguishing maternally from paternally derived nucleotides, a process that requires numerous computational tools. Using such tools often introduces unforeseen challenges such as installation procedures that are operating-system specific, software dependencies that must be installed, and formatting requirements for input files. To overcome these challenges, we developed Compound Heterozygous Variant Identification Pipeline (CompoundHetVIP), which uses a single Docker image to encapsulate commonly used software tools for file aggregation (BCFtools or GATK4), VCF liftover (Picard Tools), joint-genotyping (GATK4), file conversion (Plink2), phasing (SHAPEIT2, Beagle, and/or Eagle2), variant normalization (vt tools), annotation (SnpEff), relational database generation (GEMINI), and identification of CH, homozygous alternate, and de novo variants in a series of 13 steps. To begin using our tool, researchers need only install the Docker engine and download the CompoundHetVIP Docker image. The tools provided in CompoundHetVIP, subject to the limitations of the underlying software, can be applied to whole-genome, whole-exome, or targeted exome sequencing data of individual samples or trios (a child and both parents), using VCF or gVCF files as initial input. Each step of the pipeline produces an analysis-ready output file that can be further evaluated. To illustrate its use, we applied CompoundHetVIP to data from a publicly available Ashkenazim trio and identified two genes with a candidate CH variant and two genes with a candidate homozygous alternate variant after filtering based on user-set thresholds for global minor allele frequency, Combined Annotation Dependent Depletion, and Gene Damage Index. While this example uses genomic data from a healthy child, we anticipate that most researchers will use CompoundHetVIP to uncover missing heritability in human diseases and other phenotypes. CompoundHetVIP is open-source software and can be found at https://github.com/dmiller903/CompoundHetVIP; this repository also provides detailed, step-by-step examples.

Keywords

Genetics, compound heterozygous, genome analysis, trio, phasing, reproducibility

Revised Amendments from Version 1

We updated the abstract to get to the point more quickly, to mention the tools used for each major step, and to clarify that MAF and CADD are used as part of variant filtering. We updated the manuscript to mention that whole-exome and targeted exome data can also be used. We changed the way we filter the variants to be more simple but also more strict. Now, CADD and MAF scores must be available for all variants, variants must be in exonic regions, and variants can have either “MED” or “HIGH” putative impact. This update to filtering did not alter the results in our example analyses. There was an error in “identify_homAlt_variants.py” that required a gene to have 2 variants in order to be written to the output file. This has been fixed, and it slightly changed the example results. Originally we had reported only 1 gene with a homozygous alternate variant in the example dataset. However, after updating the script and running it again, we identified 2 genes that have a homozygous alternate variant. The manuscript has been updated accordingly. In response to reviewer comments, we have updated the “gemini_load.py” script with an optional argument to allow users to import a file that was annotated with VEP. Since SnpEff is used as part of the pipeline, the script defaults to snpEff, but if a user chooses to use VEP outside of the pipeline, then they can still use the remaining steps of the pipeline. We have updated Figure 1 to show step numbers. We have updated the manuscript to say “global minor allele frequency” when first referenced. Clarifications about phasing were made to both the main text and example PDF. Other minor updates were made to the example PDF and manuscript for added clarity and explanation.

See the authors' detailed response to the review by Soohyun Lee
See the authors' detailed response to the review by Sushant Patil

Introduction

A compound heterozygous (CH) variant occurs when a person inherits two alleles, one from each parent, and these alleles are located at different positions within the same gene1. The compound effects of these alternate alleles may lead to phenotypic effects as seen in some cases of human disease, including ataxia telangiectasia, NGLY1 deficiency, and various types of pediatric cancer24. For example, CH variants in the mismatch repair gene, MSH6, have been identified in pediatric patients with colorectal cancer, medulloblastoma, high-grade glioma, glioblastoma, non-Hodgkin’s lymphoma, and acute lymphoblastic leukemia4. To detect CH variants in next-generation sequencing data, it is necessary to differentiate between paternally and maternally derived nucleotides1. Laboratory-based methods such as fosmid-pool-based or linked-read sequencing can be used; however, if DNA libraries are prepared and sequenced without regard to nucleotide inheritance (as is done in most sequencing projects), computational methods can help determine parental inheritance through haplotype estimation (“phasing”)57.

Available phasing tools require specific input file types (such as VCF or Plink files) and reference files which are not standardized across different phasing software. In addition, many phasing programs require that input files have been aligned to a specific reference genome, do not contain multiallelic positions, are free of repeat positions, and that each chromosome be phased separately810. Figuring out how to prepare files for phasing can be challenging as passing files from program to program may result in unforeseen incompatibilities. Additionally, installing some programs can be challenging because of operating-system specific installation processes and software dependencies.

We have designed Compound Heterozygous Variant Identification Pipeline (CompoundHetVIP) to help researchers overcome these time-consuming challenges when identifying CH variants. CompoundHetVIP encapsulates specific versions of existing tools, required software dependencies, and custom Python scripts into a cohesive computational environment packaged as a Docker image11. Accordingly, researchers need only install the Docker software and download the CompoundHetVIP Docker image to begin performing CH, homozygous alternate, and de novo analyses at the command line. Furthermore, because the source code for CompoundHetVIP is publicly available, other researchers will be able to reproduce the analyses and investigate the specific methodologies used.

Methods

Implementation

The functionality of CompoundHetVIP is divided into a series of 13 steps (Figure 1). For each step, a Python script is executed within a Docker container. These scripts provide logic for processing data files and invoking third-party tools. By breaking the pipeline into 13 steps, users have flexibility to perform the steps that are most relevant to their analysis. For example, researchers can use input data for an affected individual only or for a trio (an affected individual and both parents). If parental data are unavailable and the variant positions within the VCF file correspond to genome build GRCh37, users may skip the first three steps. A detailed, step-by-step guide is available on GitHub and as Extended data12.

8e12811a-8508-4b2e-9d49-ff53d897fae8_figure1.gif

Figure 1. Flow diagram of CompoundHetVIP functionality.

Workflow

For step 1, the inputs can be either Variant Call Format (VCF)13 or gVCF14 files that were generated from whole-genome, whole-exome, or targeted exome sequencing data. VCF files contain variant sites only, whereas gVCF files include non-variant sites, too. For each parent of a trio being evaluated, our script retains nucleotide positions that are in common with the child. When gVCF files are used (whether for trios or individuals), our script removes all non-variant sites for the child (but retains these for the parents to support determination of CH status). When applied to trio data, some phasing tools, such as SHAPEIT28, require a single input file for each trio. Therefore, in step 2 (used only when working with trios), we combine the variant files for each member of a trio into a single VCF file using either BCFtools (VCF input files)15 or GATK4 (gVCF input files)14. If GATK4 is used, joint-genotyping is also performed on the trio VCF.

The remaining steps can be applied either to trios or individuals. Some phasing and annotation programs require that data be aligned to genome build GRCh37; thus, we use this reference genome as our standard. For variant files that have been aligned to genome build GRCh38, step 3 uses Picard Tools16 to convert the data to GRCh37 positions using a lift-over process. During lift-over, some sites may be present in GRCh38, but their exact position in GRCh37 is unknown. To avoid ambiguity, these sites are removed during step 4. This step also removes positions that are multiallelic or duplicated to maintain compatibility with programs such as Plink217,18 and SHAPEIT2 (used in steps 5 and 6, respectively). For trio VCF files, sites that contain missing genotype information (i.e. “./.”) for both parents are removed to improve phasing accuracy.

CompoundHetVIP can perform phasing using SHAPEIT2, Eagle210, and/or Beagle9. Each of these programs requires that each chromosome be phased independently. Additionally, when using SHAPEIT2, it is recommended that PLINK files (.bed, .bim, .fam) be used as input for phasing. Therefore, step 5 divides a VCF file by chromosome into multiple files and creates the necessary PLINK files for each chromosome (when SHAPEIT2 is used for phasing).

Step 6 phases the variants in each chromosome using default parameters for the phasing program chosen by the user. We recommend using SHAPEIT2 because it can applied either to trios or individuals. When parents’ genotypes are available, this program uses Mendelian logic for phasing and a population-based haplotyped reference panel when the phase of the child cannot be determined from Mendelian logic alone (i.e. both parents and child are heterozygous). In addition, if a parent is missing genotype information at a position, SHAPEIT2 imputes the missing information. All supported phasing programs integrate the 1000 Genomes Project phase 3 haplotype reference panel19 and do not require sequence alignment files (.bam), such as those required by read-based programs20,21. In some scenarios, SHAPEIT2 switches the REF and ALT alleles. Therefore, step 7 ensures that the REF/ALT alleles of the phased VCF files are congruent with those of the reference genome. Also, sites with Mendelian errors are removed.

To make subsequent analysis of the phased files easier, step 8 concatenates all phased chromosomes into a single file. If a user is analyzing multiple trios (or individuals), this script can also merge the data for these trios (or individuals) into a single VCF file.

Step 9 normalizes VCF files as recommended by GEMINI22 (used in step 11). Normalization involves left-alignment and trimming of variants23. This process helps ensure that variants are represented at their left-most position, with as few nucleotides as possible, and unambiguously. This step uses vt tools23. In step 10, SnpEf24 provides information about the effects of variants on function for known genes. Then, in step 11, GEMINI22 loads the annotated VCF into a relational database (GEMINI can also load files annotated with Variant Effect Predictor (VEP)25, although VEP is not available as part of our pipeline). Step 12 uses a custom Python script to extract CH variant data from the database. Our provided script identifies CH variants and filters the data based on user-set thresholds for global minor allele frequency (MAF) and Combined Annotation Dependent Depletion (CADD) scores26. Variants with a MAF less than or equal to the user-set threshold, CADD score greater than or equal to the user-set threshold, exonic classification, and “HIGH” or “MED” putative impact severity are included in the final output. We consider the variants in the final output as candidates for further evaluation. For step 12, we provide two additional scripts that identify homozygous alternate variants and de novo variants using the same user-set thresholds as those described above.

Finally, in step 13, we add Gene Damage Index (GDI) scores27 and gene-length information to the output files. GDI scores quantify accumulated mutational damage in healthy populations as a way to predict whether genes are likely to have disease-causing variants. Genes of longer length (e.g. TTN, MUC5B) tend to have more total damage but typically less disease-causing damage than shorter genes.

Operation

Because CompoundHetVIP executes all scripts within a Docker container, it can be executed on all major operating systems that are commonly used for scientific computing. Depending on input file sizes, the hardware needed to execute CompoundHetVIP will vary from user to user. Certain tasks, such as phasing (step 6), can be memory intensive. A minimum of 40 GB memory is recommended. When creating a relational database with GEMINI (step 11), there is no minimum processing core recommendation, but multiprocessing can significantly speed up the time it takes to load the database. Users can specify how many processing cores GEMINI can use when executing step 11.

Results

We applied CompoundHetVIP to high-confidence, VCF data that were generated with whole-genome sequencing data from an Ashkenazim trio available through the Genome in a Bottle Consortium28. During step 6, we used SHAPEIT2 to phase the data. In the child of this trio, we identified a CH variant in two genes (FLNB and TTN) using a MAF threshold of 0.01 and a CADD score threshold of 15. Genes with a GDI score less than or equal to 13.84 are classified as being more likely to have disease-causing damage from variants27. FLNB (6.2) was lower than this threshold but TTN (42.9) was not. FLNB has an important role in cytoskeleton development and variations in this gene have been associated with many skeletal disorders29,30.

In addition, we identified two homozygous alternate variants: one in TBC1D2 and the other in TOX2, using the same MAF and CADD thresholds that we used for CH variant identification. TBC1D2 and TOX2 had GDI scores of 9.7 and 4.4, respectively. TBC1D2 codes for a GTPase-activating protein and is involved in E-cadherin degradation31. The role of this gene and how it may relate to human disease is not yet fully understood. TOX2 is a transcription factor that helps drive the development of T follicular helper (Tfh) cells32. Tfh cells are an important part of humoral immunity.

Using the same MAF and CADD thresholds described above, we sought to identify de novo variants in this trio. However, none passed these thresholds.

Conclusion

CompoundHetVIP provides the necessary tools for CH variant identification using VCF or gVCF files as initial input and is executed within a Docker container, which allows for cross-platform compatibility and reproducibility. CompoundHetVIP involves 13 steps (Figure 1) that include a breadth of tasks such as file aggregation, VCF liftover, joint-genotyping, file conversion, phasing, variant normalization, annotating, and variant identification. Our results highlight that potentially damaging CH and homozygous alternate variants are observed in seemingly healthy individuals. However, we anticipate that most researchers will use CompoundHetVIP to identify variants in individuals with a known disease.

Data availability

Extended data

Zenodo: dmiller903/CompoundHetVIP: CompoundHetVIP - v1.1. https://doi.org/10.5281/zenodo.447768612.

This project contains the following extended data:

Software availability

Software available from: https://hub.docker.com/r/dmill903/compound-het-vip

Source code available from: https://github.com/dmiller903/CompoundHetVIP

Archived source code at time of publication: https://doi.org/10.5281/zenodo.447768612.

License: MIT

Comments on this article Comments (0)

Version 2
VERSION 2 PUBLISHED 08 Oct 2020
Comment
Author details Author details
Competing interests
Grant information
Copyright
Download
 
Export To
metrics
Views Downloads
F1000Research - -
PubMed Central
Data from PMC are received and updated monthly.
- -
Citations
CITE
how to cite this article
Miller DB and Piccolo SR. CompoundHetVIP: Compound Heterozygous Variant Identification Pipeline [version 2; peer review: 2 approved]. F1000Research 2021, 9:1211 (https://doi.org/10.12688/f1000research.26848.2)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.
track
receive updates on this article
Track an article to receive email alerts on any updates to this article.

Open Peer Review

Current Reviewer Status: ?
Key to Reviewer Statuses VIEW
ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions
Version 2
VERSION 2
PUBLISHED 10 Feb 2021
Revised
Views
3
Cite
Reviewer Report 24 Feb 2021
Soohyun Lee, Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA 
Approved
VIEWS 3
I thank the authors for addressing all my ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Lee S. Reviewer Report For: CompoundHetVIP: Compound Heterozygous Variant Identification Pipeline [version 2; peer review: 2 approved]. F1000Research 2021, 9:1211 (https://doi.org/10.5256/f1000research.54440.r79307)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Version 1
VERSION 1
PUBLISHED 08 Oct 2020
Views
17
Cite
Reviewer Report 25 Jan 2021
Sushant Patil, Department of Pathology and Laboratory Medicine, University of North Carolina, Chapel Hill, North Carolina, USA 
Approved
VIEWS 17
Thank you for the opportunity to review the manuscript "CompoundHetVIP: Compound Heterozygous Variant Identification Pipeline" by Miller et al.

I find the published software to be of great utility, even though it may not have scientific novelty ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Patil S. Reviewer Report For: CompoundHetVIP: Compound Heterozygous Variant Identification Pipeline [version 2; peer review: 2 approved]. F1000Research 2021, 9:1211 (https://doi.org/10.5256/f1000research.29647.r76628)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 10 Feb 2021
    Dustin Miller, Department of Biology, Brigham Young University, Provo, 84602, USA
    10 Feb 2021
    Author Response
    Thank you for taking the time to review our manuscript. We appreciate your thorough and helpful comments. We have addressed these comments below in a point-by-point response. These changes have ... Continue reading
COMMENTS ON THIS REPORT
  • Author Response 10 Feb 2021
    Dustin Miller, Department of Biology, Brigham Young University, Provo, 84602, USA
    10 Feb 2021
    Author Response
    Thank you for taking the time to review our manuscript. We appreciate your thorough and helpful comments. We have addressed these comments below in a point-by-point response. These changes have ... Continue reading
Views
19
Cite
Reviewer Report 20 Jan 2021
Soohyun Lee, Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA 
Approved with Reservations
VIEWS 19
The paper describes a software pipeline that identifies compound heterozygous (CH) variants from the VCF or gVCF files of whole genome sequencing data from either a single sample or a trio. The pipeline has 13 steps and in brief, performs ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Lee S. Reviewer Report For: CompoundHetVIP: Compound Heterozygous Variant Identification Pipeline [version 2; peer review: 2 approved]. F1000Research 2021, 9:1211 (https://doi.org/10.5256/f1000research.29647.r76619)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 10 Feb 2021
    Dustin Miller, Department of Biology, Brigham Young University, Provo, 84602, USA
    10 Feb 2021
    Author Response
    Thank you for taking the time to review our manuscript. We appreciate your thorough and helpful comments. We have addressed these comments below in a point-by-point response. These changes have ... Continue reading
COMMENTS ON THIS REPORT
  • Author Response 10 Feb 2021
    Dustin Miller, Department of Biology, Brigham Young University, Provo, 84602, USA
    10 Feb 2021
    Author Response
    Thank you for taking the time to review our manuscript. We appreciate your thorough and helpful comments. We have addressed these comments below in a point-by-point response. These changes have ... Continue reading

Comments on this article Comments (0)

Version 2
VERSION 2 PUBLISHED 08 Oct 2020
Comment
Alongside their report, reviewers assign a status to the article:
Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations - A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions
Sign In
If you've forgotten your password, please enter your email address below and we'll send you instructions on how to reset your password.

The email address should be the one you originally registered with F1000.

Email address not valid, please try again

You registered with F1000 via Google, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Google account password, please click here.

You registered with F1000 via Facebook, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Facebook account password, please click here.

Code not correct, please try again
Email us for further assistance.
Server error, please try again.