Software Tool Article

SNPnotes: high-throughput tissue-specific functional annotation of single nucleotide variants

[version 1; peer review: 2 approved with reservations]
* Equal contributors
PUBLISHED 22 Oct 2019

This article is included in the Research to the People collection.

Abstract

A key challenge in the application of whole-genome sequencing (WGS) for clinical diagnostics and research is the high-throughput prioritization of functional variants in the non-coding genome. This challenge is compounded by context-specific genetic modulation of gene expression: variant-gene mapping depends on the tissues and organ systems affected in a given disease. For instance, a disease affecting the gastrointestinal system would use maps specific to genome regulation in gut-related tissues. While there are large-scale atlases of genome regulation, such as GTEx and NIH Roadmap Epigenomics, the clinical genetics community lacks publicly available stand-alone software for high-throughput annotation of custom variant data with user-defined tissue-specific epigenetic maps and clinical genetic databases, to prioritize variants for a specific biomedical application. In this work, we provide a simple software pipeline, called SNPnotes, which takes as input variant calls for a patient and prioritizes them using information on clinical relevance from ClinVar, tissue-specific gene regulation from GTEx and disease associations from the NHGRI-EBI GWAS catalogue. This pipeline was developed as part of SVAI Research's "Undiagnosed-1" event for collaborative patient diagnosis. We applied this pipeline to WGS-based variant calls for an individual with a history of gastrointestinal symptoms, using 12 gut-specific eQTL maps and GWAS associations for metabolic diseases, for variant-gene mapping. Out of 6,248,584 SNPs, the pipeline identified 151 high-priority variants, overlapping 129 genes. These top SNPs all have known clinical pathogenicity, modulate gene expression in gut tissues and have genetic associations with metabolic disorders, and serve as starting points for hypotheses about mechanisms driving clinical symptoms. Simple software changes can be made to customize the pipeline for other tissue-specific applications. Future extensions could integrate maps of tissue-specific regulatory elements, higher-order chromatin loops, and mutations affecting splice variants.

Keywords

bioinformatics, genomics, genetics, GWAS, variant annotation, SNPs, software, clinical genetics, epigenetics, epigenomics

Introduction

Genome sequencing has become an invaluable tool for clinical diagnostics. Several methodologies exist to examine the human genome, each with its own benefits and pitfalls. Whole exome sequencing has been a cost-effective technique to identify structural and nucleotide variants in protein-coding regions of the genome, and has identified variants associated with a number of diseases, including psoriasis [1], Factor V Leiden thrombophilia [2], and Miller Syndrome [3]. Genome-wide association studies, originally based on SNP microarrays, have found that roughly half of genetic associations with disease are located outside gene bodies; this fraction approaches 90% with the inclusion of intronic regions [4]. With dropping costs for DNA sequencing, whole genome sequencing (WGS) promises to extend the ability to identify disease-associated single-nucleotide and structural variants in the clinic to non-coding regions of the genome, including gene and chromatin regulatory sequences. However, the increased size and complexity of the data creates a parallel challenge of annotating non-coding variants, as it requires knowledge of tissue-specific gene regulation (or epigenetics), including regulatory elements such as promoters and enhancers, and features of higher-order chromatin organization such as topologically associating domains.

Purpose and approach

Several large-scale epigenomics projects have catalogued tissue-specific regulatory elements. These include genetic modulation of gene expression (GTEx project) [5], chromatin state (Roadmap Epigenomics) [6], and enhancer-promoter loops for mapping distal regulatory elements to genes (FANTOM and individual studies) [7,8]. A computational workflow that integrates these functional annotation maps to annotate variants from WGS assays would be a valuable resource for prioritizing variants with potential functional impact in tissues of interest. Popular high-throughput variant annotation tools, such as BioMart and Variant Effect Predictor [9,10], do not provide tissue-specific annotation. While FUMA [11] integrates comprehensive epigenetic annotation, it is a web-based service used to annotate top-ranking variants from GWAS, rather than a standalone tool for variant annotation.

In this work, we describe and provide variant annotation software that starts with output from a WGS assay and prioritizes variants based on epigenomic resources described above, as well as clinical genetic and GWAS catalogs of variant-disease association. This tool will allow users to capture functional and clinical information and to analyze variants simply by providing the commonly-used VCF file format. We demonstrate the software's functionality by prioritizing variants from WGS data for a single patient.

Methods

This work was undertaken as part of SVAI Undiagnosed-1 (https://sv.ai/undiagnosed-1), a collaborative event with the goal of diagnosing a patient with an unknown genetic condition. Participant groups were provided with the detailed medical history, genotyping and metabolic data of a 33-year-old Caucasian male patient, JCM. The event was hosted in June 2019 by the not-for-profit organization SVAI (http://sv.ai), with participants located in the San Francisco Bay Area (USA) as well as in Toronto, Canada.

Pipeline

Figure 1 shows the workflow for the pipeline. Patient genotypes are provided in Variant Call Format (VCF) as input. Conceptually, the tool compiles prior knowledge about the functional significance of variants from four perspectives: tissue-specific regulatory information, known genetic disease associations in the literature, tissue-specific genetic modulation of gene expression, and clinical pathogenicity. The annotation sources are integrated with the genotype calls in the VCF file, resulting in a single output table with the available annotation for each variant. The user can then prioritize variants based on the combination of known or predicted functional consequences.


Figure 1. Workflow.
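
The integration step in this workflow can be pictured as a left join of each annotation source onto the variant list. The following Python sketch is for illustration only and is not the SNPnotes code (which is implemented in bash and R); the function name, column names and toy coordinates are all hypothetical.

```python
def merge_annotations(variants, sources):
    """Left-join annotation sources onto a list of variants.

    variants: iterable of (chrom, pos) tuples, e.g. parsed from a VCF.
    sources:  dict mapping an output column name to a lookup dict
              {(chrom, pos): annotation value}.
    Returns one row (dict) per variant; a missing annotation becomes "NA".
    """
    rows = []
    for chrom, pos in variants:
        row = {"chrom": chrom, "pos": pos}
        for column, lookup in sources.items():
            row[column] = lookup.get((chrom, pos), "NA")
        rows.append(row)
    return rows


# Toy inputs (hypothetical values, not patient data)
variants = [("1", 100), ("2", 50), ("3", 75)]
sources = {
    "clinvar": {("1", 100): "toy_significance"},
    "gwas_trait": {("1", 100): "toy_trait_a", ("2", 50): "toy_trait_b"},
}
table = merge_annotations(variants, sources)
for row in table:
    print(row)
```

Every input variant is kept in the output, so that variants without any annotation remain visible to the user and can be filtered later.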

Tissue-specific regulatory regions

The NIH Roadmap Epigenomics project performed comprehensive mapping of noncoding DNA in 111 epigenomes to identify putative regulatory elements in diverse human tissues and cell types (http://www.roadmapepigenomics.org/) [6]. A variant that overlaps a tissue-specific promoter or enhancer element could alter the regulation of a tissue-specific gene, i.e. it may have a regulatory impact on gene expression. Tissue-specific 15-state chromatin state models were downloaded from the Roadmap Epigenomics Portal (https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/all.mnemonics.bedFiles.tgz). As the symptoms of John M, hereafter referred to as “the Patient”, included impaired function of the gastrointestinal tract, we limited our annotation to tissues and organs of the digestive system (e.g. esophagus, stomach, intestine). Chromatin states for digestive system tissues were used, including fetal stomach, small and large intestine, sigmoid colon, colonic mucosa, mucosa from the stomach, duodenum and rectum, and esophagus (Table 1). Regions with open chromatin (states 1–7) were included for variant annotation.

Table 1. Samples from the NIH Roadmap Epigenomics project used for gut-specific regulatory region annotation.

Sample number | Tissue name
E075 | Colonic mucosa
E077 | Duodenum Mucosa
E079 | Esophagus
E084 | Fetal large intestine
E085 | Fetal small intestine
E092 | Fetal stomach
E094 | Gastric
E101 | Rectal Mucosa Donor 29
E102 | Rectal Mucosa Donor 31
E106 | Sigmoid colon
E109 | Small intestine
E110 | Stomach Mucosa
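
The chromatin-state filter described in this section can be sketched as follows. This is an illustration rather than the pipeline's own code. It assumes that each line of a chromatin-state BED file has four tab-separated columns (chromosome, start, end, state) and that state labels begin with the state number, as in "7_Enh"; it also reads the cutoff given in the text as states 1 through 7. The function name and toy intervals are hypothetical.

```python
def open_chromatin_regions(bed_lines, max_state=7):
    """Keep intervals whose chromatin state number is <= max_state.

    Assumes four tab-separated columns per line (chrom, start, end, state)
    and a state label that starts with the state number, e.g. "7_Enh".
    """
    regions = []
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and any header-like lines
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, state = line.split("\t")[:4]
        state_number = int(state.split("_", 1)[0])
        if state_number <= max_state:
            regions.append((chrom, int(start), int(end), state))
    return regions


# Toy intervals (hypothetical coordinates)
example = [
    "chr1\t0\t1000\t15_Quies",
    "chr1\t1000\t2000\t9_Het",
    "chr1\t2000\t3000\t1_TssA",
    "chr1\t3000\t4000\t7_Enh",
]
kept = open_chromatin_regions(example)
print(kept)
```

In practice this filter would be applied once per tissue in Table 1, with the sample number recorded alongside each retained interval.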

Genetic modulation of gene expression in the gut

The GTEx project identified variants that significantly modulate gene expression in each of 44 human tissues [5]. Variants that modulate transcription in gut tissues (gut eQTLs) were downloaded from the GTEx portal (https://storage.googleapis.com/gtex_analysis_v7/single_tissue_eqtl_data/GTEx_Analysis_v7_eQTL.tar.gz; Table 2).

Table 2. Samples from the GTEx dataset used to obtain significant gut expression QTLs (eQTLs).

File names
Colon_Sigmoid.v7.signif_variant_gene_pairs.txt.gz
Colon_Transverse.v7.signif_variant_gene_pairs.txt.gz
Esophagus_Muscularis.v7.signif_variant_gene_pairs.txt.gz
Esophagus_Mucosa.v7.signif_variant_gene_pairs.txt.gz
Small_Intestine_Terminal_Ileum.v7.signif_variant_gene_pairs.txt.gz
Stomach.v7.signif_variant_gene_pairs.txt.gz
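
To join eQTL records to variant calls by location, each record needs explicit coordinates. The sketch below shows one way to do this, under two assumptions that should be checked against the downloaded files: that each file is tab-delimited with "variant_id" and "gene_id" columns, and that variant identifiers follow a chrom_pos_ref_alt_build layout. The function name and the toy records are hypothetical.

```python
import csv
import io


def parse_eqtl_pairs(handle, tissue):
    """Yield one dict per variant-gene pair, with coordinates split out.

    Assumes a tab-delimited file with 'variant_id' and 'gene_id' columns and
    variant identifiers laid out as chrom_pos_ref_alt_build.
    """
    for record in csv.DictReader(handle, delimiter="\t"):
        chrom, pos, ref, alt, _build = record["variant_id"].split("_")
        yield {
            "chrom": chrom,
            "pos": int(pos),
            "ref": ref,
            "alt": alt,
            "gene_id": record["gene_id"],
            "tissue": tissue,
        }


# Toy file contents (hypothetical records, not real eQTLs)
toy_file = io.StringIO(
    "variant_id\tgene_id\tpval_nominal\n"
    "1_1000_A_G_b37\tGENE_A\t1e-12\n"
    "X_2500_C_CT_b37\tGENE_B\t3e-9\n"
)
pairs = list(parse_eqtl_pairs(toy_file, tissue="Stomach"))
for pair in pairs:
    print(pair)
```

A compressed file could be read the same way by passing a handle from gzip.open(path, "rt") instead of the in-memory toy file.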

Variant disease associations

Genome-wide SNP-disease associations were downloaded from the NHGRI-EBI GWAS catalog [12], using the TargetValidation.org API [13]; only those associations mapped to metabolic disorders were included (EFO:0000589). In addition, information on known clinical pathogenicity was downloaded from the ClinVar database [14] (ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz).

Annotation file preparation

The pipeline software was implemented in bash and R 3.4.4. VCFtools v0.1.15 [15] was used to filter SNPs from the VCF input file. SNP locations were downloaded from the dbSNP 151 database (ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/common_all_20180423.vcf.gz). All coordinates are in the GRCh37/hg19 build. SNP coordinates were converted to BED format using awk. dbSNP identifiers were converted to genomic coordinates using the dbSNP 151 reference (see above). Bedtools v2.28.0 [16] was used to identify overlap of SNP coordinates with individual annotation sources. Finally, R merge was used to join tables by variant location.
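
The coordinate handling behind these steps is easy to get wrong, because VCF positions are 1-based whereas BED intervals are 0-based and half-open. The Python sketch below illustrates the conversion and the overlap test on toy data. It is not the pipeline's implementation, which uses VCFtools, awk and Bedtools; the function names and records are hypothetical.

```python
BASES = {"A", "C", "G", "T"}


def vcf_snps_to_bed(vcf_lines):
    """Convert single-nucleotide VCF records to BED-style tuples.

    A VCF POS is 1-based, so a SNP at POS p becomes the BED interval [p-1, p).
    Records whose REF or ALT alleles are not single bases are skipped.
    """
    bed = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # meta-information and header lines
        chrom, pos, variant_id, ref, alt = line.rstrip("\n").split("\t")[:5]
        if ref not in BASES or any(a not in BASES for a in alt.split(",")):
            continue
        bed.append((chrom, int(pos) - 1, int(pos), variant_id))
    return bed


def overlapping_regions(snp, regions):
    """Return the regions that share at least one base with the SNP interval."""
    chrom, start, end, _ = snp
    return [r for r in regions if r[0] == chrom and start < r[2] and r[1] < end]


# Toy records (hypothetical)
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\ttoy_snp1\tA\tG\t.\tPASS\t.",
    "1\t200\ttoy_indel\tAT\tA\t.\tPASS\t.",
    "2\t50\ttoy_snp2\tC\tT,G\t.\tPASS\t.",
]
toy_regions = [("1", 90, 100, "7_Enh"), ("1", 100, 200, "1_TssA")]

snps = vcf_snps_to_bed(toy_vcf)
hits = {snp[3]: overlapping_regions(snp, toy_regions) for snp in snps}
print(snps)
print(hits)
```

Here the SNP at position 100 falls in the interval ending at 100 but not in the one starting at 100, which is the behaviour expected of half-open intervals. Chromosome naming (for example "1" versus "chr1") may differ between annotation sources and would need to be harmonized before any overlap is computed.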

Output format

The output file ("final_table.txt") contains SNP coordinates, positionally overlapping genes, associated clinical significance from ClinVar, associated GWAS trait and p-value, coordinates and state name of overlapping open chromatin states in gut tissues, and the tissue and gene names for significant eQTL associations. The pipeline then filters this file to report only those SNPs with GWAS hits that achieve genome-wide significance (p < 5×10⁻⁸); this file is titled "GWASsignficant.txt". It also creates a third file with the list of unique genes that meet this criterion ("GWASsignificant_genes.unique.txt"). The user may filter these data further to identify SNPs with known clinical pathogenicity, as well as SNPs in functionally annotated non-coding regions.
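
The significance filter and the unique-gene list can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline's R code: the column names "gene" and "gwas_pvalue" are hypothetical, genes are assumed to be comma-separated within a row, and missing values are assumed to be written as "NA".

```python
GENOME_WIDE_THRESHOLD = 5e-8


def filter_genome_wide(rows, threshold=GENOME_WIDE_THRESHOLD):
    """Keep rows whose GWAS p-value is below the threshold.

    Rows with a missing or non-numeric p-value (e.g. "NA") are dropped.
    """
    kept = []
    for row in rows:
        try:
            p_value = float(row["gwas_pvalue"])
        except (KeyError, TypeError, ValueError):
            continue
        if p_value < threshold:
            kept.append(row)
    return kept


def unique_genes(rows):
    """Return the sorted set of gene names across rows."""
    genes = set()
    for row in rows:
        for gene in row.get("gene", "").split(","):
            gene = gene.strip()
            if gene and gene != "NA":
                genes.add(gene)
    return sorted(genes)


# Toy rows (hypothetical values)
rows = [
    {"snp": "toy1", "gene": "GENE_A,GENE_B", "gwas_pvalue": "3e-31"},
    {"snp": "toy2", "gene": "GENE_C", "gwas_pvalue": "1e-5"},
    {"snp": "toy3", "gene": "GENE_A", "gwas_pvalue": "NA"},
]
significant = filter_genome_wide(rows)
print([row["snp"] for row in significant])
print(unique_genes(significant))
```

Further filters, such as requiring a ClinVar pathogenicity annotation, could be chained onto the same list of rows in the same way.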

Results

For our test case, we used the Patient's whole genome sequencing data provided for the Undiagnosed-1 hackathon event, hosted by SVAI/Research to the People. We applied our pipeline to chromosomes 1 to 22, X and Y. Out of 6,248,584 SNPs, we identified 151 high-priority variants, overlapping 129 genes, which demonstrate strong evidence for functional significance (Table 3). This pipeline therefore allows the user to prioritize certain variants for further downstream analyses or clinical follow-up.

Table 3. Examples of variants prioritized by SNPnotes.

Information includes mapped genes, disease associations from genetic studies, eQTL associations from GTEx, and overlap with open chromatin regions in gut tissues.

Variant | Gene | GWAS (p < 5×10⁻⁸) | eQTL | Open chromatin
rs964184 | ZPR1 | Diabetes mellitus (p < 10⁻¹⁰⁸) | Esophagus | Transcribed
rs9268645 | HLA-DRA | Type I diabetes mellitus (p < 10⁻¹⁰⁰) | Esophagus muscularis and mucosa | Weak transcription
rs4148325 | UGT1A8, UGT1A9, UGT1A5, UGT1A4, UGT1A6, UGT1A3 | Obesity (p < 5×10⁻⁹³) | Esophagus mucosa | Enhancer
rs964184 | ZPR1 | Metabolic syndrome (3×10⁻³¹) | Esophagus muscularis | Transcribed
rs2292239 | ERBB3 | Type 1 diabetes (p < 3×10⁻²⁷) | Sigmoid colon, Transverse colon, Small intestine terminal ileum, Stomach | Transcribed

Conclusions and next steps

This pipeline allows easy integration of several epigenomic functional annotation maps that could assist in SNV prioritization for clinical and basic research applications. The provided software can be customized by someone with basic bioinformatics or scripting expertise to generalize to other tissues profiled in the GTEx and NIH Roadmap Epigenomics projects.

One beneficial extension would be the identification of variants predicted to affect gene splicing. Such predictions are available in databases of splicing variants, such as dbscSNV [17,18], or from simple splice site prediction algorithms such as MaxEntScan [19,20]; the latter uses sequence motifs and maximum entropy calculations for its predictions. This approach is limited by the quality of variant databases, and by models that can predict only in instances that follow canonical rules for splice site regulation. Another promising avenue to predict aberrant splicing is SpliceAI [21], a deep learning-based model that predicts splice junctions from an arbitrary pre-mRNA transcript sequence. Another valuable addition would be the use of higher-order chromatin interaction maps, which allow the mapping of distal regulatory elements, such as enhancers, to genes (e.g. [8]).

Ethical statement and consent

This article is based on research that occurred at the Undiagnosed-1 hackathon event, hosted by SVAI/Research to the People.

The patient provided written informed consent for data release of their medical records, including genetic results, blood work and clinical laboratory reports, to the organisers of the event (SVAI/Research to the People). This consent included data release to participants of Undiagnosed-1 during the event, and subsequently for this data to be hosted by SVAI/Research to the People in an online data repository with restricted access (see details in “Data Availability”).

The patient provided written informed consent for the publication of all articles based on the research that was carried out at Undiagnosed-1 and any accompanying images.

Since the medical records are the patient’s property, the patient was fully informed of what data release would entail and written informed consent was obtained for the release of the medical records. No ethical approval was sought for the Undiagnosed-1 event or publication of articles relating to this event.

Data availability

Synapse repository: SVAI, Undiagnosed-1 (syn:20554923), https://doi.org/doi:10.7303/syn20554923 [22].

License information: Since the data contains detailed medical records, access is restricted in order to protect the identity of the patient. Intermediary data is provided throughout the article. In order to access the data, applicants must be registered users of synapse.org and must provide a proposal detailing what the data will be used for. Applicants will also be required to sign a statement that ensures that the data are not shared with others who have not applied to use the data from the SVAI. Please submit applications for data access to hello@sv.ai.

Software availability

Software for this pipeline is available at: https://github.com/shraddhapai/SNPNotes

Archived release at time of publication is located at: http://doi.org/10.5281/zenodo.3352276 [23].

License: MIT

Data in this project contains:

  • data/final_table.txt.informative.txt contains the list of informative genes obtained by running this pipeline on WGS data from the patient at the hackathon.

  • data/NHGRI_GWAS contains precompiled GWAS associations by disease category.

How to cite this article
Pai S, Apostolides MJ, Jung A and Moss MA. SNPnotes: high-throughput tissue-specific functional annotation of single nucleotide variants [version 1; peer review: 2 approved with reservations]. F1000Research 2019, 8:1784 (https://doi.org/10.12688/f1000research.20415.1)
Open Peer Review

Reviewer Report 04 Mar 2020
Pavel P. Kuksa, Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA 
Approved with Reservations
In this contribution authors propose SNPnotes pipeline for high-throughput tissue-specific functional annotation of single-nucleotide variants.

This work is motivated by a need for a stand-alone software capable of high-throughput annotation of custom WGS variant data.
Kuksa PP. Reviewer Report For: SNPnotes: high-throughput tissue-specific functional annotation of single nucleotide variants [version 1; peer review: 2 approved with reservations]. F1000Research 2019, 8:1784 (https://doi.org/10.5256/f1000research.22437.r60483)
Reviewer Report 27 Nov 2019
Deepti Jain, Department of Biostatistics, University of Washington, Seattle, WA, USA 
Approved with Reservations
The authors report the workflow and pipeline code they developed to annotate variants. The authors were motivated to develop this pipeline specifically so that a user could use tissue- and disease-specific resources of annotation and annotate a large set of ...
Jain D. Reviewer Report For: SNPnotes: high-throughput tissue-specific functional annotation of single nucleotide variants [version 1; peer review: 2 approved with reservations]. F1000Research 2019, 8:1784 (https://doi.org/10.5256/f1000research.22437.r56414)