Data Note
Revised

The ICR142 NGS validation series: a resource for orthogonal assessment of NGS analysis

[version 2; peer review: 2 approved]
PUBLISHED 05 Sep 2018

This article is included in the Data: Use and Reuse collection.

Abstract

To provide a useful community resource for orthogonal assessment of NGS analysis software, we present the ICR142 NGS validation series. The dataset includes high-quality exome sequence data from 142 samples together with Sanger sequence data at 704 sites; 416 sites with variants and 288 sites at which variants were called by an NGS analysis tool, but no variant is present in the corresponding Sanger sequence. The dataset includes 293 indel variants and 247 negative indel sites, and thus the ICR142 validation dataset is of particular utility in evaluating indel calling performance. The FASTQ files and Sanger sequence results can be accessed in the European Genome-phenome Archive under the accession number EGAS00001001332.

Keywords

Variant calling, next-generation sequencing, NGS, exome, indel, validation

Revised Amendments from Version 1

We recently released ICR142 Benchmarker, a tool that uses the ICR142 NGS validation series to generate standardised outputs and metrics to evaluate, optimise and benchmark variant calling algorithms. In the development of ICR142 Benchmarker we refined the ICR142 NGS validation series to maximise its utility, based on detailed re-review of all the data and user feedback.
 
The methods used in the data review are detailed in the amended paper and led to the exclusion of 26 sites, such that the ICR142 NGS validation series now includes 704 sites: 416 sites with variants and 288 sites at which variants were called by an NGS analysis tool, but no variant is present in the corresponding Sanger sequence. This amended version of the paper describes the updated dataset.
 
ICR142 Benchmarker is described in Ruark E, Holt E, Renwick A et al. ICR142 Benchmarker: evaluating, optimising and benchmarking variant calling using the ICR142 NGS validation series [version 1; referees: awaiting peer review]. Wellcome Open Res 2018, 3:108 (doi: 10.12688/wellcomeopenres.14754.1)


Introduction

Next-generation sequencing (NGS) approaches have greatly enhanced our ability to detect genetic variation. Over the past decade NGS hardware, software, throughput, data quality and analytical tools have evolved dramatically. Thorough evaluation of each new laboratory and analytical development is challenging but necessary to fully understand how pipeline modification can impact results. To fully assess performance, NGS analysis tools should ideally be run on samples with pre-determined positive and negative sites assessed through orthogonal experimentation such as Sanger sequencing.

Over the past five years, we have generated extensive data on thousands of samples using different NGS instruments, sequencing chemistry, gene panels, exome captures and variant calling tools. Fortuitously, during this process we have generated orthogonal validation data using Sanger sequencing for a core set of 142 samples that were included in the majority of our experiments. We now formally use these samples, which we call the ICR142 NGS validation series, to evaluate NGS variant calling performance after any change to experimental or analytical protocols. This series has proved an extremely useful resource for our assessment of NGS analysis in both the research and clinical settings. We believe that it may also have utility for others, and hence are making it available here.

Materials and methods

We used lymphocyte DNA from 142 unrelated individuals. All individuals were recruited to the BOCS study and have given informed consent for their DNA to be used for genetic research. The study is approved by the London Multicentre Research Ethics Committee (MREC/01/2/18).

Over the last five years we have generated data from the ICR142 validation series using different exome captures, which we have analysed with multiple aligner/caller combinations [1–6]. To date we have generated Sanger sequence data for 704 sites amongst the 142 individuals. These sites include variants called by only one aligner and caller combination, increasing the representation of sites which can discriminate performance between methods.

To generate the Sanger sequence data, we performed PCR reactions using the Qiagen Multiplex PCR kit, and bidirectional sequencing of resulting amplicons using the BigDye terminator cycle sequencing kit and an ABI3730 automated sequencer (ABI PerkinElmer). All sequencing traces were analysed with both automated software (Mutation Surveyor version 3.10, SoftGenetics) and visual inspection.

To determine if a variant was present we visually inspected each Sanger sequence with Chromas software v2.13. For each site we selected an Ensembl transcript (ENST) from Ensembl release 65 as the reference sequence. We reviewed at least 100 base pairs of sequence flanking each variant site to allow for position/annotation errors. We considered a base substitution to be confirmed if the correct variant was called at the exact position and the variant base signal was accompanied by a corresponding reduction in the reference base signal. There were 123 confirmed base substitution variants. We considered an indel variant to be confirmed if an indel variant was present in the region of interest and the indel variant allele signal was present along the complete length of the region of interest. There were 293 confirmed indel variants.

We considered a site negative for a base substitution if the specific base substitution was not present, resulting in 41 negative base substitution sites. We considered a site negative for an indel if no indel, of any kind, was detected in the 200 base pair region of interest, resulting in 247 negative indel sites (Figure 1). We annotated confirmed variants with the HGVS-compliant CSN standard using CAVA (version 1.1.0) according to the transcripts designated in Supplementary table 1 [7].
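As a quick consistency check, the site counts reported above can be tallied in a few lines of Python. This is only an illustrative sketch: the dictionary keys and the `total` helper are names chosen here for the example and are not part of the dataset or of any published tool.

```python
# Site counts as reported in the text for the current version of the dataset.
# The labels used as dictionary keys are illustrative, not official field names.
counts = {
    ("positive", "base_substitution"): 123,
    ("positive", "indel"): 293,
    ("negative", "base_substitution"): 41,
    ("negative", "indel"): 247,
}


def total(status=None, variant_type=None):
    """Sum site counts, optionally filtered by status and/or variant type."""
    return sum(
        n
        for (s, t), n in counts.items()
        if (status is None or s == status)
        and (variant_type is None or t == variant_type)
    )


positives = total(status="positive")        # 123 + 293
negatives = total(status="negative")        # 41 + 247
all_sites = total()                         # every Sanger-assessed site
indel_sites = total(variant_type="indel")   # 293 + 247

print(positives, negatives, all_sites, indel_sites)  # 416 288 704 540
```

The tally reproduces the 416 positive and 288 negative sites (704 in total) and shows that 540 of the 704 sites, roughly three quarters, are indel sites.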

Figure 1. Description of variant sites evaluated by Sanger sequencing in the ICR142 NGS validation series.

We have also generated high-quality exome sequencing data for the ICR142 NGS validation series. We prepared DNA libraries from 1.5 µg genomic DNA using the Illumina TruSeq sample preparation kit. DNA was fragmented using Covaris technology and the libraries were prepared without gel size selection. We performed target enrichment in pools of six libraries (500 ng each) using the Illumina TruSeq Exome Enrichment kit. The captured DNA libraries were PCR amplified using the supplied paired-end PCR primers. Sequencing was performed with an Illumina HiSeq2000 (SBS Kit v3, one pool per lane) generating 2×101 bp reads. CASAVA v1.8.1 (Illumina) was used to demultiplex and create FASTQ files per sample from the raw base call files.

All of the 704 sites had at least 15× coverage in the exome data, defined as at least 15 reads of good mapping quality (mapping score ≥20). Because these sites are well covered, we can readily assess the variant calling performance of any software tool by applying the pipeline to the exome sequencing data and comparing the variant calls with the Sanger sequencing dataset.
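The comparison described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the method used by the authors or by ICR142 Benchmarker: the `Site` and `Call` record layouts, the function names, and the matching rules (exact position and base for substitutions; any indel within 100 bp either side of the site for indels, mirroring the 200 bp region used in the Sanger review) are all choices made for this example. The toy data are hypothetical and are not taken from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """One Sanger-assessed site (field names are illustrative)."""
    sample: str
    chrom: str
    pos: int
    kind: str        # "substitution" or "indel"
    present: bool    # True if Sanger confirmed a variant at this site
    alt: str = ""    # assessed alternate base (substitutions only)


@dataclass(frozen=True)
class Call:
    """One variant call from the pipeline under test."""
    sample: str
    chrom: str
    pos: int
    kind: str
    alt: str = ""


def depth_ok(mapqs, min_reads=15, min_mapq=20):
    """Coverage rule from the text: at least 15 reads with mapping score >= 20."""
    return sum(q >= min_mapq for q in mapqs) >= min_reads


def site_called(site, calls, flank=100):
    """Assumed matching rule: exact position and base for substitutions,
    any indel within `flank` bp of the site for indels."""
    for c in calls:
        if (c.sample, c.chrom, c.kind) != (site.sample, site.chrom, site.kind):
            continue
        if site.kind == "substitution":
            if c.pos == site.pos and c.alt == site.alt:
                return True
        elif abs(c.pos - site.pos) <= flank:
            return True
    return False


def evaluate(sites, calls):
    """Tally calls against Sanger results and derive summary metrics."""
    tally = {"tp": 0, "fn": 0, "fp": 0, "tn": 0}
    for s in sites:
        called = site_called(s, calls)
        if s.present:
            tally["tp" if called else "fn"] += 1
        else:
            tally["fp" if called else "tn"] += 1
    pos = tally["tp"] + tally["fn"]
    neg = tally["fp"] + tally["tn"]
    tally["sensitivity"] = tally["tp"] / pos if pos else None
    tally["false_positive_fraction"] = tally["fp"] / neg if neg else None
    return tally


# Toy data (hypothetical, not taken from the dataset).
sites = [
    Site("S1", "1", 1000, "substitution", True, "A"),
    Site("S1", "2", 5000, "indel", True),
    Site("S2", "3", 7000, "indel", False),
    Site("S2", "4", 900, "substitution", False, "T"),
]
calls = [
    Call("S1", "1", 1000, "substitution", "A"),  # matches a confirmed variant
    Call("S2", "3", 7040, "indel"),              # indel near a negative indel site
    Call("S2", "4", 900, "substitution", "G"),   # different base from the one assessed
]
result = evaluate(sites, calls)
print(result)
```

In practice the calls would be parsed from the pipeline's output for each sample, and the matching rules may need adjusting to how a given caller represents indel positions.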

Data availability

We have deposited the FASTQ files for all 142 individuals in the European Genome-phenome Archive (EGA). The accession number is EGAS00001001332.

Researchers and authors who use the ICR142 NGS validation series should reference this paper and should include the following acknowledgement: "This study makes use of the ICR142 NGS validation series data generated by Professor Nazneen Rahman's team at The Institute of Cancer Research, London".

how to cite this article
Ruark E, Renwick A, Clarke M et al. The ICR142 NGS validation series: a resource for orthogonal assessment of NGS analysis [version 2; peer review: 2 approved]. F1000Research 2018, 5:386 (https://doi.org/10.12688/f1000research.8219.2)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Version 1 (published 22 Mar 2016)
Reviewer Report 03 May 2016
Brad Chapman, Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA
Oliver Hofmann, Wolfson Wohl Cancer Research Centre, Institute of Cancer Sciences, University of Glasgow, Glasgow, UK 
Approved
The authors describe ICR142, a publicly available set of FASTQ files and confirmed true and false variants for validating analysis pipelines. This is an incredibly useful community resource that complements existing efforts like the Genome in a Bottle project by ...
Chapman B and Hofmann O. Reviewer Report For: The ICR142 NGS validation series: a resource for orthogonal assessment of NGS analysis [version 2; peer review: 2 approved]. F1000Research 2018, 5:386 (https://doi.org/10.5256/f1000research.8841.r13013)
Reviewer Report 21 Apr 2016
Richard Bagnall, Agnes Ginges Centre for Molecular Cardiology, Centenary Institute, Sydney Medical School, The University of Sydney, Sydney, NSW, Australia 
Approved
A myriad of software tools have been developed for the alignment of next generation sequencing data to a reference genome and for the subsequent genotyping of DNA variants. Evaluating the specificity and sensitivity of a variant calling framework can be ...
Bagnall R. Reviewer Report For: The ICR142 NGS validation series: a resource for orthogonal assessment of NGS analysis [version 2; peer review: 2 approved]. F1000Research 2018, 5:386 (https://doi.org/10.5256/f1000research.8841.r13347)