Keywords
Structural Variant, Graph Genome, Human Genomics, Clinical Annotation, Quality Control, Codeathon
Structural variants (SVs) are large-scale genomic alterations, frequently defined as greater than 50 base pairs (bp) in length, comprising deletions, duplications, insertions, inversions, and/or translocations, which can also occur in combination. SVs have been linked to multiple phenotypic differences between organisms, as well as within populations of the same species1,2. SVs are known to play roles in myriad diseases, including neurological disorders (e.g. Parkinson disease, Huntington disease), Mendelian conditions3, and the genomic alterations seen in many cancers and constitutional diseases4–7. In contrast to single nucleotide variants (SNVs), which involve substitution of a single nucleotide, SVs remain understudied due to their more complex nature1. Our understanding of these larger forms of genomic alteration is limited by the sequencing technologies and computational methods available to analyze ever-increasing amounts of sequence data. A special type of SV is the copy number variant (CNV): an unbalanced SV that increases or decreases total DNA content through duplication or deletion, respectively. Clinically important CNVs are associated with diseases such as obesity8,9, type 1 diabetes10, and rheumatoid arthritis10; neurological disorders such as autism11 and schizophrenia12; and Mendelian diseases and other genomic disorders. More general studies examine the relationship between copy number variation and a range of diseases13,14.
Recent studies report between 20,000 and 25,000 SVs per human genome1. Although the number of SVs per individual is smaller than that of SNVs, SVs account for more altered nucleotides per diploid genome15. Recently, there has been a prominent shift towards studying SVs at the population level. An early and prominent example was the 1000 Genomes Project16, which sampled multiple ethnic groups. Other projects include National Institutes of Health (NIH)-sponsored programs such as the National Human Genome Research Institute (NHGRI)’s Centers for Common Disease Genomics (CCDG) and the National Heart, Lung, and Blood Institute (NHLBI)’s Trans-Omics for Precision Medicine (TOPMed) program, which has sequenced ~155,000 participants. With data from newer, high-throughput sequencing platforms, investigators can capture SV variability at genomic scale across more geographically and genetically distinct populations. The Icelandic project17 studied 1,817 Icelandic individuals, including 369 trios (mother + father + child). Accordingly, many tools exist to study SVs in cohorts (e.g. svtools18 and the StructURal Variant majorIty VOte (SURVIVOR)19).
Our understanding of genomic variability in humans is tied to the technologies used to study those genomes, typically involving DNA sequencing. Short-read (SR) DNA sequencing (producing reads usually less than 150 bp in length) has been the most common way to evaluate DNA samples directly, and RNA samples indirectly after cDNA conversion20,21. When short reads are mapped to reference genomes, they map either entirely or partially22. Partial mapping can be accomplished by locally aligning part of the read and dropping the rest; this is known as soft-clipping. However, short reads do not align well to large variants when there is a significant gap between the last anchored reference position of the read and the position of the soft-clipped portion (e.g. a variant with repetitive sequences longer than the read itself)23. Therefore, the aligner may not produce a global alignment between the read and the reference, either choosing not to map the entire read or leaving the soft-clipped portion unmapped1.
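To make the soft-clipping signal concrete, the short Python sketch below (illustrative only, not part of any tool described in this article) uses pysam to count mapped reads carrying large soft-clipped segments, one simple way to flag reads near candidate SV breakpoints. The BAM file name and the 20 bp clip threshold are assumptions for illustration.

```python
# Illustrative sketch: count reads with large soft-clipped segments using pysam.
# Assumes a coordinate-sorted, indexed BAM ("sample.bam"); the 20 bp threshold
# is an arbitrary example cutoff, not a published default.
import pysam

SOFT_CLIP = 4        # CIGAR operation code for soft clipping ('S')
MIN_CLIP_LEN = 20    # assumed threshold for a "large" clip

def count_soft_clipped(bam_path):
    total, clipped = 0, 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch():
            if read.is_unmapped or read.cigartuples is None:
                continue
            total += 1
            if any(op == SOFT_CLIP and length >= MIN_CLIP_LEN
                   for op, length in read.cigartuples):
                clipped += 1
    return total, clipped

if __name__ == "__main__":
    n, c = count_soft_clipped("sample.bam")
    print(f"{c} of {n} mapped reads carry a soft clip >= {MIN_CLIP_LEN} bp")
```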
The linear reference genome (LRG), which is based on a linear coordinate system, is a common way to represent variability within a genome24,25. While this may be efficient for applications using one or a few genomes, more complex applications using multiple LRGs do not scale well. For example, when mapping reads from RNA sequencing (RNA-seq) to a single genome, the use of an LRG requires a baseline amount of compute resources; when mapping to multiple linear reference genomes, however, the task scales roughly linearly (increasing by a factor of n, where n is the number of LRGs to which the data are mapped).
By contrast, graph genomes are a non-linear alternative that can explicitly encode many alternate paths. Variants can be represented in this data structure, and reads can then be mapped directly to both a reference path and variant sequences. Graphs have been shown to reduce reference bias and improve read mapping to variants26–28, and variants can also be genotyped efficiently from such structures. However, adding more genomes or variants enhances the graph but rapidly increases its complexity; to keep the problem computationally tractable, the genomic regions to include must be chosen carefully. Tools for working with graphs are still nascent, and there are few open-source graph implementations that enable mapping directly to SVs (e.g. vg27 and Paragraph28).
However drastically individuals may differ in physical characteristics and behaviors, the genomes of any two people are relatively well conserved. The completion of de novo individual genomes is costly, and population-scale comparisons of such assemblies are impossible with current technology infrastructure. Only a small number of individuals are represented in the human reference genome, whereas any given individual carries on average ~4 million single-nucleotide polymorphisms (SNPs) and ~2,500 CNVs16,29. Consequently, there is a growing awareness that the current standard reference genome assemblies do not include available human variation data. Moreover, there is an urgent need for improved tools that more precisely represent rare genotypes of individuals bearing haplotypes absent from the reference human genome30,31. This is particularly relevant for genomic variants leading to clinical phenotypes, such as in Mendelian diseases and genomic disorders, where rare and ultra-rare variants have a prominent role. A reliable and scalable solution is a more comprehensive reference genome data structure, such as a graph genome, that represents the variation present in a given population8. The variants in a given cohort’s genomic sequences are represented as independent walks along the graph, allowing it to represent the reference cohort.
Overall, the quality assessment, representation, and annotation of SVs across multiple disciplines (e.g. whole-genome sequencing (WGS) and metagenomic sequencing) remain challenging. Thus, our SV codeathon groups focused on seven topics, which led to the development of seven new computational methods. These topics included: 1.) quality assessment of population-scale VCF files; 2.) metagenomic assembly quality assessment; 3.) CNV detection and identification of de novo SVs using long reads; 4.) fast genome graph generation; 5.) SV annotation; 6.) graph making and graph-based read-mapping with GPUs; and 7.) CNV detection quality control. Here we describe our progress, detailing the tools’ implementations and applications to foster continued development beyond their state at the end of the codeathon. All methods are open-source licensed and available on GitHub (https://github.com/NCBI-Codeathons/).
Clouseau: rapid quality assessment of population-scale VCF files. As we progress from comparing SVs between a sample and its corresponding reference (capturing differences between a given sample and its reference genome) to large cohort genomic datasets (variants across multiple hundreds to thousands of samples), large variant call format (VCF) files are being generated. Generating these large VCF files often involves customized scripts or analysis methods, leaving open the possibility of human or other runtime errors. For example, undetected errors such as incomplete files or artificially missing data might be mistaken for real biological phenomena (deletions/truncations); leaving these unchecked may lead to erroneous conclusions. We developed Clouseau to address these challenges. Clouseau is a command-line tool that allows users to rapidly validate VCF file formatting and generate multiple QC statistics, providing rapid QC insights into the input VCF file.
MasQ: Metagenomic assemblies-focused Quality assessment. While metagenomic assemblies have improved significantly since the early days of the Human Microbiome Project (HMP), intragenomic and intergenomic repetitive sequences remain confounders. Individual reads spanning microbial strains (either via long-read technology or by generating synthetic long reads) may be pieced together to resolve variation within a given microbial community. However, this process is imperfect; a major concern is that detected structural variants could actually result from misassembled genomic data rather than genuine strain-specific variation. The goal of this project is to identify errors in any metagenomic assembly based on both short- and long-read mapping, in the hope of eliminating some of the uncertainty and error in metagenomic studies. Examples of errors to detect include falsely called (false positive) inversions, chimeras (translocations), INDELs (fewer than 50 bases long), and replacements (large substitutions). We created a containerized quality control pipeline called MasQ to locate, classify, and rectify errors in metagenomic assemblies. We found that the quality control of metagenomic assemblies built from short-read and long-read sequencing data can be substantially improved by combining sequence alignment tools with the VCF outputs of SV callers such as Sniffles v1.0.832 and Manta v1.5.033. The resulting SV call sets are subsequently compared and merged using packages such as Truvari v0.1.2018.08.1 and SURVIVOR19. The current version of MasQ neatly packages this workflow and integrates it with novel correction and validation steps to fix any erroneous contigs found.
DeNovoSV: CNV detection and identification of de novo SV events using long reads. We developed a pipeline to identify de novo structural variants (SVs) from long-read (LR) sequencing data collected from trios (proband + parents). As described below, in a pilot study we analyzed data from an individual who carries multiple de novo CNVs, initially identified by array-based comparative genomic hybridization (aCGH). Prior to this work, we lacked an integrated bioinformatics pipeline to identify high-confidence de novo structural variants called from LR DNA sequencing (e.g., Oxford Nanopore Technologies, ONT). Accordingly, to select further SV calls for orthogonal validation, we merged de novo LR SV calls with de novo SV calls independently identified by short-read (SR) DNA sequencing (i.e. PCR-free paired-end Illumina DNA sequencing, 150 bp paired-end reads, 40x depth of coverage) in this trio. DeNovoSV facilitates identification and prioritization of high-quality de novo SV calls that can be further validated using targeted orthogonal methodologies such as Sanger sequencing or droplet digital PCR (ddPCR).
SWIGG (SWIft Genomes in a Graph): fast genome graph generation. There is a growing consensus across genomics that linear genome representation is suboptimal for representing variants across populations. While graph genomes have been steadily gaining popularity, many challenges remain (complexity, visualization, etc.). In this project, we developed a heuristic approach that quickly generates graphs and represents genomes in an efficient and succinct way. We created a simple algorithm and tool to build a graphical model that captures variability in genomes at multiple scales. Moreover, there are regions of the human genome that are conserved among species yet bear modest amounts of variability, making them suitable for understanding relationships of genome structure among individuals and/or organisms. We used a k-mer approach to create a sparse, large-scale representation of such regions (anchors), allowing the entire genome to be visualized easily. These "anchored" graphs can then be iteratively refined to include local sequence differences and, in turn, help with genotyping existing variants and identifying new variants in new genomes.
ASAP (Automated Structural Variation Annotation Pipeline). ASAP is an automated and robust pipeline to annotate structural variations. The pipeline integrates annotations including allele frequency, colocated genes, functional features, domains, regulatory elements, and transcription levels. To facilitate further development, we took a pseudo-multistage build approach. Specifically, during stage one, we pull the main program, AnnotSV v2.234, from its remote source, as we expect this step to change relatively infrequently and the program itself is large. In stage two, based on the previously built base image, we pull in its dependencies, including BEDtools v2.29.035, and build the Docker image. The workflow is as follows: 1.) the user provides a VCF containing SVs as input to AnnotSV; 2.) AnnotSV annotates the variants and outputs a tab-delimited file; and 3.) the R script “postprocess.R” processes the TSV file generated by AnnotSV and extracts essential annotations such as the ranking score. AnnotSV has many default annotation sources (https://lbgi.fr/AnnotSV/), and can also accept user-provided annotations as input. The output of this pipeline used the default annotations.
Super-minityper: graph making and graph-based read-mapping with GPUs in the cloud. We present a set of cloud-based workflows, composed mostly of preexisting and optimized tools, to 1.) construct graphs containing structural variants and 2.) map reads to these graphs. Our workflows allow users to take arbitrary SV calls, construct a graph, and map reads to it. The workflow prioritizes ease of use and speed, accepting common input formats and returning results in minutes on commodity cloud virtual machines. Our approach is an example of what can be done now, and is generalizable to newer graph tools.
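As a hedged illustration of these two workflow steps, the Python sketch below drives vg (cited above) through graph construction, indexing, and read mapping. This is one possible realization under the assumption that vg is the graph backend; super-minityper's actual component tools and options may differ, and all file names are placeholders (the SV VCF is assumed to be bgzipped and tabix-indexed).

```python
# Sketch: build an SV-aware graph and map reads to it with vg (assumed backend).
# File names are placeholders; real runs may need extra options (threads, k-mer
# size for GCSA indexing, GPU-accelerated mappers, etc.).
import subprocess

def build_and_map(ref_fa, sv_vcf_gz, reads_fq, prefix):
    # 1. Construct a variation graph from the reference plus SV calls
    #    (-a keeps alt paths so SV alleles remain addressable in the graph).
    with open(f"{prefix}.vg", "wb") as graph:
        subprocess.run(["vg", "construct", "-r", ref_fa, "-v", sv_vcf_gz, "-a"],
                       stdout=graph, check=True)
    # 2. Build the xg/GCSA indexes needed for mapping.
    subprocess.run(["vg", "index", "-x", f"{prefix}.xg",
                    "-g", f"{prefix}.gcsa", f"{prefix}.vg"], check=True)
    # 3. Map reads against the graph; alignments are written in GAM format.
    with open(f"{prefix}.gam", "wb") as gam:
        subprocess.run(["vg", "map", "-x", f"{prefix}.xg",
                        "-g", f"{prefix}.gcsa", "-f", reads_fq],
                       stdout=gam, check=True)

build_and_map("ref.fa", "svs.vcf.gz", "reads.fastq", "sv_graph")
```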
ScanCNV. Many current medical genomic studies still rely on CNV calling to identify de novo events that could have led to a given disease phenotype. Nevertheless, a common problem for CNV detection is a high rate of false positives36 due to multiple biases in leading short-read sequencing technologies (e.g. GC bias, repetitiveness, and general unevenness of the sequence data). To identify false positives and thus improve the reliability of CNV calling pipelines, we designed ScanCNV, which includes multiple QC steps. These QC steps currently include FastQC v0.11.9, XYAlign v1.1.637, and PLINK v2.038; all of the resulting information is currently assembled and vetted within ScanCNV. Future work is still required to automate the process and fully leverage the obtained QC results.
All methods were tested on DNAnexus instances: azure:mem1_ssd1_x4 (4 cores, 8 GB RAM) or azure:mem1_ssd1_x16 (16 cores, 32 GB RAM).
Clouseau. Clouseau requires Python 3.5 or above. Users supply optional parameters, such as the expected maximum distance between variants, to identify missing entries. Importantly, the input VCF file needs to be sorted by coordinates. The Clouseau pipeline reads in a VCF file and first performs a basic check to ensure that the file is complete and was not truncated during VCF generation. Next, the VCF is parsed and sample-level QC is carried out. This consists of checking the number of samples in the VCF, the number of chromosomes, the names of all chromosomes/contigs, the distribution and number of variants (single nucleotide variants, insertions, deletions, structural variants) in each chromosome/contig for all samples, and the start and end coordinates of each variant for each chromosome and sample. Clouseau further tries to identify missing entries based on long stretches with no variants, which might represent file errors (as in incomplete files) or real biological phenomena (as in deletions/truncations). The full workflow is shown in Figure 1.
Clouseau starts with a VCF/pVCF/gVCF (variant call format) file that is assessed to ensure the completeness of the previous run. Furthermore, Clouseau assesses the overall statistics to give insights into the per sample quality control (QC). VCF, variant call format. pVCF, project variant call format. gVCF, genome variant call format. QC, quality control.
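The following minimal Python sketch illustrates the kind of checks described above; it is not Clouseau's actual code. It verifies basic record structure, counts variants per contig, records the sample count, and flags long variant-free stretches. The input path and the 1 Mb gap threshold are illustrative assumptions (Clouseau exposes the maximum-distance parameter to the user).

```python
# Illustrative VCF QC sketch (not the Clouseau implementation).
# Assumes a coordinate-sorted VCF or VCF.GZ; the 1 Mb gap threshold is an example.
import gzip
from collections import defaultdict

def vcf_qc(path, max_gap=1_000_000):
    opener = gzip.open if path.endswith(".gz") else open
    samples, counts, last_pos, gaps = [], defaultdict(int), {}, []
    with opener(path, "rt") as vcf:
        for line in vcf:
            if line.startswith("##"):
                continue                                     # meta-information lines
            if line.startswith("#CHROM"):
                samples = line.rstrip("\n").split("\t")[9:]  # sample columns
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 8:
                raise ValueError("truncated or malformed VCF record")
            chrom, pos = fields[0], int(fields[1])
            counts[chrom] += 1
            prev = last_pos.get(chrom)
            if prev is not None and pos - prev > max_gap:
                gaps.append((chrom, prev, pos))              # suspicious variant-free stretch
            last_pos[chrom] = pos
    return {"n_samples": len(samples),
            "variants_per_contig": dict(counts),
            "long_gaps": gaps}

print(vcf_qc("cohort.vcf.gz"))
```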
MasQ. The MasQ pipeline comprises four stages for metagenomic study quality control and validation: assembly, classification, correction, and validation. Our pipeline takes read files in FASTQ format as input, classifies possible assembly errors as inversions, INDELs, substitutions, chimeras, or N/A, and then compares them with the make-up of the original assembly.
The purpose of the assembly stage is to perform basic quality control on the sequencing data, generate a mock assembly from alignments of the read data, and generate VCF files describing identified errors. Short reads are pre-processed and visually assessed for quality using FastQC. Then, all the reads are aligned to the MEGAHIT assembly using BWA-MEM v0.7.439. The alignment results are processed with Manta v1.5.033, a structural variant caller, to produce a VCF file containing the detected errors. Optionally, if long reads are also available for the same sample, they are passed through a similar pipeline: the long reads are inspected for quality using NanoporeQC, Minimap2 v2.840 is used to align them instead of BWA-MEM, and the long-read alignments are provided to Sniffles v1.0.832 to call SVs, condensing this information into a VCF file.
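A minimal sketch of the assembly-stage commands just described is shown below, expressed as a small Python driver. File names are placeholders, the assembly is assumed to be already indexed for BWA, and real invocations will need tuning (threads, read groups, and Manta's two-step configManta.py/runWorkflow.py setup, which is omitted here).

```python
# Sketch of the MasQ assembly-stage alignments and SV calling (illustrative).
# Assumes the assembly FASTA is already indexed for BWA ("bwa index") and that
# samtools, minimap2, and sniffles are on PATH; the Manta short-read SV-calling
# step is omitted for brevity.
import subprocess

def align_short_reads(assembly, r1, r2, out_bam):
    # BWA-MEM alignment of short reads to the assembly, coordinate-sorted.
    bwa = subprocess.Popen(["bwa", "mem", assembly, r1, r2], stdout=subprocess.PIPE)
    subprocess.run(["samtools", "sort", "-o", out_bam, "-"],
                   stdin=bwa.stdout, check=True)
    bwa.stdout.close()
    bwa.wait()
    subprocess.run(["samtools", "index", out_bam], check=True)

def call_long_read_svs(assembly, ont_fq, out_bam, out_vcf):
    # Minimap2 alignment of long reads (--MD emits the MD tag Sniffles v1 uses),
    # then Sniffles (v1 syntax) SV calling on the sorted BAM.
    mm2 = subprocess.Popen(["minimap2", "-a", "-x", "map-ont", "--MD", assembly, ont_fq],
                           stdout=subprocess.PIPE)
    subprocess.run(["samtools", "sort", "-o", out_bam, "-"],
                   stdin=mm2.stdout, check=True)
    mm2.stdout.close()
    mm2.wait()
    subprocess.run(["samtools", "index", out_bam], check=True)
    subprocess.run(["sniffles", "-m", out_bam, "-v", out_vcf], check=True)

align_short_reads("assembly.fa", "reads_R1.fastq.gz", "reads_R2.fastq.gz", "sr.bam")
call_long_read_svs("assembly.fa", "ont_reads.fastq.gz", "lr.bam", "long_read_svs.vcf")
```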
In the classification stage, VCF files for the short and long reads are compared to each other using Truvari and a merged VCF file is generated with SV data. Short-read and long-read data each present their own sets of challenges while building metagenomic assemblies, so comparison of the Truvari results from short and long sequencing reads of the same samples decreases false positive results.
The correction stage takes the regions of assembly error ascertained from the classification stage and performs the necessary changes to make a more accurate metagenomic assembly. From here, the validation stage uses sequence aligners such as Minimap2 and BWA to compare the edited assembly to the original inputs, looking to confirm an increase in the percentage of mapped reads in the corrected assembly, which would show the success of the pipeline in locating and correcting assembly errors.
MasQ is implemented using Docker and is freely available on DockerHub. MasQ can also be run online using DNAnexus. All relevant parameters, as well as python scripts for the correction and validation steps, can be found on the GitHub repository. The MasQ pipeline is shown below in Figure 2.
Over multiple steps, MasQ uses the available short and long reads to assess the quality of the previously obtained metagenomic assembly.
DeNovoSV. DeNovoSV takes input VCF files produced by long-read SV callers such as Sniffles v1.0.1132 and short-read SV callers such as Lumpy v0.2.1341, Delly v0.8.242, and Manta v1.6.033. We developed a custom shell script to remove calls genotyped as homozygous reference (GT=0/0), calls with read support less than 5, and calls aligned to haplotype, unplaced, or unlocalized contigs (GL, KI, etc.). Next, the filtered outputs are combined and compared by the pipeline using SURVIVOR v1.0.619 merge, requiring a maximum distance of 1,000 bp measured pairwise from the start and end breakpoints of each SV, SVs classified as the same type, and SVs larger than 30 bp. Lastly, CNV calls are made independently from the ONT alignments using mosdepth v0.2.343 with the parameter set “-F 3588 -Q 1”, which calculates the bp coverage in 100 kb bins and includes only primary and supplementary alignments. The coverage signal is normalized by a custom script, which generates bedgraph data as output: each bedgraph file pertaining to the proband or a parent is processed by dividing each bin’s score by the median of the coverage windows. Using this normalization, the majority of the genome shows a score of 1 (i.e. similar coverage between the samples), while a CNV in the proband results in a score of, for example, 2 for a duplication. We load these files into JBrowse44 with the multibigwig plugin for visualization and to facilitate downstream analyses of the putative CNV calls. The resulting output is a list of high-confidence de novo SVs. The overall workflow of DeNovoSV is shown in Figure 3.
The schematic shows the required inputs and final outputs, along with intermediate steps. DeNovoSV starts with already aligned reads (BAM files) from long-read and short-read sequencing and filters candidate SVs across the trio to identify de novo SVs in the proband.
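The coverage-normalization step described above can be sketched as follows. This is an illustrative stand-in for the custom script, assuming mosdepth was run with fixed-size bins so that its regions output has four columns (chromosome, start, end, mean coverage); file names are placeholders.

```python
# Illustrative sketch of median normalization of mosdepth bin coverage
# (not the actual DeNovoSV script). Scores near 1 indicate coverage similar
# to the genome-wide median; ~2 suggests a duplication.
import gzip
import statistics

def normalize_mosdepth(regions_bed_gz, out_bedgraph):
    bins = []
    with gzip.open(regions_bed_gz, "rt") as fh:
        for line in fh:
            chrom, start, end, cov = line.rstrip("\n").split("\t")[:4]
            bins.append((chrom, int(start), int(end), float(cov)))
    median_cov = statistics.median(b[3] for b in bins) or 1.0  # guard against zero
    with open(out_bedgraph, "w") as out:
        for chrom, start, end, cov in bins:
            out.write(f"{chrom}\t{start}\t{end}\t{cov / median_cov:.3f}\n")

normalize_mosdepth("proband.regions.bed.gz", "proband.normalized.bedgraph")
```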
SWIGG. The SWIGG (SWIft Genomes in a Graph) pipeline takes multiple FASTA sequences as input and outputs a graph that is a sparse representation of a multiple sequence alignment, which can be visualized to reveal larger-scale differences. First, the input FASTA sequences are processed to identify appropriate k-mers, namely those that are not repetitive within or across the FASTA sequences. The thresholds defining appropriate k-mers can be modified using the script arguments. K-mers are then sorted based on their positions and collapsed into a node-edge list, which can be visualized using a graph visualizer such as Gephi.
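The core idea can be illustrated with the minimal Python sketch below: select k-mers that occur exactly once within, and are present in, every input sequence, then connect consecutive anchors along each sequence into a node-edge list. This is a simplification of SWIGG (which exposes thresholds as script arguments, supports larger k such as 128, and writes output for visualizers such as Gephi); the sequences and k value here are toy assumptions.

```python
# Toy sketch of k-mer anchor selection and edge building (not the SWIGG code).
from collections import Counter

def unique_kmers(seq, k):
    # k-mers occurring exactly once in this sequence, mapped to their position.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer: seq.index(kmer) for kmer, n in counts.items() if n == 1}

def anchor_graph(sequences, k):
    per_seq = {name: unique_kmers(s, k) for name, s in sequences.items()}
    # Anchors: k-mers unique within, and shared across, every input sequence.
    anchors = set.intersection(*(set(d) for d in per_seq.values()))
    edges = set()
    for kmer_pos in per_seq.values():
        ordered = sorted((pos, kmer) for kmer, pos in kmer_pos.items() if kmer in anchors)
        # Consecutive anchors along each sequence become directed edges.
        for (_, a), (_, b) in zip(ordered, ordered[1:]):
            edges.add((a, b))
    return anchors, edges

haplotypes = {"hapA": "ACGTACGGTTCAGGCATTTACGA",
              "hapB": "ACGTACGCTTCAGGCATTAACGA"}
nodes, edges = anchor_graph(haplotypes, k=5)
print(len(nodes), "anchor nodes;", len(edges), "edges")
```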
Our motivation to implement SWIGG (Figure 4) is that the human genome essentially contains two types of regions: those that are quite stable and those that are hypervariable45,46. For the stable regions, short reads can either be hashed as exact matches to these relatively static regions or mapped with very small bubbles. Nevertheless, stable regions are interspersed with variable and hypervariable regions (e.g. KIR, MSC). We used k-mers to provide a backbone for these complex regions. Likewise, we also examined regions of the reference genome that are particularly unstable for mapping, in order to extrapolate to graphs. We used 64 such regions located on human chromosome 6; they are depicted graphically using the NCBI Genome Data Viewer47 in Figure 5. SWIGG is implemented in Python 3 and publicly available on both GitHub and DockerHub.
ASAP. The user provides one or more SV VCF files produced by any SV caller; the SVs are then ranked and annotated using AnnotSV, which produces a tab-delimited file that is processed using contemporary versions of R. The complete pipeline can be executed automatically as a single WDL script available on GitHub. The workflow is shown in Figure 6.
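A minimal sketch of this two-step flow is given below, assuming AnnotSV and Rscript are on the PATH. The option names follow AnnotSV's documented command line but should be checked against the installed version, and the file names are placeholders (the actual ASAP pipeline is orchestrated in WDL rather than Python).

```python
# Illustrative driver for the ASAP flow: AnnotSV annotation followed by the
# repository's postprocess.R script. Option names and file names are assumptions.
import subprocess

def run_asap(sv_vcf, out_prefix, genome_build="GRCh38"):
    annotated_tsv = f"{out_prefix}.annotated.tsv"
    # Step 1: annotate the SV VCF with AnnotSV, producing a tab-delimited file.
    subprocess.run(["AnnotSV",
                    "-SVinputFile", sv_vcf,
                    "-outputFile", annotated_tsv,
                    "-genomeBuild", genome_build],
                   check=True)
    # Step 2: extract essential annotations (e.g. ranking score) with postprocess.R.
    subprocess.run(["Rscript", "postprocess.R", annotated_tsv], check=True)

run_asap("sample_svs.vcf", "sample_svs")
```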
Before proceeding with SV analysis, the user needs insight into the SV file to avoid issues with downstream analyses. Clouseau facilitates this best-practice quality control check by analyzing an input SV file and providing detailed information about its content. Clouseau takes as input an SV VCF file and returns the number of samples in the file, the chromosomes analyzed, the distribution of SVs across all samples, and detailed information about each sample in the file. The output is a set of files covering all samples together and each sample individually. Clouseau was benchmarked using the SV file from the 1000 Genomes Project16. The dataset consists of 2,505 samples, and Clouseau took 1 minute (wall-clock time) to analyze 10,000 variant lines. Figure 7 shows an example of the per-sample output.
MasQ uses short- and long-read sequencing data as input. It outputs VCF files with the locations of structural variants, as well as a classification of assembly errors. Of the two existing approaches to assembly validation, reference-based and de novo, we chose to develop a de novo validation pipeline and benchmarked it using two widely used datasets, the Zymo Microbial Community Standards48 and Shakya49. The Zymo dataset consisted of both Illumina paired-end short reads and Oxford Nanopore long reads, while the Shakya dataset was made up of paired-end short reads. The current version of MasQ is publicly available as a Docker container that combines existing tools for assembly creation and comparison. It includes a Python classifier and correction step to identify and fix assembly errors, and is open source on GitHub.
For the metaSPAdes50 assembly of the Shakya49 data, metaQUAST51 reports 328 misassemblies with a total misassembled contig length of 12,555,565 base pairs. For the MEGAHIT52 assembly of the same data, metaQUAST reports 673 misassemblies with a total misassembled contig length of 20,244,280 base pairs. Thus, for both finished assemblies, metaQUAST reports hundreds of misassemblies totaling tens of millions of base pairs, and on the Shakya dataset metaSPAdes performs slightly better than MEGAHIT according to metaQUAST. In the MasQ results for both assemblies, Manta and Sniffles each call hundreds of misassemblies, on the same order as the metaQUAST output. From the long-read perspective, Sniffles calls fewer insertions and duplications on the metaSPAdes assembly. Even though metaSPAdes required significantly more time to finish assembling, its extension of edges through repetitive regions and careful removal of low-coverage edges contributed to a lower insertion and duplication rate.
In a prototype trial of DeNovoSV, we investigated available genomics data obtained from a trio of samples (i.e. father, mother, and proband), with a particular focus on identification of de novo SVs in the proband. Available data included ONT long-read sequences and Illumina short-read WGS from the trio, as well as aCGH; data from the latter were used as a positive control for known de novo CNVs for the purpose of developing this pipeline. Working with the ONT long-read sequence data, we used the NGMLR32 aligner to map the data against the human genome reference assembly (hg38), and Sniffles32 used the BAM files from this step to identify candidate de novo SVs. Read depths of SVs were calculated using mosdepth43. The Sniffles output VCF files were filtered using our DeNovoSV custom script to remove SVs with limited read support (RE < 5), homozygous reference SVs, SVs branching from autosomes to decoy sequences, and contigs identified by GL numbers, which represent unlocalized contigs. De novo SVs in the child were defined by merging and comparing calls with those of the parents using SURVIVOR19. After filtering for de novo SVs in the child using the DeNovoSV pipeline, we compared these de novo SV and CNV candidates with results extracted from the aCGH data.
We used the DeNovoSV pipeline to prioritize high-confidence de novo SVs called from long-read (LR) technology. Starting with about 3 million SV calls identified by Sniffles for each individual, the de novo LR pipeline detected 4,509 high-quality SV calls. We further enriched for potentially true-positive calls by merging the long-read calls with SV calls independently identified with short-read sequencing in this trio (N = 2,599 high-quality SV calls). After merging the calls from short-read sequencing to filter for consensus calls, we obtained a list of 67 high-quality SV calls (Figure 8 and Figure 9).
Del: deletion; Dup: duplication; Inv: inversion; INS: insertion; TRA: translocation; UNK: unknown.
Eight de novo duplications in the proband were detected using normalized read depth coverage data from mosdepth and visualized in JBrowse (Figure 10).
JBrowse screenshots displaying normalized Oxford Nanopore read coverage for four out of eight de novo duplications spanning 900 kb to 1 Mb genomic segments from Chromosomes 4, 5, 6 and 10. Parent 1 is represented by a blue line, parent 2 by a light blue line, and the proband by a dark red line. Red rectangles denote duplications.
With SWIGG we created a simple algorithm and tool to build genome graphs, suitable for understanding relationships of genome structure among individuals and/or organisms. Our approach captures variation in a hierarchical way: the idea is to create a sparse representation of large-scale differences (anchors) so as to allow the entire genome to be visualized succinctly. These "anchored" graphs can then be iteratively refined to include local sequence differences and, in turn, facilitate genotyping existing variants and identifying new variants in new genomes. We tested SWIGG by creating a graph of the human MHC region (4.5 Mb in size; Figure 11) from seven alternative haplotypes using 128-mers in less than three minutes. To evaluate our approach at a smaller scale, we also built a graph for 10 HIV genomes (each ~10 kb) using 10-mers (see the SWIGG GitHub repository).
ASAP was benchmarked using data from 53 non-diseased tissue sites across nearly 1000 individuals. GTEx RNA-seq data (gene read counts GCT file retrieved from https://gtexportal.org/home/datasets) and GTEx metadata (manually curated from https://www.ebi.ac.uk/arrayexpress/files/E-MTAB-5214/E-MTAB-5214.sdrf.txt) were used, together with the NA12878 Germline Whole Genome V2 Chromium Genome dataset processed with Long Ranger 2.2.1, from which the WGS v2 deletions VCF file was downloaded (https://support.10xgenomics.com/genome-exome/datasets/2.2.1/NA12878_WGS_v2). Messenger RNA features were extracted from the latest NCBI RefSeq annotation53 on GRCh38 using the instructions provided at https://www.ncbi.nlm.nih.gov/refseq/functionalelements/. The benchmarking workflow and results are shown in Figure 12.
The current version of Clouseau sought to validate VCF integrity and to provide informative statistics; Clouseau can read an SV file and return the relevant reports. Next steps would be to implement modules to accept SNVs and multiple input files, to visualize the statistics using publicly available modules, and to reimplement the tool for better performance.
MasQ currently provides a smooth, convenient workflow to locate genome assembly errors and a solid foundation for a machine learning model to classify assembly errors. In the future, MasQ would benefit from further testing with simulated data, larger datasets, and manual annotation of training data for a more sophisticated classifier. We intend to treat this problem as a multiclass text classification problem, which could be handled using an RNN, LSTM, SVM, or HAN framework (Figure 13). Additionally, the outputs from the classification model should be visualized with histograms of the assembly error types, as well as with visual indications of where the errors are located within the assembly. The assembly errors could also be organized so that each type of error could be queried within the given assembly and made easily accessible to users.
In a pilot study to analyze de novo SVs identified in a proband from a genetic trio, we evaluated the ability of DeNovoSV to confirm those SVs and potentially to discover additional ones. Using the DeNovoSV pipeline with the parameters described in Figure 3, all eight genome-wide chromosomal segments with known de novo duplications identified by aCGH in the proband were also identified by short and long reads. In addition, 59 further potential de novo SVs were identified, which need validation. Independently, the read-depth information generated by mosdepth using LR data from ONT also successfully identified all eight genomic duplications, suggesting that analysis of LR data alone can be valuable for de novo CNV detection.
Once both simple and variable regions are solved for graphs, including regions with many structural variants and complex regions across a number of populations and diseases, it will be possible to phase new genomes automatically. We will then be able to observe traversals of previously unmanageable paths through graphs built from both long- and short-read data, as shown in Figure 14.
Now that we are developing tools for generating genome graphs quickly and efficiently, it is desirable to find optimal parameters (e.g. k-mer size and k-mer occurrence thresholds) to determine whether a given graph should be retained. This remains challenging because there is no simple way to evaluate genome graphs. The numbers of nodes and edges can serve as starting points, but comprehensive investigation of potential factors is needed, such as determining to what extent known features are recapitulated in the graphs.
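As a starting point of the kind mentioned above, the small sketch below summarizes a graph stored in GFA format by counting segment (node) and link (edge) records. The file name is a placeholder, and more meaningful evaluation would layer feature-recapitulation checks on top of such counts.

```python
# Sketch: first-pass genome-graph statistics from a GFA file (illustrative).
def gfa_summary(path):
    n_nodes = n_edges = total_seq = 0
    with open(path) as gfa:
        for line in gfa:
            if line.startswith("S\t"):              # segment (node) record
                n_nodes += 1
                seq = line.split("\t")[2].strip()
                if seq != "*":                      # "*" means sequence not stored
                    total_seq += len(seq)
            elif line.startswith("L\t"):            # link (edge) record
                n_edges += 1
    return {"nodes": n_nodes, "edges": n_edges,
            "total_sequence_bp": total_seq,
            "edges_per_node": n_edges / n_nodes if n_nodes else 0.0}

print(gfa_summary("graph.gfa"))
```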
Clouseau - SV data from 2,505 samples from the 1000 Genomes Project (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/).
MasQ - Zymobiomics Microbial Community Standards Assemblies (10 genomes; 5 gram positive, 3 gram negative, 2 yeast). Each has a known reference. Synthetic bacteria included: Bacillus subtilis, Cryptococcus neoformans, Enterococcus faecalis, Escherichia coli, Lactobacillus fermentum, Listeria monocytogenes, Pseudomonas aeruginosa, Saccharomyces cerevisiae, Salmonella enterica, Staphylococcus aureus. Illumina pair-ended short reads from IMMSA dataset (ftp://ftp-private.ncbi.nlm.nih.gov/nist-immsa/IMMSA/). Shakya Assembly (assembled using MegaHIT and MetaSPAdes) - Each has a known reference. Synthetic bacteria included: Acidobacterium capsulatum, Aciduliprofundum boonei, Akkermansia muciniphila, Archaeoglobus fulgidus, Bacteroides thetaiotaomicron, Bacteroides vulgatus, Bordetella bronchiseptica, Burkholderia xenovorans LB400, Caldicellulosiruptor bescii, Caldicellulosiruptor saccharolyticus, Chlorobium limicola, Chlorobium phaeobacteroides, Chlorobium phaeovibriodes, Chlorobium tepidum, Chloroflexus aurantiacus, Clostridium thermocellum, Deinococcus radiodurans, Desulfovibrio piger, Desulfovibrio vulgaris, Dictyoglomus turgidum, Enterococcus faecalis, Fusobacterium nucleatum, Gemmatimonas aurantiacus, Geobacter sulfurreducens, Haloferax volcanii, Herpetosiphon aurantiacus, Hydrogenobaculum, Ignicoccus hospitalis, Leptothrix cholodnii, Methanocaldococcus jannaschii, Methanococcus maripaludis C5, Methanococcus maripaludis S2, Methanopyrus kandleri, Methanosarcina acetivorans C2A, Nanoarchaeum equitans, Nitrosomonas europaea, Nostoc sp. PCC 7120, Pelodictyon phaeoclatharatiforme, Persephonella marina EX-H1, Porphyromonas gingivalis, Pyrobaculum aerophilum IM2, Pyrobaculum arsenaticum, Pyrobaculum calidifontis, Pyrococcus furiosus, Pyrococcus horikoshii, Rhodopirellula baltica, Ruegeria pomeroyi, Salinispora arenicola, Salinispora tropica, Shewanella baltica OS185, Shewanella baltica OS223, Sulfitobacter sp.EE-36, Sulfitobacter sp.NAS-14.1, Sulfolobus tokodaii, Sulfurihydrogenibium sp.YO3AOP1, Sulfurihydrogenibium yellowstonense, Thermoanaerobacter pseudethanolicus, Thermotoga neapolitana DSM 4359, Thermotoga petrophila RKU-1, Thermotoga sp.RQ2, Thermus thermophilus HB8, Treponema denticola, Zymomonas mobilis. From Illumina pair-ended short reads. Number of sequences: 54814748. Read length: 101 bp.
DeNovoSV - For the development of the pipeline we used in-house control data from patients.
SWIGG - For data to construct the graph, we used the alternative sequences of the MHC region given in the latest reference Human Genome (HGR38). These were downloaded using NCBI accession numbers GL000250, GL000251, GL000253, GL000254, GL000255 and GL000256. HIV sequences were obtained from Los Alamos National Laboratories54, [with specific accession numbers] Human immunodeficiency virus type 1 (HXB2), complete genome; HIV1/HTLV-III/LAV reference genome [K03455], HIV-1 isolate 2106 from Democratic Republic of the Congo, complete genome [MH705158], HIV-1 isolate 50 from Democratic Republic of the Congo, complete genome [MH705161], HIV-1 isolate 70641 from Democratic Republic of the Congo, complete genome [MH705151], HIV-1 isolate P4039 from Democratic Republic of the Congo, complete genome [MH705157], HIV-1 isolate PBS6126 from Democratic Republic of the Congo, complete genome [MH705153], HIV-1 isolate PBS888 from Democratic Republic of the Congo, complete genome [MH705133], HIV-1 isolate HIV/CH/BID-V3538/2003 from Switzerland, partial genome [JQ403028], Human immunodeficiency virus 1 proviral env gene for envelope protein, isolate 98CM.MP1014 [AM279354], HIV-1 isolate 09NG010499 from Nigeria gag protein (gag) gene, complete cds; pol protein (pol) gene, partial cds; and vif protein (vif), vpr protein (vpr), tat protein (tat), rev protein (rev), vpu protein (vpu), envelope glycoprotein (env), and nef protein (nef) genes, complete cds. [KX389622].
ASAP - Data sources and data description: 53 non-diseased tissue sites across nearly 1000 individuals. GTEx RNA-seq data: RNA-Seq gene read counts GCT file downloaded from https://gtexportal.org/home/datasets. GTEx metadata: manually curated from https://www.ebi.ac.uk/arrayexpress/files/E-MTAB-5214/E-MTAB-5214.sdrf.txt. NA12878 Germline Whole Genome v2 Chromium Genome v2 dataset by Long Ranger 2.2.1: WGS v2 deletions VCF file downloaded from https://support.10xgenomics.com/genome-exome/datasets/2.2.1/NA12878_WGS_v2. RefSeq features: mRNA features extracted from the latest NCBI RefSeq annotation on GRCh38 using the instructions provided at https://www.ncbi.nlm.nih.gov/refseq/functionalelements/.
ScanCNV - FASTQ data for six simulated CNV samples (males and females; truth set: “In silico CNVs - Sheet1.csv”), and mapped data for 10 random 1000 Genomes samples (males and females; identified in the file names), available in the GitHub repository (see below).
Clouseau is available from GitHub: https://github.com/NCBI-Codeathons/Clouseau.
Archived source code at time of publication: https://doi.org/10.5281/zenodo.3950421 (ref. 55)
License: MIT
MasQ is freely available from GitHub: https://github.com/NCBI-Codeathons/MASQ.
Archived source code at time of publication: https://doi.org/10.5281/zenodo.3950441 (ref. 56)
License: MIT
DeNovoSV is freely available from GitHub: https://github.com/NCBI-Codeathons/DeNovoSV.
Archived source code at time of publication: https://doi.org/10.5281/zenodo.3950439 (ref. 57)
License: MIT
SWIGG is freely available from GitHub: https://github.com/NCBI-Codeathons/SWIGG.git.
DockerHub: ncbicodeathons/swigg.
Archived source code at time of publication: https://doi.org/10.5281/zenodo.3950425 (ref. 58)
License: MIT
ASAP is freely available from GitHub: https://github.com/NCBI-Codeathons/ASAP.
Archived source code at time of publication: https://doi.org/10.5281/zenodo.3950444 (ref. 59)
License: MIT
Super-minityper is freely available from GitHub: https://github.com/NCBI-Codeathons/super-minityper.
Archived source code at time of publication: https://doi.org/10.5281/zenodo.3950435 (ref. 60)
License: MIT
ScanCNV is freely available from GitHub: https://github.com/NCBI-Codeathons/SCANCNV.
Archived source code at time of publication: https://doi.org/10.5281/zenodo.3977283 (ref. 61)
License: MIT
We would like to thank the administrative staff of the Human Genome Sequencing Center at Baylor College of Medicine (especially Meagan Elizabeth Sam and Chelette Darlene Gaskin), who helped to organize and run the event. We thank DNAnexus Inc. for sponsoring cloud computing resources. We thank Drs. Pengfei Liu and James R. Lupski for generously providing data from their Multiple CNVs project for the DeNovoSV pipeline development, and the Oxford Nanopore Technologies team for helping with data generation and analysis. We thank Dr. Nick Matinyan for helpful suggestions during manuscript review.