The third international hackathon for applying insights into large-scale genomic composition to use cases in a wide range of organisms

Kimberly Walker; Divya Kalra; Rebecca Lowdon; Guangyi Chen; David Molik; Daniela C. Soto; Fawaz Dabbaghie; Ahmad Al Khleifat; Medhat Mahmoud; Luis F Paulin; Muhammad Sohail Raza; Susanne P. Pfeifer; Daniel Paiva Agustinho; Elbay Aliyev; Pavel Avdeyev; Enrico R. Barrozo; Sairam Behera; Kimberley Billingsley; Li Chuin Chong; Deepak Choubey; Wouter De Coster; Yilei Fu; Alejandro R. Gener; Timothy Hefferon; David Morgan Henke; Wolfram Höps; Anastasia Illarionova; Michael D. Jochum; Maria Jose; Rupesh K. Kesharwani; Sree Rohit Raj Kolora; Jędrzej Kubica; Priya Lakra; Damaris Lattimer; Chia-Sin Liew; Bai-Wei Lo; Chunhsuan Lo; Anneri Lötter; Sina Majidian; Suresh Kumar Mendem; Rajarshi Mondal; Hiroko Ohmiya; Nasrin Parvin; Carolina Peralta; Chi-Lam Poon; Ramanandan Prabhakaran; Marie Saitou; Aditi Sammi; Philippe Sanio; Nicolae Sapoval; Najeeb Syed; Todd Treangen; Gaojianyong Wang; Tiancheng Xu; Jianzhi Yang; Shangzhe Zhang; Weiyu Zhou; Fritz J Sedlazeck; Ben Busby

doi:10.12688/f1000research.110194.1

Software Tool Article

[version 1; peer review: 1 approved, 3 approved with reservations]

Kimberly Walker¹, Divya Kalra¹, Rebecca Lowdon², Guangyi Chen³,⁴, David Molik⁵, Daniela C. Soto⁶, Fawaz Dabbaghie³,⁷, Ahmad Al Khleifat⁸, Medhat Mahmoud¹, Luis F Paulin¹, Muhammad Sohail Raza⁹, Susanne P. Pfeifer¹⁰, Daniel Paiva Agustinho¹¹, Elbay Aliyev¹², Pavel Avdeyev¹³, Enrico R. Barrozo¹⁴, Sairam Behera¹, Kimberley Billingsley¹⁵, Li Chuin Chong¹⁶, Deepak Choubey¹⁷, Wouter De Coster¹⁸,¹⁹, Yilei Fu²⁰, Alejandro R. Gener²¹, Timothy Hefferon²², David Morgan Henke²³, Wolfram Höps²⁴, Anastasia Illarionova²⁵, Michael D. Jochum¹⁴, Maria Jose²⁶, Rupesh K. Kesharwani¹, Sree Rohit Raj Kolora²⁷, Jędrzej Kubica²⁸, Priya Lakra²⁹, Damaris Lattimer³⁰, Chia-Sin Liew³¹, Bai-Wei Lo³², Chunhsuan Lo³³, Anneri Lötter³⁴, Sina Majidian³⁵, Suresh Kumar Mendem³⁶, Rajarshi Mondal³⁷, Hiroko Ohmiya³⁸, Nasrin Parvin³⁷, Carolina Peralta³⁹, Chi-Lam Poon⁴⁰, Ramanandan Prabhakaran⁴¹, Marie Saitou⁴², Aditi Sammi⁴³, Philippe Sanio⁴⁴, Nicolae Sapoval²⁰, Najeeb Syed¹², Todd Treangen²⁰, Gaojianyong Wang⁴⁵, Tiancheng Xu²⁰, Jianzhi Yang⁴⁶, Shangzhe Zhang⁴⁷, Weiyu Zhou⁴⁸, Fritz J Sedlazeck¹, Ben Busby⁴⁹

PUBLISHED 16 May 2022

Author details

¹ Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, 77030, USA
² Bayer Crop Science, Chesterfield, MO, 63017, USA
³ Drug Bioinformatics, Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Saarbrücken, Germany
⁴ Center for Bioinformatics, Saarland University, Saarbrücken, Germany
⁵ Tropical Crop and Commodity Protection Research Unit, Pacific Basin Agricultural Research Center, Hilo, HI, 96720, USA
⁶ Biochemistry & Molecular Medicine, Genome Center, MIND Institute, University of California, Davis, Davis, CA, 95616, USA
⁷ Institute for Medical Biometry and Bioinformatics, University Hospital Düsseldorf, Düsseldorf, Germany
⁸ Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, UK
⁹ CAS Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, Beijing, China
¹⁰ Center for Evolution and Medicine, Arizona State University, Tempe, AZ, USA
¹¹ Department of Molecular Microbiology, Washington University in St. Louis School of Medicine, St. Louis, MO, 63110, USA
¹² Research Department, Sidra Medicine, Doha, Qatar
¹³ Computational Biology Institute, The George Washington University, Washington, DC, 20052, USA
¹⁴ Department of Obstetrics & Gynecology, Baylor College of Medicine, Houston, TX, 77030, USA
¹⁵ Molecular Genetics Section, Laboratory of Neurogenetics, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
¹⁶ Beykoz Institute of Life Sciences and Biotechnology, Bezmialem Vakif University, Beykoz, Istanbul, Turkey
¹⁷ Department of Technology, Savitribai Phule Pune University, Pune, Maharashtra, India
¹⁸ Applied and Translational Neurogenomics Group, VIB Center for Molecular Neurology, Antwerp, Belgium
¹⁹ Applied and Translational Neurogenomics Group, Department of Biomedical Sciences, University of Antwerp, Antwerp, Belgium
²⁰ Department of Computer Science, Rice University, Houston, TX, USA
²¹ Association of Public Health Labs, Centers for Disease Control and Prevention, Downey, CA, USA
²² National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20892, USA
²³ Department of Molecular Virology and Microbiology, Baylor College of Medicine, Houston, TX, 77030, USA
²⁴ EMBL Heidelberg, Genome Biology Unit, Heidelberg, Germany
²⁵ German Center for Neurodegenerative Diseases (DZNE), Tübingen, Germany
²⁶ Centre for Bioinformatics, Pondicherry University, Pondicherry, India
²⁷ University of California Berkeley, Berkeley, CA, USA
²⁸ University of Warsaw, Warsaw, Poland
²⁹ Department of Zoology, University of Delhi, Delhi, India
³⁰ University of Applied Sciences Upper Austria - FH Hagenberg, Mühlkreis, Austria
³¹ Center for Biotechnology, University of Nebraska-Lincoln, Lincoln, Nebraska, 68588, USA
³² Department of Biology, University of Konstanz, Konstanz, Germany
³³ Human Genetics Laboratory, National Institute of Genetics, Japan, Mishima City, Japan
³⁴ Department of Biochemistry, University of Pretoria, Pretoria, South Africa
³⁵ Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
³⁶ ICAR-NIVEDI, Bangalore, Karnataka, India
³⁷ Department of Biotechnology, The University of Burdwan, West Bengal, India
³⁸ Genetic Reagent Development Unit, Medical & Biological Laboratories Co., Ltd., Tokyo, Japan
³⁹ Max Planck Institute for Evolutionary Biology, Plön, Germany
⁴⁰ Weill Cornell Medicine, New York, NY, USA
⁴¹ Hoffmann-La Roche Limited, Regions, Diagnostics & Research (RDR), Mississauga, Canada
⁴² Center of Integrative Genetics (CIGENE), Faculty of Biosciences, Norwegian University of Life Sciences, Ås, Norway
⁴³ School of Biochemical Engineering, Indian Institute of Technology (BHU), Varanasi, Uttar Pradesh, India
⁴⁴ University of Applied Sciences Upper Austria - FH Hagenberg, Hagenberg im Mühlkreis, Austria
⁴⁵ Max Planck Institute for Molecular Genetics, Berlin, Germany
⁴⁶ Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
⁴⁷ School of Biology, University of St Andrews, St Andrews, UK
⁴⁸ Department of Statistical Science, George Mason University, Fairfax, Virginia, USA
⁴⁹ DNAnexus, Mountain View, CA, USA

Kimberly Walker
Roles: Formal Analysis, Methodology, Software, Writing – Original Draft Preparation, Writing – Review & Editing

Divya Kalra
Roles: Formal Analysis, Methodology, Software, Writing – Original Draft Preparation, Writing – Review & Editing

Rebecca Lowdon
Roles: Formal Analysis, Methodology, Software, Writing – Original Draft Preparation, Writing – Review & Editing

Guangyi Chen
Roles: Formal Analysis, Methodology, Software, Writing – Original Draft Preparation, Writing – Review & Editing

David Molik
Roles: Formal Analysis, Methodology, Software, Visualization, Writing – Review & Editing

Daniela C. Soto
Roles: Formal Analysis, Methodology, Software, Visualization, Writing – Review & Editing

Fawaz Dabbaghie
Roles: Formal Analysis, Methodology, Software, Writing – Review & Editing

Ahmad Al Khleifat
Roles: Formal Analysis, Methodology, Software, Writing – Review & Editing

Medhat Mahmoud
Roles: Formal Analysis, Methodology, Software, Writing – Review & Editing

Luis F Paulin
Roles: Formal Analysis, Methodology, Software, Writing – Review & Editing

Muhammad Sohail Raza
Roles: Formal Analysis, Methodology, Software, Writing – Review & Editing

Susanne P. Pfeifer
Roles: Formal Analysis, Methodology, Software, Writing – Review & Editing

Daniel Paiva Agustinho
Roles: Formal Analysis, Methodology, Software

Elbay Aliyev
Roles: Formal Analysis, Methodology, Software

Pavel Avdeyev
Roles: Formal Analysis, Methodology, Software

Enrico R. Barrozo
Roles: Formal Analysis, Methodology, Software

Sairam Behera
Roles: Formal Analysis, Methodology, Software

Kimberley Billingsley
Roles: Formal Analysis, Methodology, Software

Li Chuin Chong
Roles: Formal Analysis, Methodology, Software

Deepak Choubey
Roles: Formal Analysis, Methodology, Software

Wouter De Coster
Roles: Formal Analysis, Methodology, Software

Yilei Fu
Roles: Formal Analysis, Methodology, Software

Alejandro R. Gener
Roles: Formal Analysis, Methodology, Software

Timothy Hefferon
Roles: Formal Analysis, Methodology, Software

David Morgan Henke
Roles: Formal Analysis, Methodology, Software

Wolfram Höps
Roles: Formal Analysis, Methodology, Software

Anastasia Illarionova
Roles: Formal Analysis, Methodology, Software

Michael D. Jochum
Roles: Formal Analysis, Methodology, Software

Maria Jose
Roles: Formal Analysis, Methodology, Software

Rupesh K. Kesharwani
Roles: Formal Analysis, Methodology, Software

Sree Rohit Raj Kolora
Roles: Formal Analysis, Methodology, Software

Jędrzej Kubica
Roles: Formal Analysis, Methodology, Software

Priya Lakra
Roles: Formal Analysis, Methodology, Software

Damaris Lattimer
Roles: Formal Analysis, Methodology, Software

Chia-Sin Liew
Roles: Formal Analysis, Methodology, Software

Bai-Wei Lo
Roles: Formal Analysis, Methodology, Software

Chunhsuan Lo
Roles: Formal Analysis, Methodology, Software

Anneri Lötter
Roles: Formal Analysis, Methodology, Software

Sina Majidian
Roles: Formal Analysis, Methodology, Software

Suresh Kumar Mendem
Roles: Formal Analysis, Methodology, Software

Rajarshi Mondal
Roles: Formal Analysis, Methodology, Software

Hiroko Ohmiya
Roles: Formal Analysis, Methodology, Software

Nasrin Parvin
Roles: Formal Analysis, Methodology, Software

Carolina Peralta
Roles: Formal Analysis, Methodology, Software

Chi-Lam Poon
Roles: Formal Analysis, Methodology, Software

Ramanandan Prabhakaran
Roles: Formal Analysis, Methodology, Software

Marie Saitou
Roles: Formal Analysis, Methodology, Software

Aditi Sammi
Roles: Formal Analysis, Methodology, Software

Philippe Sanio
Roles: Formal Analysis, Methodology, Software

Nicolae Sapoval
Roles: Formal Analysis, Methodology, Software

Najeeb Syed
Roles: Formal Analysis, Methodology, Software

Todd Treangen
Roles: Formal Analysis, Methodology, Software

Gaojianyong Wang
Roles: Formal Analysis, Methodology, Software

Tiancheng Xu
Roles: Formal Analysis, Methodology, Software

Jianzhi Yang
Roles: Formal Analysis, Methodology, Software

Shangzhe Zhang
Roles: Formal Analysis, Methodology, Software

Weiyu Zhou
Roles: Formal Analysis, Methodology, Software

Fritz J Sedlazeck
Roles: Conceptualization, Formal Analysis, Methodology, Resources, Software, Writing – Original Draft Preparation, Writing – Review & Editing

Ben Busby
Roles: Conceptualization, Formal Analysis, Methodology, Resources, Software, Writing – Original Draft Preparation, Writing – Review & Editing

This article is included in the Agriculture, Food and Nutrition gateway.

This article is included in the Emerging Diseases and Outbreaks gateway.

This article is included in the Hackathons collection.

This article is included in the Python collection.

This article is included in the Max Planck Society collection.

Abstract

In October 2021, 59 scientists from 14 countries and 13 U.S. states collaborated virtually in the Third Annual Baylor College of Medicine & DNAnexus Structural Variation hackathon. The goal of the hackathon was to advance research on structural variants (SVs) by prototyping and iterating on open-source software. This led to nine hackathon projects focused on diverse genomics research interests, including various SV discovery and genotyping methods, SV sequence reconstruction, and clinically relevant structural variation, including SARS-CoV-2 variants. Repositories for the projects that participated in the hackathon are available at https://github.com/collaborativebioinformatics.

Keywords

Structural variants, k-mer, Covid-19, Long-reads, Tomatoes, Cancer, Viral integration, Hackathon, NGS

Corresponding authors: Kimberly Walker, Divya Kalra, Rebecca Lowdon, Guangyi Chen, Fritz J Sedlazeck, Ben Busby

Competing interests: Ben Busby is a full-time employee of DNAnexus. Rebecca Lowdon is a full-time employee of Bayer Crop Science. Ramanandan Prabhakaran is a full-time employee of Hoffmann-La Roche Limited. Luis F Paulin is sponsored by Genentech, Inc. Wouter De Coster has received travel reimbursement and free consumables from ONT. FJS received research support from ONT and PacBio. Alejandro Rafael Gener is an editorial board member of AIDS, and has received poster bursaries from ONT in 2019.

Grant information: Timothy Hefferon is supported by the intramural research program of the National Library of Medicine. AAK is funded by ALS Association Milton Safenowitz Research Fellowship, The Motor Neurone Disease Association (MNDA) Fellowship (Al Khleifat/Oct21/975-799) and The NIHR Maudsley Biomedical Research Centre. SPP is supported by a National Science Foundation CAREER grant (DEB-2045343). Marie Saitou is supported by The Research Council of Norway (SalmoSV, grant number 325874). Wouter De Coster is supported by a postdoctoral fellowship from the FWO (1233221N). Sina Majidian is supported by the Swiss National Science Foundation, Grant number 186397. David Molik (DCM) is supported by the USDA Agricultural Research Service HQ Research Associate program in Big Data. Shangzhe Zhang is funded by China Scholarship Council PhD scholarship 202106180022. ARG: “This publication was supported by Cooperative Agreement Number NU60OE000104-02, funded by the Centers for Disease Control and Prevention through the Association of Public Health Laboratories. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the Centers for Disease Control and Prevention or the Association of Public Health Laboratories.”

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2022 Walker K et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The author(s) is/are employees of the US Government and therefore domestic copyright protection in USA does not apply to this work. The work may be protected under the copyright laws of other jurisdictions when used in those jurisdictions.

How to cite: Walker K, Kalra D, Lowdon R et al. The third international hackathon for applying insights into large-scale genomic composition to use cases in a wide range of organisms [version 1; peer review: 1 approved, 3 approved with reservations]. F1000Research 2022, 11:530 (https://doi.org/10.12688/f1000research.110194.1) First published: 16 May 2022, 11:530 (https://doi.org/10.12688/f1000research.110194.1) Latest published: 16 May 2022, 11:530 (https://doi.org/10.12688/f1000research.110194.1)

Introduction

One of the processes by which genomes incur deleterious changes is commonly linked to the genetic signatures known as structural variants (SVs). SVs are large genomic alterations, where large is typically (and somewhat arbitrarily) defined as encompassing at least 50 base pairs (bp). These genomic variants are typically classified as deletions, duplications, insertions, inversions, and translocations, describing different combinations of DNA gains, losses, or rearrangements. Copy number variations (CNVs) are a particular subtype of SVs mainly represented by deletions and duplications. SVs are typically described as single events, although more complex scenarios involving combinations of SV types exist.¹,² Understanding how and why SVs occur can provide a deeper understanding of the evolutionary processes driving species divergence and phenotypic adaptation, the genomic processes leading to genetic variation, and the etiologies of plant and animal diseases.³ With a recent deluge of available genomic data, SVs are an optimal target for computational biology research.⁴

In October 2021, 59 researchers from 14 countries participated virtually in the third Baylor College of Medicine & DNAnexus hackathon, focusing on interrelated topics such as SVs, short tandem repeats (STRs), k-mer profiling, viruses, reference refinement and annotation. The hackathon groups addressed questions around the use of SVs in the localization and understanding of quantitative trait loci (QTL), reference-free analysis of SVs, parallelization of SV workflows, assessing and refining the quality of detected SVs, the use of SVs in understanding adaptation in viruses, and understanding genetic signatures of diseases through SVs. The international hackathon focused on nine software projects to answer these questions, eight of which we present in this paper: STRdust, kTom, INSeption, GeneVar2, cov2db, K-var, Imavirus, and a Reference Panel Generator (RPG) for diverse sequencing data analysis. Several emergent themes became apparent over the course of the hackathon.

QTLs link a phenotypic trait to a local genomic region and, in the broadest definition, to a molecular change affecting a phenotype.⁵ A direct connection can be drawn between some SVs and QTLs. Linking traits and their genetic underpinnings is a common practice in the fields of agricultural genomics, molecular evolution, and genetic disease research.⁶ Structural variation is one possible genomic change that could result in a QTL. This year’s hackathon featured work on tomatoes and other plants, which provided an alternative viewpoint to the generally human-focused research of previous hackathons. Such cross-disciplinary research allows disparate groups working on similar problems to push the envelope of what is possible with current technologies.

Nucleotide sequence substrings of length k (k-mers) continue to prove useful in SV work and in genomics; however, the time needed to assess the frequency of SVs presents a resource problem.⁷ Reducing the computational resources required to complete an SV assessment in a genome would allow greater amounts of SV data to be processed in genomic workflows. Many bioinformatic tools currently used to locate genomic SVs use a sliding window alignment technique, which can be time-consuming.⁸,⁹ However, by implementing a k-mer based approach that creates a pool of reference k-mers of known SVs, the speed of annotating variation in new genomes might be increased.¹⁰,¹¹ k-mers have also been used in alignment-free methods, bypassing the need for reference genomes.¹²

A portion of the hackathon focused on virus work. At the time of the hackathon, the COVID-19 pandemic was ongoing, and the question of which SVs are present and how they might change the behavior of SARS-CoV-2 was unresolved.

Together, the projects of this hackathon represent a range of fields, a range of academic, industry, and government researchers, and a range of desired impacts in the field of SV analysis. Topical introductions to the specific work of each group can be found below, except for “nibSV”, which was reported previously¹¹ and did not achieve significant progress.

STRdust: Detect and genotype short tandem repeats

Short tandem repeats (STRs) (i.e., repeated instances of short 2-6 bp DNA motifs) are widespread in the genomes of most organisms. Due to their highly polymorphic nature, STRs are frequently employed in population and evolutionary genomic studies ranging from genealogy to forensics and disease diagnostics.¹³ For example, in humans, expansions in functional STRs have been linked to many neurological and developmental disorders,¹⁴,¹⁵ whereas in plants, STRs have been found to impact several traits important to agriculture, including growth rate and yield.¹⁶ Yet, despite their importance, STRs remain relatively poorly characterized in most species. On the one hand, second-generation sequencing platforms (e.g., Illumina¹⁷ (RRID:SCR_010233)) limit our view of STR variation to what fits within the read length, owing both to the short length of the sequencing reads produced and to frequent amplification biases (such as GC biases and over-/under-representation of certain reads on a genome-wide scale). On the other hand, third-generation sequencing platforms (namely, PacBio (PacBio Sequel II System,¹⁸ (RRID:SCR_017990)) and Oxford Nanopore Technologies (ONT)¹⁹ (RRID:SCR_003756)) allow for the generation of single-molecule reads spanning tens to hundreds of kilobases in length, but error rates (~1% in PacBio HiFi reads and ~10–15% in ONT²⁰) continue to hamper reliable STR detection. To mitigate this issue, several long-read STR calling methods have been developed in recent years, including PacmonSTR²¹ (RRID:SCR_002796), NanoSatellite,²² TRiCoLOR²³ (RRID:SCR_018801), and Straglr²⁴; however, their usability remains limited due to platform and/or computational demands. In order to address these shortcomings, we introduce STRdust, a tool to accurately detect and genotype STRs from long reads.
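The definition of an STR given above (a 2-6 bp motif repeated in tandem) can be illustrated with a short regular-expression scan. This is a deliberately naive sketch and not the STRdust algorithm: it only finds exact repeats, so it would miss repeats interrupted by the sequencing errors discussed above. The function name `find_strs` and the `min_copies` threshold are assumptions made for this example.

```python
import re


def find_strs(sequence: str, min_copies: int = 3) -> list:
    """Find exact tandem repeats of 2-6 bp motifs.

    Returns (start, motif, copies) tuples with 0-based start positions.
    """
    if min_copies < 2:
        raise ValueError("min_copies must be at least 2")
    # Group 1 is the motif (shortest candidate first, because the quantifier is
    # lazy); the backreference requires min_copies - 1 or more further copies.
    pattern = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_copies - 1))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            # Skip homopolymer runs such as AAAAAA, which are not 2-6 bp motifs.
            continue
        copies = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, copies))
    return hits


# Toy read with four tandem copies of CAG flanked by unrelated bases.
cag_hits = find_strs("TTCAGCAGCAGCAGTT")
```

Genotyping would additionally require comparing the copy number per read across haplotypes, which this sketch does not attempt.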

kTom: k-mers for profiling tomato introgressions

The success of commercially cultivated vegetables requires a balance of selection for domestication traits while maintaining genomic diversity and quality characteristics, and this is particularly true for tomato breeding programs.²⁵ Many desirable traits for crops are obtained by crossing elite breeding germplasm to wild relatives that carry a trait of interest (e.g., disease resistance or fruit flavor). This process of moving a genomic region from one species or distantly-related species into another is called introgression.²⁵

Tomato is an important crop and indispensable in the diet of many cultures and regions. The demand for fresh and processed tomatoes makes them one of the most important vegetables grown globally, with >180 million tons of tomatoes produced in 2019 worldwide (FAOSTAT).²⁶

Genetic traits have been moved into cultivated tomatoes over the past several decades of tomato breeding through trait introgression. Identifying and tracking introgressed traits is a crucial function of modern tomato breeding.²⁵ The introgression of traits often occurs as large presence/absence structural variants with novel genes or sequences. Some introgressions can be completely defined by de novo sequencing and assembly, but this can be expensive for many samples and is not always successful for more complex genomic introgressions.² These complex structural variation patterns, coupled with the lack of reference genomes for many wild tomato relatives, complicate the efforts to locate or characterize the introgressed traits in the elite germplasm’s genome. Consequently, most marker sets today rely exclusively on SNPs, which do not always track diverse tomato genetics.²⁷

Here we present kTom, a tool to characterize the k-mer content of re-sequenced genomes and to identify k-mers that are unique to traited samples. kTom is a collection of off-the-shelf tools arranged to allow for a tractable characterization of k-mer frequencies in a population. We used re-sequenced tomato accessions for this demonstration, but the same approach can work for any species. Having a reference-free method to characterize and track introgression sequences will give researchers more agility to understand the nature of important traits.²⁸
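The core idea of finding k-mers unique to traited samples can be sketched with plain set operations. The snippet below is an illustration under simplifying assumptions, not the kTom pipeline itself (which the text describes as a collection of off-the-shelf tools): each sample is represented by a single toy sequence rather than a read set, and the names `kmer_set` and `trait_specific_kmers` are hypothetical.

```python
def kmer_set(sequence: str, k: int) -> set:
    """Return the set of distinct k-mers in a sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def trait_specific_kmers(traited: list, untraited: list, k: int) -> set:
    """k-mers present in every traited sample and absent from all untraited ones."""
    if not traited:
        raise ValueError("at least one traited sample is required")
    # k-mers shared by all samples that carry the trait.
    shared = set.intersection(*(kmer_set(s, k) for s in traited))
    # k-mers seen in any sample that lacks the trait.
    background = set().union(*(kmer_set(s, k) for s in untraited))
    return shared - background


# Toy sequences (hypothetical): both traited samples carry the segment CGTTTG.
traited_samples = ["ACGTTTGA", "CCGTTTGG"]
untraited_samples = ["ACGTAGA"]
candidates = trait_specific_kmers(traited_samples, untraited_samples, k=4)
```

With real re-sequencing data, k-mers caused by sequencing errors would need to be filtered out first, for example by requiring a minimum count per sample, before the set comparison becomes meaningful.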

INSeption: Polishing structural variants

Some types of SVs, such as insertions, play a crucial role in shaping the genome and thus the function of each gene. For example, more than 50 percent of mammalian genomes include repeating DNA sequences known as transposable elements.²⁹ Additionally, insertions can indicate an early tumorigenic event,³⁰ demonstrating a role in disease and making it crucial to identify them accurately.

Read-based SV calling methods broadly fall into the categories of alignment- and assembly-based approaches.² In alignment-based approaches, SVs are inferred from patterns of abnormal read mapping on an existing reference sequence.² Alignment-based approaches are a popular method for calling SVs from both short and long reads, with a multitude of tools developed for both read mapping (e.g., BWA³¹ (RRID:SCR_010910), Minimap2³² (RRID:SCR_018550), and NGMLR³³ (RRID:SCR_017620)) and SV detection (e.g., DELLY³⁴ (RRID:SCR_004603) and SNIFFLES³³ (RRID:SCR_004603)). A downside of alignment-based SV detection lies in the incomplete resolution of complex or large genomic rearrangements or insertions exceeding common read lengths.³⁵ By contrast, assembly-based approaches utilize de novo sequence assemblies computed directly on the sampled reads, circumventing any biases introduced by the use of reference sequences.² SVs are then called by aligning such assemblies against a reference and identifying local incongruencies. Commonly used tools include Canu³⁶ (RRID:SCR_015880) and Flye³⁷ (RRID:SCR_017016) for sequence assembly, Minimap2 and BlasR³⁸ for alignment against a reference, and SGVar³⁵ and Paftools³² for SV calling. Assembly-based approaches can resolve even complex rearrangements and long insertions, but the construction of high-quality, haplotype-resolved assemblies requires thorough quality control and typically a high quality and diversity of data.³⁹

GeneVar2: Gene-centric data browser for structural variants

Next-generation sequencing (NGS) technologies can be a powerful resource for uncovering the underlying genetic causes of diseases, but significant challenges remain for clinicians in SV interpretation and clinical analysis.⁴⁰ Although various tools are available to predict the pathogenicity of a protein-changing variant (a list of these is available at OpenCRAVAT), they do not always agree, further compounding the problem.⁴¹

Here we present GeneVar2: an open access, gene-centric data browser to support structural variant analysis. There are two ways to interact with GeneVar2. First, GeneVar2 takes an input of a gene name or an ID and produces a report that informs the user of all SVs overlapping the gene and any non-coding regulatory elements affecting expression of the gene. Second, users can upload variant call format (VCF) files from their analysis pipelines as input to GeneVar2. GeneVar2 will output clinically relevant information as well as provide useful visualizations of disease ontology and enrichment pathway analysis based on SV types.
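Reporting all SVs that overlap a gene comes down to an interval-overlap query. The following minimal sketch shows that query in isolation; it is not GeneVar2's implementation, and the `Interval` type, the function names, and every coordinate below are hypothetical values chosen for illustration.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    label: str  # gene name or SV type


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two intervals lie on the same chromosome and share a base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def svs_overlapping_gene(gene: Interval, svs: list) -> list:
    """Return the SVs that overlap the gene, preserving input order."""
    return [sv for sv in svs if overlaps(gene, sv)]


# Hypothetical gene and SV calls.
gene = Interval("chr1", 1000, 2000, "GENE_A")
sv_calls = [
    Interval("chr1", 500, 1200, "DEL"),   # overlaps the 5' end of the gene
    Interval("chr1", 2500, 2600, "DUP"),  # downstream, no overlap
    Interval("chr2", 1000, 2000, "INV"),  # same coordinates, other chromosome
]
hits = svs_overlapping_gene(gene, sv_calls)
```

A linear scan like this is adequate for a single gene; querying many genes against a large call set would normally use an indexed structure such as an interval tree.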

cov2db: A low frequency variant database for SARS-CoV-2

Global SARS-CoV-2 sequencing efforts have made a massive genomic dataset available to the public for a variety of analyses. However, the two most common resources are genome assemblies (deposited in GISAID⁴¹ (RRID:SCR_018251) and GenBank⁴² (RRID:SCR_002760), for example) and raw sequencing reads. Both of these limit the quantity of information available, especially with respect to variants found within the SARS-CoV-2 populations. Genome assemblies only contain common variants, which does not reflect the full genomic diversity within a given sample (even a single patient-derived sample represents a viral population within the host⁴³–⁴⁶). Raw sequencing reads, on the other hand, require further analyses in order to extract variant information, and can often be prohibitively large in size.

Thus, we propose cov2db, a database resource for collecting low frequency variant information for available SARS-CoV-2 data. As of October 2021 there were more than 1.2 million SARS-CoV-2 sequencing datasets in the Sequence Read Archive (SRA)⁴⁷ (RRID:SCR_004891) and European Nucleotide Archive (ENA)⁴⁸ (RRID:SCR_006515). Our goal is to provide an easy-to-use query system, and contribute to a database of VCF files that contain variant calls for SARS-CoV-2 samples. We hope that such interactive databases will speed up downstream analyses and encourage collaboration.

K-var: A “fishing” expedition for phenotype associated k-mers

k-mers are commonly used in bioinformatics for genome and transcriptome assembly, error correction of sequencing reads, and taxonomic classification of metagenomes.⁴⁹,⁵⁰ More recently, k-mers have been used for genotyping structural variations in large datasets in a mapping-free manner.⁵¹ Sample comparison based on k-mer profiles provides a computationally efficient, mapping-free way to address key differences between two biological conditions, avoiding the limitations of reference bias, mappability and sequencing errors.⁵²–⁵⁴ Of particular interest are case-control studies, which make it possible to pinpoint genetic loci putatively implicated in a phenotype or a disease.

Here we develop a pipeline that takes a sample’s sequencing data from two distinct conditions (ideally control vs. treatment or two different conditions) as input and compares their k-mer profiles in order to highlight k-mers associated with the phenotype. This approach was tested in a panel of cancer cell lines from the NCI-60 dataset (RRID:SCR_003057) contrasting primary and metastatic tissues to highlight mutational signatures underlying cancer progression.
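One simple way to compare two k-mer profiles is a per-k-mer log2 fold change between normalized counts. The sketch below shows that comparison on toy reads; it is an assumption about how such a contrast could be scored, not the statistic used by K-var, and the function names, the pseudocount, and the fold-change threshold are all hypothetical.

```python
import math
from collections import Counter


def kmer_profile(reads: list, k: int) -> Counter:
    """Count k-mers across all reads of one condition."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def enriched_kmers(case_reads, control_reads, k, min_log2_fc=1.0, pseudocount=1.0):
    """Return {k-mer: log2 fold change} for k-mers enriched in the case condition."""
    case = kmer_profile(case_reads, k)
    control = kmer_profile(control_reads, k)
    # Normalize by the total number of k-mers so sequencing depth cancels out.
    case_total = sum(case.values()) or 1
    control_total = sum(control.values()) or 1
    enriched = {}
    for kmer in case.keys() | control.keys():
        # The pseudocount avoids division by zero for k-mers absent in one condition.
        case_freq = (case[kmer] + pseudocount) / case_total
        control_freq = (control[kmer] + pseudocount) / control_total
        log2_fc = math.log2(case_freq / control_freq)
        if log2_fc >= min_log2_fc:
            enriched[kmer] = log2_fc
    return enriched


# Toy reads (hypothetical): the k-mer AAC appears only in the case condition.
result = enriched_kmers(["AAAC", "AAAC"], ["AAAG", "AAAG"], k=3)
```

A real case-control analysis would need replicate samples and a statistical test with multiple-testing correction rather than a fixed fold-change cutoff.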

Imavirus: Virus integration in disease

Viral infections impact human health as they can lead to short- and long-term diseases,⁵⁵ including cancers. Different forms of cancer are caused by viruses such as human papillomaviruses⁵⁶ and hepatitis B virus capable of integrating into the host genome.⁵⁷ Other viruses such as human immunodeficiency viruses (HIV) integrate into the host genome as a normal part of viral replication, contributing to cancer indirectly, and less commonly directly through insertional mutagenesis.⁵⁸ Knowing exactly where the integration events occur can help researchers and ultimately clinicians to better understand the effect of virus integration in disease.

Common assumptions about integrations are that they are single copy and show an absence of additional structural variability.⁵⁸ Different mechanisms might lead to different insertion site topology. For example, one would expect a difference between natural HIV-1 p31 integrase-mediated integration (insertion + tandem duplication of five bases of host target site) vs. insertion of viral genomic content (after reverse transcription in case of retroviruses like HIV) with host cell’s DNA repair machinery. Such differences might include conservation of viral terminal repeat elements with virus-specific insertion signatures⁵⁹ vs. divergence⁶⁰ from this pattern.

When considering model insertion sites for assay evaluation, insertion site location heterogeneity exists to varying degrees in natural infection (with different mechanisms such as virus-dependent integration vs. host-dependent insertion contributing differently) vs. transgenic model organisms (in the case of the Tg26 HIV-1 transgenic mouse, pronuclear injection and insertion of restriction enzyme-digested pNL4–3⁶¹). NL4–3 is the most common lab strain of HIV-1.⁶²

With advances in sequencing technologies,⁶³^,⁶⁴ high-throughput sequencing data is available to explore viral genome integration space. Integration sites can be detected through identification of breakpoints between host and virus genome(s).⁶⁵ Some integrating viruses can produce run-on transcripts or may participate in trans-splicing between virus exon and downstream host exons.⁶⁶ Integration events have been previously detected by identifying these and other signatures such as chimeric reads in short-read sequencing (single-end and paired-end) and long-read sequencing.⁶⁵^,⁶⁷^–⁷⁵

Here, we suggest tools and a general workflow that can be used for virus integration detection and discuss current caveats in using publicly available datasets for this type of analysis.

RPG: Reference Panel Generator

Despite great advances in our knowledge of NGS data analysis, a diverse, complete reference genome sequence is lacking for humans. This leads to reduced sensitivity for detecting small insertions and deletions (INDELs) and structural variation, incomplete resolution of the architecture of large polymorphic CNVs, and difficulty in correctly calling single nucleotide variants (SNVs) in complex genomic regions. The high-quality Telomere-to-Telomere (T2T) CHM13 long-read genome assembly from the T2T consortium⁷⁶ could be utilized as a reference panel to universally improve read mapping and variant calling.

Currently, we aim to provide a revised version of CHM13 reference panels along with an RPG pipeline based on 1000 Genomes Project⁷⁷ (RRID:SCR_006828) common allele calls and those abnormally avoided stop codons. Overall, such reference panels will greatly improve future population-scale diverse sequencing data analysis and correctly identify hundreds of thousands of novel per-sample variants in clinical settings.

Methods

DNAnexus (RRID:SCR_011884), a cloud platform, was used to run the code developed at the hackathon. It provides flexibility to run a wide array of software applications either on a cloud workstation (default number of cores = 8) or in an interactive environment such as a Jupyter notebook (default number of cores = 16). One of these two resources was used to run the software during the hackathon, unless otherwise specified.

STRdust

STRdust¹⁴² parses the CIGAR (a compressed representation of an alignment that is used in the SAM file format) of each read, either genome-wide or in user-specified loci, in order to identify sufficiently large (>15 bp) insertions or soft-clipped bases which could indicate the presence of an enlarged STR. The sequence of those candidate-expansions is extracted, along with 50 bp of flanking sequence. Leveraging the phased input data, such insertions are combined per haplotype when multiple of these are found close by (within 50 bp) across multiple reads. The combination is done using spoa 4.0.7,⁷⁸ which generates a multiple sequence alignment and from that a consensus sequence. The obtained consensus sequence, in which inaccuracies inherent to the long read sequencing technologies should be reduced, is then used in mreps 6.2.01,⁷⁹ which will assess the repetitive character of the sequence and identify the repeat unit (Figure 1).
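The CIGAR-scanning step described above can be illustrated with a minimal sketch. This is not the STRdust implementation: the function name `candidate_expansions` and its return format are hypothetical, and only the >15 bp threshold is taken from the text. The sketch walks a CIGAR string, tracks the reference and read coordinates, and reports insertions and soft-clips above the threshold.

```python
import re

# A CIGAR string is a series of (run length, operation) pairs.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that advance the position on the reference / on the read.
CONSUMES_REF = set("MDN=X")
CONSUMES_QUERY = set("MIS=X")


def candidate_expansions(cigar, ref_start, min_len=15):
    """Return (kind, ref_pos, query_pos, length) for large insertions and soft-clips.

    ref_start is the 0-based leftmost reference position of the alignment.
    min_len mirrors the >15 bp threshold described in the text.
    """
    hits = []
    ref_pos = ref_start
    query_pos = 0
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in ("I", "S") and length > min_len:
            kind = "insertion" if op == "I" else "soft_clip"
            hits.append((kind, ref_pos, query_pos, length))
        if op in CONSUMES_REF:
            ref_pos += length
        if op in CONSUMES_QUERY:
            query_pos += length
    return hits


if __name__ == "__main__":
    # 20 soft-clipped bases, 100 matches, a 30 bp insertion, 50 matches.
    for hit in candidate_expansions("20S100M30I50M", ref_start=1000):
        print(hit)
```

The reported read offset and length are what one would need to extract the candidate sequence (plus flanking bases) from the read before consensus building.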

Figure 1. STRdust workflow.

During the preparation phase, reads (either simulated or sequenced) are aligned to the corresponding reference genome with Minimap2³² and the mapped reads are then phased using longshot. Next, STRdust identifies insertions and soft-clips from the Concise Idiosyncratic Gapped Alignment Report (CIGAR) string which identify regions of possible short tandem repeat (STR) expansion. These regions are further analyzed by performing de novo assembly using spoa and assessing the repetitiveness of the region with mreps. STRdust outputs the STR genotype as a tab-separated table for further analysis. We evaluated STRdust by comparing the results of simulated STR expansions produced by SimiSTR based on the human (Genome Reference Consortium Human Build 38, GRCh38) and tomato (Solanum lycopersicum 4.0, SL4.0) reference genomes, to two novel tools: Straglr²⁴ and TRiCoLOR.²³

STRdust was tested against simulated STR datasets produced by SimiSTR. SimiSTR modified the GRCh38 (human) and SL4.0 (tomato) reference genome assemblies. Additional variation (SNVs) was introduced with SURVIVOR 1.0.7⁸⁰ at a rate of 0.001.

Long reads were simulated using SURVIVOR⁸⁰ for the GRCh38 (human) and SL4.0 (tomato) STR-modified genomes. Mapping was performed twice with Minimap2³² 2.24 (with and without the -Y parameter), and phasing was done with longshot 0.4.1.⁸¹ Default parameters were used for all tools unless otherwise mentioned. STRdust results were compared to TRiCoLOR 1.1²³ and Straglr 1.1.1²⁴ using default parameters. Figure 1 shows the workflow of STRdust described in this section.

STRdust is straightforward to run. After cloning the repository, one can simply pass the BAM file to the Python script as follows: python3 STRdust/STRDust.py mapped_long_reads.bam -o results_dir. For further details on installation and usage, see our GitHub page.

kTom

kTom (k-mers for profiling Tomato introgressions)¹⁴³ aims to use k-mers to tag introgressions in elite tomato germplasm.

Current implementation

The kTom workflow (Figure 2a) processes re-sequenced genomes (only tested with Illumina short reads to date) to generate k-mer profiles per sample and calculates the population frequencies of these k-mers. Our use case is focused on k-mers with low-mid range frequencies, which we believe should capture k-mers unique to introgressed traits in our test population. Therefore, we use these k-mers to generate a distance matrix and understand the relatedness of samples.
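As an illustrative sketch of this idea (not the kTom code), the following computes the fraction of samples carrying each k-mer, keeps k-mers in a low-to-mid frequency band, and derives pairwise distances from them. The band limits `low` and `high` and the use of Jaccard distance are assumptions made for the example; the text does not specify either.

```python
from itertools import combinations


def population_frequency(sample_kmers):
    """Fraction of samples in which each k-mer is present.

    sample_kmers maps a sample name to the set of k-mers observed in it.
    """
    n_samples = len(sample_kmers)
    counts = {}
    for kmers in sample_kmers.values():
        for kmer in kmers:
            counts[kmer] = counts.get(kmer, 0) + 1
    return {kmer: c / n_samples for kmer, c in counts.items()}


def select_mid_frequency(freqs, low=0.1, high=0.6):
    """Keep k-mers whose population frequency lies in [low, high] (assumed band)."""
    return {kmer for kmer, f in freqs.items() if low <= f <= high}


def jaccard_distances(sample_kmers, selected):
    """Pairwise Jaccard distance between samples, restricted to selected k-mers."""
    dist = {}
    for a, b in combinations(sorted(sample_kmers), 2):
        ka = sample_kmers[a] & selected
        kb = sample_kmers[b] & selected
        union = ka | kb
        dist[(a, b)] = (1.0 - len(ka & kb) / len(union)) if union else 0.0
    return dist


if __name__ == "__main__":
    # Toy 3-mers standing in for real k-mer sets.
    samples = {
        "s1": {"AAA", "AAC", "ACG"},
        "s2": {"AAA", "AAC"},
        "s3": {"AAA", "TTT"},
        "s4": {"AAA"},
    }
    freqs = population_frequency(samples)
    selected = select_mid_frequency(freqs, low=0.2, high=0.6)
    print(sorted(selected))
    print(jaccard_distances(samples, selected))
```

In the toy example, the k-mer shared by every sample is excluded from the distance calculation, so only k-mers that differ between accessions contribute.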

Figure 2. (a). kTom workflow, with major steps for individual sample and population data processing. (uniq = get unique reads; dedup = deduplicate reads). (b). k-mer frequency heatmap from kTom.

Frequency of selected k-mers in each accession analyzed. Differential k-mer frequencies are apparent in this view. Depending on the nature of the accessions, this view may provide a first glimpse into genetic sequences underlying structural variations that differentiate the accessions.

To prototype the kTom workflow, we used 40 Whole Genome Shotgun (WGS) datasets from the 84 tomato or wild species accessions generated by The 100 Tomato Genome Sequencing Consortium⁸² (BioProject PRJEB5235).

Data processing

Raw FASTQ files were quality-checked with FastQC version 0.11.9⁸³(RRID:SCR_014583) and trimmed with Flexbar version 1.4.0⁸⁴(RRID:SCR_013001), clipping five bases on 5′ and 3′ ends and keeping reads with quality score > 20 and a minimum length of 50. k-mers were counted using functions in Jellyfish version 2.3.0⁸⁵(RRID:SCR_005491) (jellyfish count followed by jellyfish histo) with kmersize = 21. The k-mers histogram was generated with Genomescope version 1.0.0⁸⁶(RRID:SCR_017014). k-mer counts for individual samples were then aggregated into a k-mer frequency matrix of k-mers as rows and samples as columns. This frequency matrix can be visualized as an interactive heatmap (example Figure 2b) by running kmer_heatmap.R which uses ComplexHeatmap version 2.8.0⁸⁷ (RRID:SCR_017270), InteractiveComplexHeatmap version 1.1.3⁸⁸ and tidyverse v1.3.1⁸⁹ (RRID:SCR_019186) R packages.

INSeption

INSeption¹⁴⁴ was tested using HiFi reads for sample HG002 (RRID:CVCL_1C78) retrieved from the Genome in a Bottle (GIAB) project.⁹⁰ The reads were aligned against GRCh37 using Minimap2³² and Sniffles 1.0.12³³ was used to call SVs. We filtered out SVs that were supported by fewer than 10 reads using bcftools 1.12⁹¹ (RRID:SCR_005227). We extracted insertions that are larger than 999 nucleotides. For these insertions, no reads span the entire insertion. Additionally, we extracted reads that were not aligned to the reference using samtools 1.14⁹¹ (RRID:SCR_002105) with the -f 4 option. Finally, we extracted reads that support each insertion studied: first, we extracted read names from the SV file using bcftools and grouped them by SV ID, followed by extracting the FASTA sequence from the binary alignment map (BAM) file using samtools and awk (Figure 3a, left-hand side).
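The filtering and grouping logic above was carried out with bcftools, samtools and awk. Purely as an illustration, the same selection can be sketched in Python. The sketch assumes that the VCF INFO column carries `SVTYPE`, `SVLEN`, `RE` (number of supporting reads) and `RNAMES` (comma-separated supporting read names); these tag names are assumptions about the caller's output and should be checked against the actual VCF header. The function name is hypothetical.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict (flags map to an empty string)."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def long_supported_insertions(vcf_lines, min_support=10, min_len=1000):
    """Map SV ID -> supporting read names for long, well-supported insertions.

    min_support and min_len mirror the thresholds in the text
    (at least 10 supporting reads, larger than 999 nucleotides).
    """
    kept = {}
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        cols = line.rstrip("\n").split("\t")
        sv_id, info = cols[2], parse_info(cols[7])
        if info.get("SVTYPE") != "INS":
            continue
        if int(info.get("RE", 0)) < min_support:
            continue
        if abs(int(info.get("SVLEN", 0))) < min_len:
            continue
        names = info.get("RNAMES", "")
        kept[sv_id] = names.split(",") if names else []
    return kept


if __name__ == "__main__":
    demo = [
        "##fileformat=VCFv4.2",
        "1\t100\tsv1\tN\t<INS>\t.\tPASS\tSVTYPE=INS;SVLEN=1500;RE=12;RNAMES=readA,readB",
        "1\t200\tsv2\tN\t<INS>\t.\tPASS\tSVTYPE=INS;SVLEN=1500;RE=5;RNAMES=readC",
    ]
    print(long_supported_insertions(demo))
```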

Figure 3. (a). INSeption workflow. Showing the tools used in the pipeline to detect insertion by extracting clipped reads (A), extracting unaligned reads (B), and then assembly (C) or clustering, assembling and aligning (D). SV: structural variant, INS:insertion, BAM: binary alignment map. (b). INSeption workflow, a graphical representation of the pipeline in (3a) showing two insertions, red and orange, in (A) and (B) we extract the unaligned reads (C), cluster them into groups (D), assemble each cluster (E) and finally align clipped reads to the assembled cluster (F).

Allele frequency

For an analysis of the allele frequency (AF) for each mutation type, we created a Python⁹²^,⁹³ (RRID:SCR_008394) script (SVStat.py) that takes a VCF as input. For each SV type, it stores each AF and how often this AF was encountered. These data are then visualized in n different plots (with n representing the number of SV types), where the x-axis represents the AF and the y-axis represents the number of times each AF occurs for that SV type.
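The counting step can be sketched as follows. This is an independent illustration rather than the SVStat.py source: it assumes that each record carries `SVTYPE` and `AF` tags in the INFO column, takes the first value of a multi-valued AF, and rounds AF to two decimals for binning (an assumption made here). Plotting is left out.

```python
from collections import Counter, defaultdict


def af_histogram(vcf_lines):
    """Return {SV type: Counter({AF: number of records with that AF})}."""
    hist = defaultdict(Counter)
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        info = {}
        for item in line.rstrip("\n").split("\t")[7].split(";"):
            key, _, value = item.partition("=")
            info[key] = value
        if "SVTYPE" not in info or "AF" not in info:
            continue  # record lacks the tags this sketch relies on
        af = round(float(info["AF"].split(",")[0]), 2)
        hist[info["SVTYPE"]][af] += 1
    return hist


if __name__ == "__main__":
    demo = [
        "1\t100\ta\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;AF=0.5",
        "1\t200\tb\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;AF=0.5",
        "1\t300\tc\tN\t<INS>\t.\tPASS\tSVTYPE=INS;AF=0.1",
    ]
    for sv_type, counts in sorted(af_histogram(demo).items()):
        print(sv_type, sorted(counts.items()))
```

Each per-type counter corresponds to one plot: its keys give the x-axis (AF) and its values the y-axis (number of occurrences).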

Clustering unmapped reads

To be able to assemble a sequence from all unmapped reads, we tried several approaches. We attempted to identify clusters of reads using the LROD version 1.0⁹⁴ package, which we found unsuitable for our purposes due to long runtimes. More successfully, we used the program CARNAC-LR version 1.0.0⁹⁵ to build clusters of reads using the Minimap2 version 2.22 aligner³² and a subsequent k-mer based clustering approach. As output, for each cluster, all sequences and their IDs were exported into a FASTA file. On our testing dataset, we identified 64 such clusters. These clustered read files are the basis for the subsequent sequence assembly step (Figure 3a right-hand side).

Delegate read clusters to the sequence assembler

All cluster.fasta files were loaded into the assembler programs (Flye version 2.9³⁷ and Spades version 3.15.3,⁹⁶ see software availability for input parameters) with another python script (clusterAssemble.py). This script has the ability to run a single cluster.fasta file or a whole batch within a directory. The inputs are the program location, program name, an optional flag: multi (for running the batch of clusters), an input directory or an input file, and an output directory (Figure 3a right-hand side continued).

Identifying integration sites for assembled clusters

Having successfully assembled contigs for N = 15 read clusters using Canu v2.2³⁶(RRID:SCR_015880), we searched for overlap of these contigs with the breakpoint regions of 30 previously identified long insertion sites. We reasoned that for each assembled contig which represents an insertion sequence, reads supporting the insertion breakpoint should also overlap with that specific contig. To find such contigs of interest, we first extracted the sequence reads (n = 604) which support a long insertion and therefore overlap at least one insertion breakpoint. This set of reads was then aligned against all 15 assembled contigs using Minimap2 (parameters: -x map-hifi -P), using the contigs as a ‘pseudo’ reference. Finally, we manually inspected the resulting alignments to identify long (>3 kbp) contigs overlapping reads (Figure 3b).

GeneVar2

GeneVar2¹⁴⁵ is an update of GeneVar¹¹ that helps inform clinical interpretation of structural variants (Figure 4). It has expanded options allowing users to upload a VCF file, while maintaining its gene name-based search functionality on its web interface. GeneVar2 annotates the uploaded VCF file with a number of items which can then be downloaded by the user. Annotations include: SV allele frequency from gnomAD-SV⁸⁵(RRID:SCR_014964) and probability of being loss-of-function intolerant (pLI) from gnomAD; transcripts and coding regions of the impacted gene from GENCODE (v35)⁹⁷; gene associations with corresponding phenotype annotation from OMIM¹⁰⁰; and known clinical SVs and their pathogenicity from dbVar.⁸⁶

Figure 4. High-level outline of GeneVar2 workflow.

Green boxes represent the initial features of GeneVar, implemented last year, while blue boxes represent new features implemented in GeneVar2 during this hackathon. (VCF: variant call format, SV: structural variation, CDS: coding sequence).

Additionally, when a user uploads a VCF file, an option to download graphs for visualizing SVs in the dataset is available. The annotated VCF can also be downloaded in an alternate format, comma-separated values (CSV). GeneVar2, written in R, is available on GitHub (Software availability section) with detailed instructions on installation and usage. GeneVar2 is a web-based application that can also be installed by an individual on their platform to run on the command line and launch locally. Instructions on how to build and run GeneVar2 on DNAnexus can be found here.

When users launch GeneVar2 as a web application, they can enter individual gene names (HGNC⁹⁸(RRID:SCR_002827)), Ensembl⁹⁹ (RRID:SCR_002344) gene accessions (ENSG) or Ensembl transcript accessions (ENST) to extract the SVs overlapping their gene of choice. GeneVar2 will output the gene-level summary with detailed information about the SVs within the gene. It links the gene information to databases such as OMIM¹⁰⁰ (RRID:SCR_006437), GTEx¹⁰¹(RRID:SCR_013042) and gnomAD; allele frequency is reported based on gnomAD genomes and exomes.

If users first need to call SVs on their samples, the developers recommend Parliament2¹⁰²(RRID:SCR_019187). Parliament2 runs a combination of tools to generate structural variant calls on whole-genome sequencing data. It can run the following callers: Breakdancer¹⁰³(RRID:SCR_001799), Breakseq2,¹⁰⁴ CNVnator¹⁰⁵(RRID:SCR_010821), Delly2,³⁴ Manta,¹⁰⁶ and Lumpy¹⁰⁷(RRID:SCR_003253). Because of synergies in how the programs use computational resources, these are all run in parallel. Parliament2 will produce the outputs of each of the tools for subsequent investigation. See the Parliament2 GitHub page for further details.

After users upload a VCF file containing SVs, GeneVar2 annotates each entry with the genes overlapping the SV and the allele frequency from gnomAD-SV, and assigns a clinical rank to all the SVs in the VCF relative to each other. This is accomplished using the main annotation script annotate_vcf.R. The final annotated file is available for download in VCF and CSV formats. For gene and disease ontology and pathway analysis, GeneAnnotationFromCSV.R supports enrichment analysis using KEGG¹⁰⁸^–¹¹⁰(RRID:SCR_012773), Disease Ontology (DO),¹¹¹ Network of Cancer Gene¹¹² and Disease Gene Network (DisGeNET)¹¹³ (RRID:SCR_006178). In addition, several visualization methods are provided by the Bioconductor packages clusterProfiler¹¹⁴ (RRID:SCR_016884) and enrichplot¹¹⁵ to help interpret enrichment and disease ontology results.

Alternatively, users who prefer the command line can install GeneVar2 on their own platform. Users should have R version 4.1 or higher installed. In addition, sveval, a custom R library, must be installed; it can be accessed via BiocManager using ‘jmonlong/sveval’. Scripts and instructions can be found in GeneVar2’s GitHub repository, listed in the software availability section.

cov2db

cov2db¹⁴⁶ is implemented as a set of modular scripts which enable the user to annotate and reformat their original VCF files into mongoDB (RRID:SCR_021224) ready JavaScript object notation (JSON) documents. Namely, there are three key components provided within the code repository: (1) the VCF annotation and processing framework, together with the relevant software and scripts; (2) a sample set of annotated VCFs that can be used as a starting point for a SARS-CoV-2 iSNV database; and (3) an R Shiny¹¹⁶ (RRID:SCR_001626) app to facilitate a graphical user interface (GUI) for the interactions and quick summaries of the data within the database (Figure 5). The fields to query the cov2db database, such as annotation and variant information, are listed in the readme on our GitHub page. All of the above can be used to spin up an independent instance of cov2db and provide a user interface to interact with it. Minimal system requirements for a local cov2db instance are dictated by the mongoDB requirements, with the key limiting factor being RAM used. Large variant databases will consume substantial amounts of RAM, and we suggest hosting those on dedicated high memory compute servers. cov2db can run on x86 *nix-style platforms as is. We have not tested the software on ARM architectures or Windows based hosts. End users can interact with a hosted database from any web browser.

Figure 5. Cov2db workflow architecture.

User provided variant call format (VCF) (or iVar output) files are annotated and ultimately converted into JavaScript object notation (JSON). The resulting JSON files serve as the primary input into the database. Secondary input can be provided by supplying any relevant metadata with the sample accession numbers serving as key. The resulting database can be queried directly via mongoDB command-line interface (CLI) or summarized and presented visually via the corresponding R Shiny app. AWS: Amazon web services.

Our current design supports input VCFs generated by LoFreq¹¹⁷ (RRID:SCR_013054) or converted into VCFs from the iVar¹¹⁸ output via provided script. These files are subsequently annotated with snpEff¹¹⁹ (RRID:SCR_005191) using the SARS-CoV-2 reference, and resulting information is recorded as an annotated VCF. Finally, we provide an additional script to convert the annotated VCFs into JSON files that can be directly integrated into the mongoDB database. Metadata intake for the database is separate, and linking between the metadata for the samples and the variant call data is done within the database via the accession number keys.
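To make the VCF-to-JSON conversion concrete, the following minimal sketch turns one VCF record into a JSON-serializable document keyed by sample accession. It is not the cov2db conversion script: the function name and the document field names (`accession`, `chrom`, `pos`, `ref`, `alt`, `info`) are hypothetical and do not reflect the actual cov2db schema, and the accession used in the demo is a placeholder.

```python
import json


def vcf_record_to_document(line, accession):
    """Convert one tab-separated VCF record into a dict ready for JSON export."""
    chrom, pos, _id, ref, alt, _qual, _filter, info = line.rstrip("\n").split("\t")[:8]
    info_fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # INFO flags have no "=value" part; store them as True.
        info_fields[key] = value if sep else True
    return {
        "accession": accession,  # links the variant to its sample metadata
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt,
        "info": info_fields,
    }


if __name__ == "__main__":
    record = "NC_045512.2\t241\t.\tC\tT\t100\tPASS\tDP=50;AF=0.25"
    # One JSON document per line, one document per variant.
    print(json.dumps(vcf_record_to_document(record, accession="SAMPLE_0001")))
```

Keeping the accession inside every variant document is what allows the variant data to be joined to separately loaded sample metadata within the database.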

K-var

As a proof of concept for K-var,¹⁴⁷ we used whole exome sequencing of the NCI-60 dataset, a panel of 60 different human tumor cell lines widely used for the screening of compounds to detect potential anticancer activity (Figure 6). k-mer frequencies were obtained for each sample, using the tool Jellyfish version 2.3.0. First, counts of k-mers of size 31 were obtained with jellyfish count. Using a custom script, k-mers sequence and counts were tabulated to facilitate downstream analyses. The frequency distribution was plotted using R v3.6.3¹²⁰ (RRID:SCR_000432), and low frequency k-mers likely arising from sequencing errors were removed. We measured the relevance of k-mers to the condition using TF-IDF (term frequency-inverse document frequency) with pre-defined control and test datasets. k-mers significantly correlated to the disease are extracted using logistic regression followed by ranking and/or classification of the significant k-mers. The genomic positions of the disease associated k-mers were identified and these positions were run through the ensembl-VEP pipeline to detect probable biological consequences.
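The TF-IDF weighting can be illustrated with a small sketch that treats each sample as a "document" and each k-mer as a "term". This is not the K-var code. It assumes the common formulation TF = count / total k-mer count in the sample and IDF = ln(N / number of samples containing the k-mer); other TF-IDF variants exist, and the text does not state which one K-var uses.

```python
import math


def tf_idf(sample_counts):
    """Return {sample: {kmer: TF-IDF score}} from per-sample k-mer counts.

    sample_counts maps a sample name to a dict of observed k-mer counts.
    """
    n_samples = len(sample_counts)
    # Document frequency: number of samples in which each k-mer occurs.
    doc_freq = {}
    for counts in sample_counts.values():
        for kmer in counts:
            doc_freq[kmer] = doc_freq.get(kmer, 0) + 1
    scores = {}
    for sample, counts in sample_counts.items():
        total = sum(counts.values())
        scores[sample] = {
            kmer: (count / total) * math.log(n_samples / doc_freq[kmer])
            for kmer, count in counts.items()
        }
    return scores


if __name__ == "__main__":
    # Toy 3-mers standing in for 31-mers.
    demo = {
        "control": {"AAA": 5, "AAC": 5},
        "test": {"AAA": 5, "TTT": 15},
    }
    for sample, kmer_scores in tf_idf(demo).items():
        print(sample, kmer_scores)
```

A k-mer present in every sample receives a score of zero, so the weighting emphasizes k-mers that are abundant in some samples and absent from others, which are the candidates passed on to the regression step.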

Figure 6. K-var workflow.

The k-mer composition of whole-genome sequencing (WGS) data from cases and controls is obtained using Jellyfish. Rare and common k-mers are identified based on their frequency across samples, and mapped to a reference genome to assess their putative functional impact. Selected k-mers are then compared between cases and controls using term frequency-inverse document frequency (TF-IDF) statistical modeling to evaluate association with the phenotype of interest. As a proof of concept, K-var was implemented using cancer samples from the NCI-60 dataset.

Imavirus

There is an abundance of public high-throughput sequencing data (e.g. via the National Center for Biotechnology Information Sequence Read Archive). Some integrating viruses can produce run-on transcripts or may participate in trans-splicing between virus exon and downstream host exons.¹²¹ Others have shown that it is possible to identify integration events by identifying chimeric reads in single-end and paired-end short-read sequencing, as well as long-read sequencing.⁶⁵^,⁶⁷^–⁷⁵ However, available large public datasets have not yet been interrogated with current iterations of mapping.¹⁴⁸

We sought to do so by scoping out the available data and exploring at least one control dataset. We then generated a non-exhaustive list of relevant human pathogenic viruses and evaluated tools for unbiased interrogation of paired-end short-read data. Minimap2 version 2.22,³² HISAT2 version 2.2.1¹²² (RRID:SCR_015530), and STAR version 2.7.9a¹²³ (RRID:SCR_004463) were evaluated on paired-end short-read RNA-seq from the Tg26 mouse model with HIV believed to be inserted as a transgene. Minimap2 did not work for visual exploration by default, possibly because it treats paired-end reads as single-end. Mapped reads were viewed in IGV colored by orientation and with “view as pairs” selected. HISAT2 and STAR, both split-read mappers, worked to identify at least one previously identified insertion site on mouse chr8.¹²⁴ Finally, we refined this approach using human plus individual virus genomes (Figure 7).
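One signature that can be looked for in paired-end alignments against a combined host plus virus reference is a read pair with one mate on a host chromosome and the other on a viral contig. The sketch below scans SAM records for such pairs. It is an illustration rather than the Imavirus pipeline; the function name and the viral contig name `HIV1` in the demo are hypothetical, and split (chimeric) reads within a single mate are not handled.

```python
def host_virus_pairs(sam_lines, virus_names):
    """Return (read, host_ref, host_pos, virus_ref, virus_pos) per chimeric pair.

    Each pair is reported once, from the record of the host-mapped mate.
    """
    hits = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        qname, flag, rname, pos = fields[0], int(fields[1]), fields[2], int(fields[3])
        rnext, pnext = fields[6], int(fields[7])
        if not flag & 0x1:
            continue  # not a paired read
        if flag & (0x4 | 0x8):
            continue  # read or mate unmapped
        if flag & (0x100 | 0x800):
            continue  # secondary or supplementary alignment
        mate_ref = rname if rnext == "=" else rnext
        if rname not in virus_names and mate_ref in virus_names:
            hits.append((qname, rname, pos, mate_ref, pnext))
    return hits


if __name__ == "__main__":
    demo = [
        "@HD\tVN:1.6",
        "r1\t97\tchr8\t1000\t60\t100M\tHIV1\t500\t0\tACGT\tIIII",
        "r1\t145\tHIV1\t500\t60\t100M\tchr8\t1000\t0\tACGT\tIIII",
        "r2\t99\tchr8\t2000\t60\t100M\t=\t2200\t300\tACGT\tIIII",
    ]
    print(host_virus_pairs(demo, virus_names={"HIV1"}))
```

Clusters of such pairs at nearby host positions would mark candidate junctions for visual inspection, for example in IGV.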

Figure 7. Imavirus workflow.

To scope out the samples relevant for viral integration studies, human viruses known to integrate were chosen, along with viruses believed not to integrate (negative control set). Not shown: a dataset expected to contain human immunodeficiency virus (HIV) sequence (Tg26) and to express HIV protein, which was used as a positive control for pipeline development. Sequence Read Archive (SRA) was evaluated for the presence of RNA-seq (expression) and DNA-seq (host genomic DNA) from relevant viruses. A generic pipeline was evaluated on the positive control dataset with the goal of processing viral samples in SRA. Future work would also evaluate identified insertion/integration sites for possible clinical relevance. (GO: Gene Ontology).

The mouse model used includes two “insertion sites” on chr8, one on chr18, two on chrX, and a camouflaged one on chr4 embedded in a LINE element (the last site validated by long-read sequencing and deep paired-end 150 genomic DNA sequencing). These sites segregated together when multiple animals were genotyped and sequenced.¹²⁵^,¹²⁶ This behavior is suggestive of a yet to be defined complex structural variation encompassing multiple HIV transgene “copies” together with parts of different mouse chromosomes. The Tg26 HIV-1 transgenic mouse model⁶¹ illustrates the current limitations of using short-read sequencing, which may only capture virus:host junctions (insertion/integration half-sites) in the absence of recapitulating the entire insertion site unambiguously. When deriving putative viral integration sites from RNA-seq, sites may be more likely to be detected if coming from highly expressed loci.

RPG

RPG¹⁴⁹ is a scalable and easy to apply pipeline that takes an input genome assembly (FASTA format) and gene annotations (GFF3 format), and outputs reference panels based on the 1000 Genomes Project (1KGP) common allele calls and those abnormally avoided stop codons. Currently, the RPG pipeline is tested on the T2T-CHM13 genomic data set provided by the T2T consortium in an effort to provide high-quality reference panels for diverse sequencing data analysis. The generation of this panel is described in Figure 8 and the accompanying figure legend.

Figure 8. Overview of the reference panel generator pipeline for revising CHM13 reference panel.

CHM13 genome sequence (FASTA), gene annotations (GFF3), and combined 1000 Genomes Project single nucleotide variants (SNVs) and insertion/deletion (INDEL) call sets in variant call format (VCF) are retrieved from the Amazon-AWS¹²⁷ (RRID:SCR_012854) cloud. Only common alleles (>5% allele frequency (AF)) in the variant call set are retained. The ClinVar¹²⁸ database was used to annotate variant calls with any clinical significance. Subsequently, common allele calls are replaced with CHM13 rare alleles in CHM13 FASTA genome sequence. Finally, in-frame stop-codon sites are screened out of the genome sequence in order to generate the final reference panel files in FASTA format.
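Two of the steps in this legend lend themselves to a short illustration: retaining only common alleles and locating in-frame stop codons. The sketch below is not the RPG pipeline. The function names and the variant representation (a dict with an `af` key) are hypothetical, the 5% cutoff is taken from the legend, and the standard nuclear genetic code (TAA, TAG, TGA) is assumed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard nuclear genetic code (assumed)


def common_variants(variants, min_af=0.05):
    """Keep variants whose allele frequency exceeds min_af (>5% in the legend)."""
    return [v for v in variants if v["af"] > min_af]


def premature_stops(cds):
    """Return 0-based codon indices of in-frame stop codons before the last codon.

    cds is a coding sequence read in frame from its first base; a stop codon
    in the final position is the expected terminator and is not reported.
    """
    cds = cds.upper()
    n_codons = len(cds) // 3
    return [
        i for i in range(n_codons - 1)
        if cds[3 * i:3 * i + 3] in STOP_CODONS
    ]


if __name__ == "__main__":
    calls = [
        {"pos": 101, "ref": "A", "alt": "G", "af": 0.40},
        {"pos": 250, "ref": "C", "alt": "T", "af": 0.01},
    ]
    print(common_variants(calls))
    # Codons: ATG TAA GGG TGA -> the TAA at codon index 1 is premature.
    print(premature_stops("ATGTAAGGGTGA"))
```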

The resulting T2T genome is more complete (i.e. has filled gaps in its genomic sequence) than previously available GRCh38 releases. It further harbors 1KGP common alleles and avoids stop codons. Such a T2T genomic sequence can be utilized in the ‘read mapping’ and ‘variant calling’ steps while processing whole genome sequencing (WGS) data and has important applications in improving structural variant identification. The output files generated by the RPG pipeline are available in the GitHub repository (Software availability section) along with supplementary pre-processing scripts.