Software Tool Article

The fifth international hackathon for developing computational cloud-based tools and resources for pan-structural variation and genomics

[version 1; peer review: 2 approved with reservations]
PUBLISHED 27 Jun 2024

This article is included in the Hackathons collection.

Abstract

Background

The goal of the Fifth Annual Baylor College of Medicine & DNAnexus Structural Variation Hackathon was to push forward the research on structural variants (SVs) by rapidly developing and deploying open-source software. The event took place in-person and virtually in August 2023, when 49 scientists from 14 countries and 8 U.S. states collaboratively worked on projects to address critical gaps in the field of genomics. The hackathon projects concentrated on developing bioinformatic workflows for the following challenges: RNA transcriptome comparison, simulation of mosaic variations, metagenomics, Mendelian variation, SVs in plant genomics, and assembly vs. mapping SV calling comparisons.

Methods

As a starting point we used publicly available data from state-of-the-art long- and short-read sequencing technologies. The workflows developed during the hackathon incorporated open-source software, as well as scripts written using Bash and Python. Moreover, we leveraged the advantages of Docker and Snakemake for workflow automation.

Results

The results of the hackathon consist of six prototype bioinformatic workflows that use open-source software for SV research. We made the workflows scalable and modular for usability and reproducibility. Furthermore, we tested the workflows on public example data to demonstrate that they function as intended. The code and the data produced during the event have been made publicly available on GitHub (https://github.com/collaborativebioinformatics) to be reproduced and built upon in the future.

Conclusions

The following sections describe the motivation, lessons learned, and software produced by teams during the hackathon. Here, we describe in detail the objectives, value propositions, implementation, and use cases for our workflows. In summary, the article reports the advancements in the development of software for SV detection made during the hackathon.

Keywords

SVs, k-mers, RNASeq, Metagenomics, Mosaic, Long-reads, Hackathon, NGS

Introduction

Structural variants (SVs) are large genomic variations, at least 50 bp in size, occurring in the form of insertions (INS), deletions (DELs), inversions (INVs), duplications (DUPs), and inter-chromosomal translocations.1–3 Recent discoveries have shown SVs to exhibit clinical relevance for multiple diseases beyond classical Mendelian diseases, such as multiple types of cancer,4 neurodevelopmental,5 and cardiovascular disorders.6 Nevertheless, despite the advances in next-generation sequencing technologies, detection and evaluation of SVs are still plagued by high false positive and false negative rates, along with inaccurate breakpoint predictions, due to the complex nature of mutations and inherent sample heterogeneity.7,8 Third-generation sequencing technologies provided by Pacific Biosciences,9 Oxford Nanopore Technologies,10 optical mapping,11 and NanoString,12 as well as new short-read technologies, provide exciting new tools to enhance SV detection. However, these advancements require new bioinformatic methods before they can fulfill their potential for understanding the relationship between these variants and various phenotypes. Accordingly, the objective of the Fifth Baylor College of Medicine & DNAnexus hackathon was to propose and develop novel bioinformatic tools and workflows to improve the use of SV data in disease modeling.

At the Fifth Baylor College of Medicine & DNAnexus hackathon in August 2023, 49 scientists from 14 nations (see Figure 1) participated in person and remotely, focusing on different topics in SV-related research.


Figure 1. Hackathon participants’ locations.

Participants from around the world for the Fifth Annual Baylor College of Medicine & DNAnexus Structural Variation hackathon.

Overall, this manuscript details our tools’ objectives, value-add, implementations, and applications to set the foundation for further concept development. In this article we present six software workflows that were the result of this hackathon.

IsoComp: Comparing isoform composition between cohorts using high-quality long-read RNA-seq

Alternative splicing is one of the most complicated cellular processes. The joining or skipping of exons from the same gene in various combinations leads to different, but related, mRNA transcripts (isoforms). These mRNA isoforms can be translated to produce diverse proteins with distinct structures and functions. It has been shown that changes in isoform diversity can affect the phenotype, potentially leading to diseases such as cancer or neurodegenerative disorders. Novel isoforms have also been associated with cancer, including novel variants of oncogenes. For instance, novel isoforms and the usage of alternate promoters were found in cell lines in subtypes of gastric cancer.13 Moreover, a positive correlation was found between the presence of particular isoforms of Alzheimer’s disease-associated proteins and pathological severity.14 These findings might help with the development of specific diagnostic biomarkers.

Next Generation Sequencing (NGS), especially transcriptome sequencing (RNA-seq) with short or long reads, characterizes gene expression at the isoform level and sheds light on biological processes inside the cell, as well as on how the cell responds to changes in the environment. Short-read sequencing is limited, providing low confidence in predicting alternative splicing, novel exons, and junction sites, as well as in characterizing complete isoforms.15,16 Long-read sequencing, however, is expected to overcome these inherent limitations. An analysis of long-read sequencing data using multiple variant calling algorithms to detect insertions, deletions, inversions, and duplications found nearly 2.5 times more SVs than short-read data, with ~83% of insertions missed by short reads.17 Long-read sequencing technologies, e.g., PacBio High-Fidelity (HiFi) and Oxford Nanopore, have been shown to lower error rates and to provide high-quality full-length reads that cover the entire length of all isoforms, removing any uncertainty in determining exon composition.

There exist tools that compare sets of short-read transcript annotations, including GffRead (RRID:SCR_018965) and GffCompare,18 AGAT,19 and bedtools20 (RRID:SCR_006646). However, to our knowledge, no tool currently exists which takes advantage of the long-read sequencing data to compare inferred isoforms.

Here we present IsoComp, a bioinformatics pipeline to identify differences in isoform expression between individuals using long-read RNA-seq data. IsoComp can be used to make comprehensive comparisons of isoform profiles between multiple samples, with applications in, e.g., trio sequencing or subtyping of cancer cell lines. We have tested IsoComp on GIAB (Genome in a Bottle)21 samples (NA24385, NA26105, NA27730) to demonstrate its potential use for distinct isoform detection and comparison.

SpikeVar & TykeVar: Mosaic variants simulation

In the context of individual genome comparison, mutations that appear with very low minor allele frequencies are referred to as rare variants.22 Similarly, mosaicism is observed when unique genetic differences arise at low frequencies within the population of cells in a tissue of an individual, giving rise to mosaic variants (MVs).23 Recent studies have shown potential disease implications for certain MVs.23 Since MVs can have low allele frequencies and are mixed in with data from non-mutated cells in the same sequencing output file, they can be challenging to detect, as they may appear as noise in the NGS data. Therefore, several pipelines have been developed or adjusted to extract mosaic single nucleotide, structural, or indel variants from whole-genome sequencing data, including Sniffles11 (RRID:SCR_017619), DeepMosaic,24 Mutect2,25 and DeepVariant.26 To benchmark and validate the efficiency and accuracy of these methods, we developed two workflows called SpikeVar (Spike in Known Exogenous Variants) and TykeVar (Track in Your Key Endogenous Variants). As input, SpikeVar takes data from two different sample sets and mixes them at a user-defined ratio. In contrast, TykeVar generates random mutations in the reference genome that are then introduced into existing sample data at a user-defined ratio. Both methods generate sporadic events with potentially low allele frequencies in the resulting dataset.

The Pseudomonas Graph Genome Project (PGGP): Impact of bacterial diversity on alignment accuracy for antibiotic resistant organisms

Species evolve continuously, which causes drastic shifts in their genome architecture over time. Considering a single genome as a reference for different studies can therefore introduce biases. Bacterial evolution in particular involves different mechanisms such as homologous recombination, horizontal gene transfer (HGT), and mutation.27 Conventional comparative genomic analyses that rely solely on linear reference sequences can introduce reference biases and potentially disregard the spectrum of population or strain diversity. To mitigate these limitations, pangenomes built with a graph genome approach have been proposed, encompassing the entire genomes of different strains of species under one clade. This approach increases the accuracy of analyses by innately accounting for the fact that different haplotypes can drastically change the results of a study involving that species.28

Pseudomonas aeruginosa (P. aeruginosa) is a Gram-negative bacterium and an opportunistic pathogen that has been deeply studied due to its significant role in causing serious health concerns in humans. P. aeruginosa presents high genome plasticity, possessing a significant assortment of genes acquired by HGT. These genes are frequently localized within integrons and mobile genetic elements, such as transposons, insertion sequences, genomic islands, phages and plasmids. This genomic diversity results in a non-clonal population structure, and consequently in highly variable strain phenotypes concerning virulence, drug resistance and morbidity. Consequently, this makes P. aeruginosa a prime candidate for a pan-genome graph approach.29,30

Here, we combined several in silico processes to create a graph genome of P. aeruginosa, intended for open access within the scientific community. We also compared read alignment between a graph genome and a standard linear genome approach.

SalsaValentina: Verification of de-novo SVs from trios

Mendelian inconsistencies are identified when a child has a genotype that is not possible given the genotypes of the parents, for example, when a child is homozygous for an allele that does not exist in either parent. Mendelian inconsistency in SV calls can indicate two possibilities: challenges in SV calling leading to false positive or negative calls across the trio, or a genuine de novo SV. De novo SVs are rare, with an estimated rate of 0.16 de novo SVs per genome in healthy individuals.31 Despite their rarity, de novo SVs have been associated with human disease, including Autism Spectrum Disorder, Pulmonary Alveolar Proteinosis and Alzheimer’s disease.32–35 In addition, benchmarking studies have used the rarity of de novo SVs to support the validity of their SV calls under the assumption that calls inconsistent with Mendelian inheritance are likely incorrect.36–38

We present SalsaValentina, a pipeline which identifies putative de novo SVs based on Mendelian inconsistency, and subsequently validates them using a local genome assembly of the region in question. SalsaValentina could assist in diagnosing variants underlying rare diseases and inform other strategies for more accurate SV calling.

PhytoKmerCNV: Assembly-free gene copy number estimates from k-mer frequencies in whole-genome sequencing reads

Copy number variation (CNV) is a common form of SV polymorphism where segments of DNA are either duplicated or deleted when compared to a reference genome.39 CNV plays a pivotal role in genome evolution40 and has been associated with phenotypic diversity,41 including human diseases.42 To date, the predominant strategies to detect CNVs using whole-genome sequencing data have relied upon analyzing the distribution of mapping coverage across the genome and identifying regions with outlying coverage compared to the background.43 However, these coverage-based approaches are susceptible to ascertainment bias because they can only detect CNV of sequences present in the reference assembly, failing to capture the complete spectrum of CNV within a population. Furthermore, coverage-based CNV detection methods depend on high-quality genome assemblies, which are not available for many non-model systems.

As an alternative approach, here we present PhytoKmerCNV, a tool for estimating the copy number of specific sequences using k-mer frequencies derived from whole-genome sequencing reads. Our approach compares the k-mer frequency distribution of reads originating from the sequences of interest to the distribution of frequencies calculated from all sample reads.

SV-Genie: Mapping- vs. assembly-based SV calling evaluation

Recent studies have shown that SVs are widely present in human genomes. They shape genome evolution and play important roles in human health and disease by changing protein-coding regions, cis-regulatory elements, and gene expression profiles.44 While a number of NGS-based SV detection tools have been developed in the past few years, it remains unclear how well these tools perform for the detection of SVs.45

Currently available SV-calling tools roughly fall into three groups depending on the type of input data used in the SV detection step: mapping-based, assembly-based, and mapping-free methods. Mapping-based approaches use SV-related alignment features such as soft-clipped read ends, alignment breakpoints, and discordant mates to detect SVs.46 For assembly-based methods, reads can either be assembled directly into contigs (“global assembly”), e.g., in DISCOVAR47 (RRID:SCR_016755), or first aligned to a reference genome with the reads aligning to each region assembled into contigs (“local assembly”); the contigs (and the corresponding reads) are then aligned to the reference genome for SV calling.48 The mapping-free approach checks the genomic signatures (e.g., k-mers) of known SVs directly in the raw NGS reads, as is done in Nebula.49 The mapping-based approach has a number of advantages, including a low computing resource requirement and a shorter run-time in most cases. By contrast, the assembly-based approach generally requires substantially more computing resources and input sequencing data, but can take advantage of the longer input sequences from the assembled contigs and could, theoretically, perform better for SV detection. The mapping-free approach is computationally efficient, but is limited to the genotyping of known SVs and will not be discussed further.

The performance of SV calling tools also depends on the targeted regions and read lengths. While it is generally believed that long-read-based SV calling tools have better sensitivity, a recent study showed that most large SVs in cancer can be detected without using long reads.50 The targeted regions for SV calling can range from small gene panels to whole exome sequencing to whole-genome sequencing. Targeted panel sequencing and whole exome sequencing currently dominate clinical NGS testing due to the cost advantages of focusing on protein-coding regions. However, a genetic diagnosis by whole exome sequencing can only be made in 25-50% of cases.51 On the other hand, whole-genome shotgun sequencing (WGS) data has more uniform coverage and provides a comprehensive view across coding, non-coding, and intergenic regions. As sequencing costs continue to drop, whole-genome sequencing is becoming increasingly cost-effective and can serve as a great starting point for detecting SVs and other genetic changes.

The goal of the SV-Genie project was to develop a generalized framework to evaluate the performance of SV calling tools on WGS short-read datasets. Specifically, we use the Illumina short-read dataset for GIAB HG002 (ASJ son) as input, run a number of mapping-based and assembly-based SV calling tools, and compare the results with the GIAB HG002 SV dataset to gain insights into the performance difference between mapping-based and assembly-based approaches for the detection of SVs.

Methods

Implementation

IsoComp: Comparing isoform composition between cohorts using high-quality long-read RNA-seq

The IsoComp pipeline identifies and compares distinct isoforms across multiple samples. Before running the IsoComp pipeline, GTF files need to be created and pre-processed to serve as input to IsoComp. The IsoComp algorithm is outlined in Extended data.

Iso-Seq analysis

The first step is to create GTF files to subsequently serve as input to IsoComp. First, demultiplexed HiFi reads (Q20, single-molecule resolution) from lima (https://github.com/pacificbiosciences/barcoding/) were processed using IsoSeq3 v3.2.2 (https://github.com/PacificBiosciences/IsoSeq). Next, the transcripts were mapped against the GRCh38 (v33 p13) reference genome using Minimap252 (RRID:SCR_018550) (v2.24-r1122; command: minimap2 -t 8 -ax splice:hq -uf --secondary=no -C5 -O6,24 -B4 GRCh38.v33p13.primary_assembly.fa sample.polished.hq.fastq.gz). Then, cDNA_cupcake (v28.0.0) (https://github.com/Magdoll/cDNA_Cupcake) was used to filter redundant isoforms out of the BAM file, followed by filtering out isoforms with counts below 10 and discarding 5’-degraded isoforms as they are not biologically significant. Afterwards, SQANTI3 v5.053 was used to generate the final FASTA transcripts and GFF files, as well as the isoform classification reports. External databases, including a reference data set of transcription start sites (refTSS), a list of polyA motifs, the tappAS annotation, and the GENCODE GRCh38 annotation, were utilized during the isoform classification by SQANTI3.53 Finally, IsoAnnotLite (v2.7.3) (https://isoannot.tappas.org/isoannot-lite/) was used to annotate the GTF files obtained from SQANTI3.53 The workflow, shown in Extended data, outlines each step to generate the GFF files from individual samples of HG002, a necessary pre-processing step for subsequent comparisons using IsoComp.

Isoform clustering and comparison

In each GTF file created in the Iso-Seq analysis step, the ‘source’ column is replaced with the base filename (no extension) of the file. Next, the GTF files are converted into PyRange objects, which are filtered on the ‘transcript’ feature. Then, clusters of transcripts with overlapping coordinate ranges (from the start to the end of each transcript) are determined. Clusters are sequentially numbered, so that each cluster comprises a discrete group of transcripts with overlapping coordinate ranges.

Next, within each cluster, isoforms are compared against one another based on coordinate overlap. If a cluster contains only one isoform, it is reported as unique. Isoforms with only partial overlap of coordinate ranges within a cluster are also reported as unique, whereas isoforms with identical coordinate ranges undergo a pairwise sequence comparison. The sequence comparison step allows for the detection of variability among isoforms within each cluster.
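The clustering and comparison logic above can be sketched in plain Python (a simplified, single-chromosome illustration with hypothetical transcript records; the pipeline itself operates on PyRanges objects built from the GTF files):

```python
def cluster_transcripts(transcripts):
    """Group transcripts whose (start, end) coordinate ranges overlap into
    sequentially ordered clusters (simplified, single-chromosome sketch)."""
    # Sort by start coordinate so overlapping ranges become adjacent.
    ordered = sorted(transcripts, key=lambda t: t["start"])
    clusters, current, cur_end = [], [], None
    for t in ordered:
        if current and t["start"] <= cur_end:
            current.append(t)                     # extends the current cluster
            cur_end = max(cur_end, t["end"])
        else:
            if current:
                clusters.append(current)
            current, cur_end = [t], t["end"]      # start a new cluster
    if current:
        clusters.append(current)
    return clusters

def classify(clusters):
    """Report isoforms as 'unique' (singleton cluster or partial overlap)
    or flag identical-range isoforms for pairwise sequence comparison."""
    report = {}
    for cluster in clusters:
        for t in cluster:
            if len(cluster) == 1:
                report[t["id"]] = "unique"
            else:
                identical = [u for u in cluster if u is not t
                             and (u["start"], u["end"]) == (t["start"], t["end"])]
                report[t["id"]] = "compare_sequence" if identical else "unique"
    return report
```

For example, two transcripts spanning 100-200 and 150-250 fall into one cluster but are reported as unique (partial overlap), while two transcripts both spanning 300-400 are flagged for sequence-level comparison.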

The workflow of the Isoform clustering and comparison step is presented in Figure 2.


Figure 2. The IsoComp workflow.

SpikeVar & TykeVar: Mosaic variants simulation

SpikeVar

The SpikeVar workflow outputs a mixed sequencing read dataset in the BAM format, containing reads from one dominant sample and reads from another sample spiked in at a user-defined ratio corresponding to the simulated mosaic variant allele frequency (VAF), plus a VCF file annotating the confirmed mosaic variant locations within the mixed dataset (Figure 3(i)). The SpikeVarDatasetCreator takes aligned sequencing reads from sample A and sample B as input. In this step, a spike-in methodology is applied to strategically introduce x% of mutations from one sample into another using the samtools54 (RRID:SCR_002105) view -s option. Accordingly, sample A is first down-sampled to retain (100-x)% of its original reads; then sample B is down-sampled to x%, taking into account the coverage differences between the samples. Using the samtools merge command, both down-sampled datasets are then merged to create a mixed dataset that represents a sequence read dataset with mosaic variants, including SVs, single nucleotide variations (SNVs), and insertions/deletions (indels).
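One plausible reading of the down-sampling arithmetic described above can be sketched as follows (our simplified interpretation of the coverage correction; the function name and variables are ours, and the actual workflow applies the resulting fractions via samtools view -s):

```python
def spike_in_fractions(spike_pct, coverage_a, coverage_b):
    """Return the fractions of reads to retain from sample A (dominant)
    and sample B (spike-in) so that sample B contributes roughly
    spike_pct percent of the mixed dataset, correcting for the
    coverage difference between the two input BAMs.

    spike_pct  -- desired spike-in percentage x (e.g. 5 for a ~5% VAF)
    coverage_a -- mean coverage of sample A's BAM
    coverage_b -- mean coverage of sample B's BAM
    """
    x = spike_pct / 100.0
    keep_a = 1.0 - x                      # retain (100 - x)% of sample A
    # Target depth for B is x% of sample A's coverage, expressed as a
    # fraction of B's own coverage (capped at keeping all of B's reads).
    keep_b = min(1.0, (x * coverage_a) / coverage_b)
    return keep_a, keep_b

# Example: spike 5% of a 60X sample B into a 30X sample A.
frac_a, frac_b = spike_in_fractions(5, 30, 60)
# frac_a = 0.95; frac_b = 0.025 (5% of 30X is 1.5X, i.e. 2.5% of 60X)
```

The resulting fractions would then be passed to samtools view -s for each BAM before merging.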


Figure 3. The Mosaic Variants Simulation workflow.

i) SpikeVar workflow and ii) TykeVar workflow, with major steps to assess the sensitivity and accuracy of the mosaic variant callers. (A, B: individual samples, A/B: merged samples, .bam and .vcf: input and output file formats in different steps, Black header boxes: tool or file names, Green header boxes: simulated files or final files used for validation comparisons.)

The SpikeVarReporter then determines VAFs for each variant in the mixed dataset, with the method depending on the variant type and sequencing technology: samtools mpileup for SNVs, Sniffles211 (RRID:SCR_017619) for SVs in long reads, and Paragraph55 for SVs in short reads. VAFs are computed at the mixed variant locations derived by merging the VCF files from sample A and sample B using samtools.54 Variants with VAFs exceeding or equal to the introduced mutation rate (i.e., x%) are then selected to create a truth set for benchmarking using bcftools54 (RRID:SCR_005227).

To assess a mosaic variant caller’s sensitivity and accuracy, the same mixed dataset is used to call mosaic variants. The output mosaic variant locations and VAFs are then compared to the truth set for validation.

TykeVar

The TykeVar workflow produces a modified aligned sequence file in the BAM format (Figure 3(ii)). This file contains modified reads simulating mosaic variants at random locations with user-defined VAFs, and is accompanied by a VCF file listing the locations of the simulated mosaic variants.

The TykeVar workflow can be broadly split into 3 parts:

  • 1) The TykeVarSimulator takes an aligned BAM file, a reference, and several parameters (such as the range of VAFs and variant sizes) to generate a set of simulated mosaic SVs and SNVs. It does so by choosing a random location and a VAF from the given range and then evaluating whether that location has sufficient coverage for the desired VAF. If that condition is met, the variant is added to the output VCF file.

  • 2) The TykeVarEditor is responsible for inserting the simulated variants into the query sequences from the original dataset to generate modified reads with the mosaic variants built-in. The TykeVarEditor accepts a BAM file, a reference, and the simulated VCF file as input. Then, for each variant, it fetches the overlapping reads from the BAM file, subsamples the reads to get the coverage that satisfies the desired VAF, and traverses the CIGAR string, query, and reference sequences for each alignment to find the exact location to insert the variant. Once a modified read is created, it is written out into a FASTQ file. Note that for all new bases (SNVs or inserts), a q-score of 60 is chosen.

  • 3) The TykeVarMerger re-introduces the modified reads into the original dataset. It does so by first removing the modified read IDs from the input BAM file to create a filtered BAM file. Then, the modified reads are aligned against the reference, and merged with the filtered BAM file. The end result is a BAM file with the same set of read IDs as the original dataset, except for some reads modified to contain the mosaic variants.

The output of this pipeline is thus a modified BAM file and a VCF file which provides the truth set for the mosaic variants.
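The CIGAR traversal in step 2 can be sketched as follows (a simplified illustration; the function names are ours, only a subset of CIGAR operations is handled, and the real editor also manages base qualities and the FASTQ output):

```python
import re

def ref_to_query_pos(cigar, ref_start, target_ref_pos):
    """Walk a CIGAR string to translate a reference coordinate into the
    corresponding offset in the read's query sequence. Returns None if
    the position falls inside a deletion/skip or outside the alignment."""
    q, r = 0, ref_start
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op in "M=X":            # consumes both query and reference
            if r <= target_ref_pos < r + length:
                return q + (target_ref_pos - r)
            q += length
            r += length
        elif op in "IS":           # consumes query only
            q += length
        elif op in "DN":           # consumes reference only
            if r <= target_ref_pos < r + length:
                return None        # position is deleted from this read
            r += length
        # H and P consume neither query nor reference
    return None

def apply_snv(read_seq, query_pos, alt_base):
    """Substitute a single base in the read to simulate a mosaic SNV."""
    return read_seq[:query_pos] + alt_base + read_seq[query_pos + 1:]
```

For a read aligned at reference position 1000 with CIGAR 5S10M2D5M, reference position 1005 maps to query offset 10, while position 1010 falls inside the deletion and cannot be edited.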

The Pseudomonas Graph Genome Project (PGGP): Impact of bacterial diversity on alignment accuracy for antibiotic resistant organisms

The PGGP process involves two main simultaneous steps (Figure 4). For the pangenome approach, we initially constructed a graph genome by utilizing the accessible assemblies of Pseudomonas aeruginosa. This was accomplished using the pggb tool.56 Then we performed graph alignments with GraphAligner57 to align a dataset of short-read clinical isolates from the Sequence Read Archive (SRA)58 (RRID:SCR_004891) to our graph genome. Concomitantly, we downloaded a standard linear reference genome and performed read alignments with the same dataset used for the graph genome using BWA-MEM59 (RRID:SCR_010910). Finally, we compared alignment efficiency between the two approaches. The details of the implementation of PGGP can be found in our GitHub repository (https://github.com/collaborativebioinformatics/SVHack_metagenomics). The process consists of:

  • 1. Graph Approach

    • a. The graph genome construction: Out of 773 available assemblies, 499 complete P. aeruginosa genomes were downloaded from NCBI. Metadata for the included sequences is included in the linked GitHub repository - https://github.com/collaborativebioinformatics/SVHack_metagenomics/tree/2797b9eec54665258a67ef0277fbd4d06d4e26c7/assemblies_info. Pangenome graphs of varying sizes (5, 10, 20, 50, 100, 500 genomes) were created using the reference NC_002516.2 and complete genome assemblies such that all genomes in the smaller pangenome graphs are contained in the larger graphs. All pangenome graphs were created using the pggb tool (command: pggb -i assembly.fasta.gz -o . -t 16 -n 5 -p 90 -s 5000).56

    • b. Read mapping to the pangenome: 59 Illumina sequencing datasets including P. aeruginosa (NCBI Taxonomy ID 287) reads were downloaded from the Sequence Read Archive58 (metadata for these samples is included in the linked GitHub repository - https://github.com/collaborativebioinformatics/SVHack_metagenomics/tree/2797b9eec54665258a67ef0277fbd4d06d4e26c7/reads_info). The reference sequence NC_002516.2 (Genome assembly ASM676v1) derived from the PA01 strain was also downloaded. As with the single reference genome, reads were mapped to the pangenome graphs using GraphAligner (v1.0.10)57 from the Docker image jmonlong/job-graphaligner:latest. Default values related to seeding and extension were used (options: --seeds-minimizer-count 5 --seeds-minimizer-length 19 --seeds-minimizer-windowsize 30 --seeds-minimizer-chunksize 100 -b 5 -B 10 -C 10000).

    • c. Statistics related to alignment were extracted from the resulting GAM files using vg (1.23.0)60 (RRID:SCR_024369) from the Docker image biocontainers/vg.

  • 2. Linear Genome Approach

    • a. Reference genome: The reference sequence NC_002516.2 (Genome assembly ASM676v1) derived from the PA01 strain was downloaded and indexed using bwa index ref.fa.

    • b. Linear read alignment: An array of different genome assemblies of Pseudomonas aeruginosa was downloaded from the NCBI repository, including the full-length genomes of diverse strains isolated from different environments. A total of 59 Illumina sequencing datasets for P. aeruginosa (NCBI Taxonomy ID 287) were retrieved from the Sequence Read Archive (metadata for these samples is included in the linked GitHub repository - https://github.com/collaborativebioinformatics/SVHack_metagenomics/tree/2797b9eec54665258a67ef0277fbd4d06d4e26c7/reads_info). BWA-MEM (BWA-0.7.17 [r1188])59 was used for short-read alignment to the reference genome for all 59 datasets (command: bwa mem ref.fa read1.fq read2.fq > aln-pe.sam). Quality control statistics were obtained from the output files using Picard (2.26.11) (http://broadinstitute.github.io/picard) (RRID:SCR_006525).
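The final comparison of alignment efficiency between the two approaches can be sketched as a simple per-dataset summary (a minimal illustration; the function name, accession, and counts are hypothetical, and the actual statistics were extracted with vg and Picard):

```python
def compare_alignment_rates(stats):
    """Summarize per-dataset alignment rates for the graph (GraphAligner)
    and linear (BWA-MEM) approaches.

    stats -- dict mapping accession -> {"graph": (aligned, total),
                                        "linear": (aligned, total)}
    Returns accession -> (graph_rate_%, linear_rate_%, difference)."""
    out = {}
    for acc, counts in stats.items():
        rates = {k: 100.0 * aligned / total
                 for k, (aligned, total) in counts.items()}
        out[acc] = (rates["graph"], rates["linear"],
                    rates["graph"] - rates["linear"])
    return out

# Hypothetical counts for one SRA run:
summary = compare_alignment_rates(
    {"SRRxxxxxx": {"graph": (980_000, 1_000_000),
                   "linear": (940_000, 1_000_000)}})
```

A positive difference for a dataset would indicate that the pangenome graph captured strain-specific sequence absent from the single linear reference.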


Figure 4. The Pseudomonas Graph Genome Project (PGGP) workflow.

SalsaValentina: Verification of de-novo SVs from trios

True de novo SVs are expected to be rare; in practice, however, a high rate of inconsistent SVs will be identified, indicating false positives or negatives due to noise inherent in SV calling and merging. SalsaValentina creates a ‘naive’ de novo SV candidate list and provides a QC-framework tool that enables users to visualize the alignments in inconsistent SV regions across the trio and to create a local assembly of every de novo SV candidate locus, to aid in confirming the variant as either a de novo SV or an incorrect call.

SalsaValentina is an integrated pipeline for Mendelian inconsistency analysis of SVs. We demonstrate the pipeline using the Genome in a Bottle (GIAB) Ashkenazim trio (HG002 son, HG003 father & HG004 mother), sequenced on the Sequel II System with 2.0 chemistry and aligned to the GRCh38 genome reference. SVs are called using the Sniffles2 variant caller.11

To merge the SV calls into a single VCF file, two methods are compared: multi-sample SV calling using Sniffles211 and variant merging with SURVIVOR61 (RRID:SCR_022995) using default parameters (https://github.com/fritzsedlazeck/SURVIVOR). Each of the resulting merged VCFs is annotated for Mendelian inconsistencies using the mendelian plugin to bcftools62 (https://samtools.github.io/bcftools/howtos/plugin.mendelian.html). The positions of SVs inconsistent with Mendelian inheritance are extracted from the merged VCFs, and samplot63 is used to visualize the region of each variant in each member of the trio.
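The underlying Mendelian-consistency test can be illustrated with a minimal genotype check (a simplified sketch in our own notation; the pipeline itself relies on the bcftools mendelian plugin, and diploid genotypes here are encoded as unordered allele tuples):

```python
from itertools import product

def mendelian_consistent(child, father, mother):
    """Return True if the child's diploid genotype can be produced by
    inheriting one allele from each parent. Genotypes are unordered
    allele tuples, e.g. (0, 1) for a heterozygous SV call."""
    return any(sorted(child) == sorted((p, m))
               for p, m in product(father, mother))

# A heterozygous child of a hom-ref father and hom-alt mother is
# consistent with Mendelian inheritance...
assert mendelian_consistent((0, 1), (0, 0), (1, 1))
# ...while a child homozygous for an allele absent from one parent is
# flagged as a de novo SV candidate (or a miscalled variant).
assert not mendelian_consistent((1, 1), (0, 0), (0, 1))
```

Every variant failing this check across the trio enters the candidate list for visualization with samplot and local reassembly.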

To further investigate the validity of the candidate variants, Mendelian-inconsistent SVs were filtered to remove breakends (BNDs) and variants involving alternate contigs in GRCh38. The candidates from Sniffles11 multi-sample calling were ranked by coverage. The candidates from SURVIVOR61 merging were filtered for variants with a ratio of variant reads to total read coverage between 0.3 and 0.7, resulting in 48 top candidates. For top-ranking candidates, local assembly was performed by extracting reads aligned 50 kb upstream and downstream of the region of interest and assembling them with Hifiasm64 (command: --primary; to generate primary and alternate assemblies) (RRID:SCR_021069). The YASS65 Genomic Similarity Search Tool web server was used to create dotplots visualizing pairwise alignments of the resulting contigs to GRCh37 to verify the deletion in HG002. The SalsaValentina workflow is shown in Figure 5.


Figure 5. The SalsaValentina workflow.

PhytoKmerCNV

PhytoKmerCNV takes raw whole-genome sequencing reads in the FASTQ format as input and produces k-mer distributions for both the total sample and a captured subsample of sequencing reads. The captured reads correspond to sequences of interest in the genome, captured based on alignment to a protein database. Copy number estimates for the sequences of interest can then be derived by comparing summary statistics calculated from the respective k-mer distributions for captured versus total reads.

Prior to pipeline execution, the sequences of interest must be identified, converted to a protein FASTA file, and used to make a BLAST84 protein database. The pipeline then begins with raw sequencing reads in the FASTQ format as input, which can be either uncompressed or compressed with gzip. The sequencing reads are adapter- and quality-trimmed using fastp (0.23.4)66 with default parameters. The processed FASTQ file is then converted to the FASTA format using seqtk (1.4).67 The sequences are then queried against the protein database using blastx to identify reads putatively originating from the sequences of interest. The BLAST results are filtered to retain hits with a match length ≥ 20 and an E-value < 1. Reads with hits to the database are then extracted using samtools faidx. Canonical 21-mers are then counted in the matching reads, as well as in the full sample of reads. The sums of k-mer frequencies found in the matching reads and in the total sample were calculated using awk before being fed into an R script that derives copy number estimates from the calculated sums and plots the k-mer distributions.
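The final estimation step reduces to a ratio of k-mer frequency statistics, which can be sketched as follows (our simplification with hypothetical numbers; the pipeline performs these calculations with awk and an R script, and the single-copy depth would in practice be read off the genome-wide k-mer frequency distribution):

```python
def estimate_copy_number(captured_kmer_freqs, single_copy_depth):
    """Estimate the copy number of a captured sequence family from k-mer
    frequencies: the mean frequency of k-mers in the captured reads is
    divided by the single-copy sequencing depth (e.g. the main peak of
    the genome-wide k-mer frequency distribution)."""
    mean_freq = sum(captured_kmer_freqs) / len(captured_kmer_freqs)
    return mean_freq / single_copy_depth

# Hypothetical example: k-mers from the captured reads occur ~90x while
# single-copy k-mers across the whole sample peak at ~30x, suggesting
# roughly 3 copies of the sequence family.
copies = estimate_copy_number([88, 91, 90, 89, 92], 30)
```

Repeat-induced inflation of individual k-mer counts is one reason to inspect the full distributions (as the R script plots them) rather than rely on the ratio alone.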

As a practical example, we have developed this tool with the goal of estimating the copy number of resistance genes (R genes) for pathogen recognition in a collection of 32 resequenced tomato genomes.68 R genes encode proteins that recognize pathogen effectors in plants and are classified according to their domain organization, with nucleotide binding site (NBS) and extracellular leucine-rich repeat (LRR) domains.69,70 We selected R genes to test our tool because they evolve rapidly, with copy number variation observed both between and within plant species.71,72 The PhytoKmerCNV workflow is shown in Figure 6.

1e463c1d-871c-4044-8151-6979f65af6f2_figure6.gif

Figure 6. The PhytoKmerCNV workflow.

SV-Genie: Mapping- vs assembly-based SV calling evaluation

To evaluate the performance of mapping-based and assembly-based SV calling methods, we selected a number of SV calling tools and analyzed Illumina short-read whole-genome sequencing data to generate SV calls. Once the SV calls were available, the performance of the SV calling protocols was evaluated by comparison with an independently developed high-confidence truth set (Figure 7). For this purpose, we used the GIAB HG002 (ASJ son) SV dataset36 as the truth set (https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/). The findings could help us assess the pros and cons of each method and recommend an optimized SV calling protocol for NGS short reads.

1e463c1d-871c-4044-8151-6979f65af6f2_figure7.gif

Figure 7. The SV-Genie workflow.

We selected 2×250 bp BAM files (70X coverage) and 300X BAM files from HG002 (Ashkenazim Trio Son, NA24385; ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_Illumina_2x250bps/novoalign_bams/HG002.GRCh38.2x250.bam; ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/NIST_HiSeq_HG002_Homogeneity-10953946/NHGRI_Illumina300X_AJtrio_novoalign_bams/HG002.GRCh38.300x.bam) as input for the mapping-based SV calling with Lumpy73 (RRID:SCR_003253), Delly74 (RRID:SCR_004603), Manta75 (RRID:SCR_022997), Breakdancer76 (RRID:SCR_001799), BreakSeq2,77 CNVnator78 (RRID:SCR_010821), Parliament279 (RRID:SCR_019187) and SURVIVOR.61 We also ran dysgu46 and cue80 separately for SV calling. For assembly-based SV calling, the BAM files were converted to FASTQ files and used as input for SVABA48 (RRID:SCR_022998). The individual SV calls and the consolidated SV calls were then compared with the HG002 SV truth dataset using Truvari81 to assess performance.

The recently available Telomere-to-Telomere (T2T) genomes provided a starting point for developing an alternative ‘self-alignment’ method for evaluating artifacts and false positives from variant callers, including both small-variant and SV callers. Since this is a collinear synteny test, we designated it the Colins Test. Briefly, a read alignment is generated by aligning the T2T genome reads back to the corresponding finished T2T reference genome, and the alignment is then used as input for SV calling (see Extended data). SV calls produced in this setting are by definition artifacts and false positives. This alternative performance metric for SV callers serves as an independent, complementary approach to the statistics generated by Truvari.81 The self-alignment method also differs from SalsaValentina (discussed above) in that it does not require trio information to work.
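The core of the Colins Test is simple to express: in a self-alignment, every SV record a caller emits is an artifact, so counting false positives amounts to counting records. A minimal sketch with a toy VCF (not the hackathon data):

```python
def colins_test_false_positives(vcf_lines):
    """In a self-alignment (reads from a finished T2T assembly mapped back to
    that same assembly), every SV record is by definition an artifact, so the
    caller's false-positive count is simply the number of non-header records."""
    return sum(1 for line in vcf_lines if line.strip() and not line.startswith("#"))

vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tN\t<DEL>\t60\tPASS\tSVTYPE=DEL;SVLEN=-500",
    "chr2\t99999\t.\tN\t<INS>\t40\tPASS\tSVTYPE=INS;SVLEN=300",
]
fp = colins_test_false_positives(vcf)
```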

In the Colins Test for GIAB HG002, we mapped GIAB HG002 reads back to the HG002 T2T phased genome assembly (both maternal and paternal haplotypes included) to assess how the Colins Test would perform. We also aligned the HG002 T2T phased genome assembly to HG19/GRCh37 chr22, and aligned the HG002 T2T maternal genome assembly to the paternal one, so that we could inspect the NGS alignments at the matched locations, with the intent of finding the source of errors in the reference-based SV calls.

Operation

IsoComp: Comparing Isoform composition between cohorts using high-quality long-read RNA-seq

The IsoComp pipeline is written in Python and can be run in any Linux-based environment with Python>3.9. After installing IsoComp from PyPI with pip (command: pip install isocomp), the program is ready to use in the command-line terminal. The IsoComp pipeline consists of two steps, run one after the other. The first step (Creating windows) takes GTF files of multiple samples as input and produces a GTF file with clustered isoforms. The second step (Finding unique isoforms) takes the clustered GTF file created in Step 1 as input, as well as a CSV file with information about sources and the FASTA files of particular samples. This step produces a CSV file with unique isoforms. Dependencies include SQANTI3,53 minimap2 (v2.24-r1122),52 samtools (v1.15.1),54 Isoseq3 (v3.2.2) (https://github.com/pacificbiosciences/isoseq/), poetry (v1.6.1), pandas (v1.5.1), biopython (v1.80), pysam (v0.20.0), edlib (v1.3.9), numpy (v1.16), matplotlib (v3.7.1), tqdm (v4.66.1). The code is available in our GitHub repository: https://github.com/collaborativebioinformatics/isocomp.

SpikeVar & TykeVar: Mosaic variants simulation

The SpikeVar workflow requires Python>3.6.8 and includes two major steps. First, the SpikeVarDatasetCreator takes aligned sequencing reads from two samples and strategically introduces x% of mutations from one sample into another using mosdepth (0.3.2)82 (RRID:SCR_018929); the two down-sampled datasets are then merged using samtools (1.15.1)54 to create a mixed dataset that represents a sequence read dataset with mosaic variants. In the second step, the SpikeVarReporter determines the VAF for each variant in the mixed dataset using bcftools (1.18)54 based on the variant locations obtained by merging the VCF files from sample A and sample B. Variants with VAFs greater than or equal to the spike-in fraction (i.e., x%) are then selected to create a truth set for benchmarking.
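The truth-set selection in the second step can be sketched as a VAF filter; the record layout below is hypothetical, as the actual workflow derives VAFs with bcftools from the merged sample A/B VCFs.

```python
def build_truth_set(variants, spike_fraction=0.05):
    """Keep variants whose observed VAF in the mixed dataset is at least the
    spike-in fraction x (here 5%) — a sketch of the SpikeVarReporter step
    with a hypothetical record layout."""
    return [v for v in variants if v["vaf"] >= spike_fraction]

mixed = [
    {"chrom": "chr5", "pos": 1_234_567, "svtype": "DEL", "vaf": 0.052},
    {"chrom": "chr5", "pos": 2_000_000, "svtype": "INS", "vaf": 0.049},  # below 5%
    {"chrom": "chr12", "pos": 500_000, "svtype": "SNV", "vaf": 0.31},
]
truth = build_truth_set(mixed)
```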

The TykeVar package has been tested with Python>3.10. The TykeVar workflow can be broadly split into three parts. The TykeVarSimulator takes an aligned BAM file, a reference, and several parameters (such as the range of VAFs and variant sizes) to generate a set of simulated mosaic SVs and SNVs. It does so by choosing a random location and a VAF from the given range and then evaluating whether that location has sufficient coverage for the desired VAF. If that condition is met, the variant is added to the output VCF file. The TykeVarEditor is responsible for inserting the simulated variants into the query sequences from the original dataset to generate modified reads with the mosaic variants built in. For each variant, it fetches the overlapping reads from the BAM file, subsamples the reads to obtain the coverage that satisfies the desired VAF, and traverses the CIGAR string, query, and reference sequences for each alignment to find the exact location to insert the variant using pysam (0.21.0). Once a modified read is created, it is written out to a FASTQ file. Note that for all new bases (SNVs or insertions), a quality score of 60 is assigned. The parsing and traversal of the VCF, BAM, and reference files are performed using APIs from pysam, Biopython (Bio.SeqIO), and NumPy. Lastly, the TykeVarMerger re-introduces the modified reads into the original dataset using minimap2 (v2.24-r1122)52 and bwa-mem2 (v2.2.1).59 Additional non-standard dependencies include NumPy (1.25.2) and BioPython (1.81), all of which are available through the pip package management system.
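The simulator's coverage check can be sketched as follows; `reads_to_modify` is a hypothetical helper mirroring the described logic, not TykeVar's actual API.

```python
def reads_to_modify(coverage, desired_vaf, min_reads=1):
    """Return how many overlapping reads to edit so that a simulated mosaic
    variant reaches `desired_vaf` at a locus with `coverage` reads, or None
    if the coverage is insufficient (hypothetical helper)."""
    n = round(coverage * desired_vaf)
    return n if n >= min_reads else None

n_deep = reads_to_modify(50, 0.22)     # 50X locus, 22% VAF -> edit 11 reads
n_shallow = reads_to_modify(20, 0.01)  # 20X locus cannot support a 1% VAF
```

A locus that fails the check is simply skipped and a new random location is drawn.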

The Pseudomonas Graph Genome Project (PGGP): Impact of bacterial diversity on alignment accuracy for antibiotic resistant organisms

PGGP is not a packaged workflow that can be run directly, but rather an analysis pipeline for a graph genome approach applied to P. aeruginosa. The resulting P. aeruginosa graph genome can be downloaded from the repository and is available to the scientific community. The pipeline can be replicated following the instructions in our Github repository (https://github.com/collaborativebioinformatics/SVHack_metagenomics). To re-run this pipeline, we suggest using pggb56 inside a Docker container. Instructions on how to run pggb56 under these conditions can be found in the pggb Github repository (https://github.com/pangenome/pggb). If the user wants to replicate read alignments with GraphAligner,57 even though it was not adequate for short reads (see Results), we recommend running it from inside a Docker container (https://github.com/maickrau/GraphAligner). Other aligners, such as vg Giraffe,60 may be better suited to short reads, however. For the linear genome alignment, we used the reference sequence NC_002516.2 (Genome assembly ASM676v1) as a reference genome and performed alignment with BWA-MEM (BWA-0.7.17 [r1188]).59 Lastly, vg60 was used to obtain statistics from the graph alignments, while Picard tools (2.26.11) (http://broadinstitute.github.io/picard) was employed to assess the quality and statistics of the reads aligned to the linear genome. For more details, see the Github repository (https://github.com/collaborativebioinformatics/SVHack_metagenomics).

SalsaValentina: Verification of de-novo SVs from trios

SalsaValentina is implemented as a Snakemake pipeline, requiring as input the sample names and BAM files of the mother, father, and child (trio), and a reference genome. The pipeline calls SVs using Sniffles (v2.0.7),11 then merges the variant calls across the trio using both Sniffles11 multi-sample calling and SURVIVOR (v.1.0.7).61 Mendelian inconsistencies are identified with the bcftools mendelian plugin (v1.17)62 and automatically visualized with samplot (v1.3.0).63 The pipeline outputs merged VCF files, a text file of Mendelian inconsistencies, and a PDF file with visualizations from samplot.63 Additional scripts, dependent on the Python version (v3.9) and the Pandas module (v2.1.3), enable filtering of the de novo SVs identified by either Sniffles11 or SURVIVOR61 and local assembly of selected regions of interest using samtools to extract regions from the BAM file and hifiasm (v0.19.6).64

PhytoKmerCNV

The main PhytoKmerCNV pipeline is provided as a Bash script that can be deployed on any Unix system, although the pipeline was initially built and executed on AWS via DNAnexus. Software dependencies can be installed via conda/bioconda using the provided YAML file and include Python≥3.10.12, R≥4.3.2, jellyfish (2.3.0)83 (RRID:SCR_005491), ncbi-blast+ (2.14.1),84 seqtk (1.4),67 sra-tools (3.0.7) (https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software), samtools (1.17),54 fastp (0.23.4),66 and awk. Further dependencies include the following R libraries: pacman, ggplot2, cowplot, and ggpubr.

SV-Genie: Mapping- vs. assembly-based SV calling evaluation

The SV-Genie pipeline has two major branches executed via shell scripts. The first branch performs mapping-based SV calling, using BAM files as input for Parliament2 (v0.1.11),79 which executes SV callers such as Lumpy (v0.2.13),73 Delly (v0.7.2),74 Manta (v1.4.0),75 Breakdancer (v1.4.3),76 and CNVnator (v0.3.3),78 followed by SURVIVOR (v.1.0.7).61 This branch also runs dysgu (v1.6.0)46 and cue (v0.2.2)80 for SV calling. The second branch performs assembly-based SV calling, where the BAM files are converted to FASTQ files and used as input for SVABA (v1.1.3).48 The individual SV calls and the consolidated SV calls are then compared with the SV truth dataset using Truvari (4.1.0)81 to assess performance. To evaluate the impact of different read coverages on SV calling, we used seqtk (1.4)67 or samtools (1.17)54 to generate downsampled BAM files as input for SV calling.

Results and use cases

IsoComp: Comparing isoform composition between cohorts using high-quality long-read RNA-seq

In our approach, we aim to compare transcripts based on their composition rather than solely relying on coordinates, distinguishing our method from others. We have developed a tool called Isocomp, publicly available on GitHub at “https://github.com/collaborativebioinformatics/isocomp” under the MIT License. IsoComp can be installed using pip (command: pip install isocomp==0.3.0) and comprises two main steps. The initial step involves creating comparison windows (command: isocomp create_windows), which takes a GTF file for the samples to be compared and a transcript file as input, producing a cluster file for all samples that serves as a seed for the subsequent step. The next step utilizes the IsoComp algorithm (command: isocomp find_unique_isoforms) and the output from the previous step to compare shared transcripts between samples based on the composition. This step outputs unique transcripts that may overlap with other transcripts in the compared samples but differ in sequence composition.
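The first step — grouping overlapping isoform intervals into comparison windows — can be sketched as a single pass over sorted intervals; this illustrates the idea, not IsoComp's actual implementation.

```python
def create_windows(intervals):
    """Cluster transcript intervals (chrom, start, end, label) whose
    coordinates overlap into comparison windows — a minimal sketch of the
    'create_windows' idea, not IsoComp's actual implementation."""
    clusters = []
    for iv in sorted(intervals):
        chrom, start, end = iv[:3]
        last = clusters[-1] if clusters else None
        if last and last["chrom"] == chrom and start <= last["end"]:
            last["end"] = max(last["end"], end)   # extend current window
            last["members"].append(iv)
        else:
            clusters.append({"chrom": chrom, "start": start, "end": end,
                             "members": [iv]})
    return clusters

isoforms = [
    ("chr1", 100, 500, "sampleA.tx1"),
    ("chr1", 450, 900, "sampleB.tx7"),    # overlaps tx1 -> same window
    ("chr1", 2000, 2500, "sampleA.tx2"),  # separate window
    ("chr2", 100, 400, "sampleB.tx9"),
]
windows = create_windows(isoforms)
```

Each resulting window then seeds the composition-based comparison in the second step.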

We applied our tool to two publicly available samples, namely NA24385 and NA26105, and successfully differentiated isoforms between those two samples. However, further development is required.

The refinement and clustering steps of the Iso-Seq analysis generated full-length high-quality (HQ) transcripts (i.e., predicted accuracy ≥ Q20); the corresponding statistics are shown in Table 1, Table 2 and Table 3.

Table 1. IsoComp: Statistics of HQ FASTQ reads.

Samples | Number of sequences | Min. length | Avg. length | Max. length | N50 | Q20 (%) | Q30 (%)
NA26105 (HG002.3) | 205,590 | 51 | 3,298.8 | 14,887 | 3,817 | 99.95 | 99.92
NA27730 (HG002.2) | 205,884 | 52 | 3,093.6 | 12,524 | 3,695 | 99.96 | 99.93
NA24385/HG002 (HG002.1) | 411,349 | 50 | 2,171.2 | 10,767 | 2,675 | 99.96 | 99.93

Table 2. IsoComp: Statistics of isoforms mapped with Minimap2.

Samples | Total number of reads | Alignment (%) | Mapped reads | Unmapped reads
NA26105 (MM2) | 205,590 | 99.04 | 203,607 | 1,983
NA27730 (MM2) | 205,884 | 99.35 | 204,553 | 1,331
NA24385 (MM2) | 411,349 | 99.65 | 409,905 | 1,444

Table 3. IsoComp: Basic statistics of filtered transcripts from SQANTI3.

Sample | Number of genes | Number of transcripts | Number of exons
NA26105 | 5,433 | 7,632 | 72,940
NA27730 | 5,459 | 7,765 | 74,906
NA24385/HG002 | 5,656 | 9,710 | 51,196

Table 2 shows the basic statistics of the alignment of each sample (HG002: NA27730, NA24385, NA26105) to the reference genome.

Table 3 shows the isoform classification report generated by SQANTI3.

The output results (Supplementary Table 4) were generated by running IsoComp on DNAnexus (total running time: 15 min, with 16 CPUs and 8 GB RAM). Our tool can find and compare intervals, supports multithreading, is easy to install, and generates a convenient TSV output file.

In Figure 8, we present the cluster sizes, which indicate the number of transcripts whose intervals overlap and how many clusters contain such transcripts. We obtained these data by comparing three replicates of the HG002 sample, which are publicly available from GIAB (ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/Ultralong_OxfordNanopore/guppy-V3.2.4_2020-01-22/HG002_hs37d5_ONT-UL_GIAB_20200122.phased.bam).

1e463c1d-871c-4044-8151-6979f65af6f2_figure8.gif

Figure 8. Cluster size distribution.

SpikeVar: Successful spike-in of HG002 reads into HG00733 BAM file

Using the SpikeVar workflow, we successfully spiked the Genome in a Bottle (GIAB) sample HG002 (ASJ son)36 (https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/) at a 5% concentration into sample HG00733 (Puerto Rican female) (ftp://ftp.sra.ebi.ac.uk/vol1/run/ERR398/ERR3988823/HG00733.final.cram), resulting in a 5% mosaic variant allele frequency (VAF). Figure 9(i) displays the successful detection of a 287 bp deletion on chromosome 5 at a 5% rate. This deletion originated from the HG002 sample and was not present in the original HG00733. Similarly, Figure 9(ii) shows a 341 bp insertion at a 5% VAF originating from HG002.

1e463c1d-871c-4044-8151-6979f65af6f2_figure9.gif

Figure 9. SpikeVar results.

Screenshots of SVs, (i) a deletion and (ii) an insertion, originating from HG002 and successfully spiked into the HG00733 BAM.

TykeVar: Sniffles successfully detects mosaic SVs introduced by simulated ultra-long ONT reads

We successfully used the TykeVar workflow to modify reads of HG002 directly at their reference positions by including artificial mutations. To demonstrate the wide applicability of our tool, we generated a random distribution of allele frequencies between 1% and 40%, as seen in Figures 10(i) and (ii). Figure 10(i) shows the simulation of a 5,952 bp insertion at 22% VAF, while Figure 10(ii) shows the simulation of an A>T mutation on chromosome 22 at 8% VAF.

1e463c1d-871c-4044-8151-6979f65af6f2_figure10.gif

Figure 10. TykeVar results.

(i) A mosaic insertion introduced into the reads by modifying a subset of the reads. (ii) A mosaic SNP introduced into the reads by modifying a subset of the reads.

Use cases for SpikeVar and TykeVar

The simulated data generated by both the SpikeVar and TykeVar workflows include simulated SNVs, SVs, and indels, and are therefore optimal for the comparison and benchmarking of different mosaic variant callers. Mosaic variant caller sensitivity and accuracy can be determined for variable VAFs and read coverages to determine minimum requirements for detection. To validate the detection of long SVs and indels, SpikeVar is the more suitable workflow for creating a validation dataset, as it uses naturally occurring variants for spike-in and is not restricted by read length. Moreover, in simulated datasets created by the TykeVar workflow, haplotypes remain unchanged; TykeVar datasets are therefore more suitable for phasing-dependent callers. Both SpikeVar and TykeVar can be applied to long- and short-read whole-genome sequencing files. Hence, different technologies (ONT, PacBio, Illumina) can be assessed for their suitability for mosaic variant detection.

The Pseudomonas Graph Genome Project (PGGP): Impact of bacterial diversity on alignment accuracy for antibiotic resistant organisms

For the graph genome construction, we built graph representations for Pseudomonas aeruginosa using 5, 10, 20, 50, 100, and 500 genomes (see Extended data). As more assemblies are used, the graph grows in complexity and a larger share of the genome falls into accessory rather than core regions. Inspecting the alignments of the 20 assemblies used to create that graph, large segments of sequence are absent from most isolates, while others are present in all. The assembly accession numbers with additional details can be found in the metadata table in our GitHub repository: https://github.com/collaborativebioinformatics/SVHack_metagenomics.

Linear Read Alignment

As part of the PGGP project, we wanted to compare our graph approach to the regular linear genome alignment approach. We obtained 59 Illumina read datasets from SRA and aligned them to the ASM676v1 reference genome (see Methods); the alignment efficiency for this approach is shown in the Extended data. To gauge graph alignment accuracy, we also developed a method for graph-based MAPQ sampling accuracy checks. The quality scores provided in the GAM files are difficult to interpret, as few details are available about how they are calculated, and the method may differ from how MAPQ scores are calculated for linear alignments.57 To count the number of reads aligned to the graph, a criterion for a valid aligned read needs to be defined. Here, we examined the distribution of the quality scores in one of our GAM files to understand its range and choose an appropriate threshold for filtering the aligned reads and counting those that pass (see Extended data). Based on the distribution of the scores, we plotted the selectivity of the filter as a function of the chosen threshold (see Extended data).
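The threshold selection step amounts to computing, for each candidate cutoff, the fraction of alignments whose score passes it; the scores below are made up for illustration.

```python
def selectivity_curve(scores, thresholds):
    """Fraction of graph alignments whose quality score passes each candidate
    threshold — the curve used to pick a cutoff for counting a read as
    validly aligned (scores here are made up, not from our GAM files)."""
    n = len(scores)
    return {t: sum(q >= t for q in scores) / n for t in thresholds}

scores = [60, 60, 55, 42, 30, 12, 3, 0, 60, 48]
curve = selectivity_curve(scores, thresholds=[0, 10, 30, 50, 60])
```

Plotting `curve` against the threshold values reproduces the selectivity plot described above.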

Finally, we performed a comparative analysis of reads aligned to the pangenome versus a single reference genome for different assemblies, which gives a clearer view of the differences between analyses using a single reference genome and a pangenome (Figure 11).

1e463c1d-871c-4044-8151-6979f65af6f2_figure11.gif

Figure 11. Pangenome and single reference genome comparison.

A comparison of reads aligned to the pangenome versus a single reference genome for different assemblies.

Use cases for PGGP

Our pipeline offers a valuable resource for the scientific community: a collection of graph genomes using different numbers of isolates, representing different iterations of the Pseudomonas aeruginosa pangenome. These graphs were constructed using a carefully curated list of publicly available genome assemblies from bacterial isolates. The list is also provided alongside the graph genomes, allowing for transparency and reproducibility. Additionally, the pipeline provides a flexible framework for generating various iterations of these graph genomes. This enables users to experiment with different parameters and isolate selections, further tailoring the resource to their specific research needs.

SalsaValentina: Verification of de-novo SVs from trios

SalsaValentina compares two different methods of merging SV calls within the trio: multi-sample calling using Sniffles11 and merging using SURVIVOR.61 The two methods give different numbers of overall SV calls within the trio, as well as different percentages of SVs that are inconsistent with Mendelian inheritance. We found a total of approximately 32,000 SV calls in our merged call set using either Sniffles11 multi-sample calling or SURVIVOR.61 For Sniffles11 multi-sample calling, 5.2% of these were Mendelian inconsistent, while for SURVIVOR61 2.4% were inconsistent (Extended data). The different numbers of inconsistent SV calls between the two methods are due to differences in genotype assignment between the tools, with SURVIVOR61 treating some variants as missing, whereas Sniffles11 reports them as reference.

A Mendelian inconsistent deletion was identified in HG002 at chr7:142,786,222-142,796,849 by the Sniffles11 multi-sample calling method (Figure 12(A)). This deletion lies in the T-cell receptor beta locus and is thus likely the result of somatic recombination rather than a de novo germline variant. However, it can still be used to demonstrate the usability of our method. This deletion was called heterozygous in HG002, with 12 reads supporting the reference and 13 supporting the variant, while it was homozygous reference in HG003 and HG004, supported by 45 and 44 reads, respectively. In addition, GIAB previously reported a de novo deletion in HG002 at chr17:51,417,826–51,417,932 using the GRCh37 reference as part of their v0.6 SV benchmark set, which was derived from high-confidence calls supported by multiple methods.36 This deletion was also identified in this study, at chr17:53,340,465-53,340,571 when using GRCh38 as the reference (Figure 12(B)). This heterozygous deletion was supported by 30 reads and the reference at this location by 27 reads, while the parents had only reads supporting the reference allele (65 in HG003 and 72 in HG004).

1e463c1d-871c-4044-8151-6979f65af6f2_figure12.gif

Figure 12. Potential de novo deletions visualized in samplot.

Candidate de novo deletions at (A) chr7:142,757,892-142,824,789 and (B) chr17:51,417,826–51,417,932. The top panel shows a deletion in HG002, which is absent in the parents (father HG003 middle panel, and mother HG004 bottom panel).

Use cases for SalsaValentina

SalsaValentina can aid users in identifying and confirming SVs that demonstrate Mendelian inconsistency. We envision two primary use cases. First, genuine de novo SVs may be candidates for rare disease diagnosis. Second, candidate SVs that are Mendelian inconsistent due to coverage issues or variant calling inconsistencies help to inform the error modes of sequencing and SV calling software, and to refine methods for accurate SV calling.

PhytoKmerCNV

We executed the PhytoKmerCNV pipeline on a dataset of 32 resequenced tomato genomes and compared the CNV estimates produced by the pipeline with empirical NBS-LRR gene counts parsed from genome annotations. Briefly, we counted the number of genes with NBS-LRR domains among the annotated peptides for each genome and used the resulting value as a relative ground truth, against which the k-mer estimates were compared to determine the accuracy of our gene copy number predictions. The corresponding results are shown in Figure 13. Our approach did not yield a significant result, and thus we were unable to confidently infer the copy number of each NBS-LRR gene in the resequenced tomato genomes. There is much room for further refinement of PhytoKmerCNV.

1e463c1d-871c-4044-8151-6979f65af6f2_figure13.gif

Figure 13. PhytoKmerCNV results.

The regression analysis shows the relationship between the number of NBS-LRR genes estimated from the gene annotations and the 21-mer based abundance estimates derived from captured reads. The R² value of 0.011 indicates a very weak correlation between these variables, while the p-value of 0.58 indicates that the result is not statistically significant.
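The R² comparison behind the regression can be reproduced in plain Python as a sketch; the values below are made up for illustration (the actual analysis used the 32 tomato genomes and an R script).

```python
from math import sqrt

def pearson_r(x, y):
    """Plain-Python Pearson correlation; squaring it gives the R^2 reported
    for the annotation-count vs k-mer-estimate comparison."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up annotation counts vs k-mer-based estimates, for illustration only
annotated = [180, 205, 190, 240, 210, 195, 220, 230]
kmer_est = [150, 260, 170, 200, 255, 160, 210, 190]
r_squared = pearson_r(annotated, kmer_est) ** 2
```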

Use cases for PhytoKmerCNV

There are three potential use cases for PhytoKmerCNV. First, it would be interesting to compare CNV patterns of NBS-LRR genes within a species and/or across multiple plant species, highlighting variable and conserved patterns. Second, it would be informative to estimate CNV in a non-model plant genome lacking extensive resequencing data, with output including gene identifiers and inferred copy numbers. Third, extending this approach to assess CNV using low-pass sequencing data would reduce associated costs. These practical use cases illustrate the versatility of PhytoKmerCNV for CNV analysis in plant genomes. Researchers can adapt the tool to address a wide range of research questions, from investigating genetic diversity to understanding the functional implications of CNVs in plant biology.

SV-Genie: Mapping- vs. assembly-based SV calling evaluation

We ran the SV-Genie pipeline on the GIAB HG002 2x250bp WGS Illumina short-reads as a use case. Specifically, we executed the mapping-based SV-calling including Parliament2,79 dysgu46 and cue,80 as well as the assembly-based SV-calling SVABA.48 We then compared the SV calls with the GIAB HG002 SV reference data set v0.636 via Truvari.81 The performance stats are summarized in Extended data.

This use case gave us a number of insights:

  • 1. SV callers have a false negative rate of 50-60% or higher, and the false positive rate is also high, consistent with previous observations.46,79

  • 2. SV callers have much better performance for deletions than for duplications/insertions, suggesting more challenges for duplication/insertion SV calling.

  • 3. Parliament279 is designed to launch all six included SV callers, but two of these SV callers (Breakdancer76 and Lumpy73) failed to run, even though all of these SV callers were included as part of the Docker image.

  • 4. Parliament279 generated final SV calls from the four successful SV caller runs (BreakSeq2,77 CNVnator,78 Delly2,74 and Manta75) even though two of the included SV callers failed to generate any results.

  • 5. The dysgu46 SV caller alone out-performed Parliament279 with better recall and F1-score, even though Parliament279 integrated the results from multiple SV callers.

  • 6. SV calling performance has a strong dependency on coverage: the 70X coverage 2×250bp data has the best performance, while the 300X coverage dataset performs significantly worse. This suggests either the existence of an optimal coverage for SV calling or an effect of the read length difference between the 300X data and the 2×250bp data.

  • 7. SVABA48 is the only assembly-based SV caller evaluated, but its performance with default settings is far worse than that of the mapping-based SV callers. SVABA48 for the 2×250bp data (70X) completed in a few hours by assembling only clipped/discordant/unmapped/gapped reads (the default). When the -r all option is turned on to generate an assembly for the whole genome, the job ran for seven days without completing on a large DNAnexus instance (mem2_ssd1_v2_x96: 375 GB total memory, 3348 GB total storage, 96 cores).
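The Truvari metrics referenced in the observations above reduce to precision, recall, and F1 over TP/FP/FN counts, which can be sketched as follows (the counts below are illustrative, not the hackathon's results):

```python
def sv_calling_metrics(tp, fp, fn):
    """Precision, recall, and F1 from Truvari-style TP/FP/FN counts
    (illustrative helper; Truvari reports these in its summary JSON)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Illustrative counts only: 4,000 true positives, 2,000 false positives,
# 6,000 false negatives
p, r, f1 = sv_calling_metrics(tp=4000, fp=2000, fn=6000)
```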

To complement the performance metrics generated by Truvari,81 we created a genome browser instance containing the read alignment with T2T reference genome to demonstrate manual review and confirmation. The screenshots shown in Figure 14 and Extended data were created with an automated script that prepared the alignments and the genome browser setup.

1e463c1d-871c-4044-8151-6979f65af6f2_figure14.gif

Figure 14. Maternal vs paternal HG002 T2T visualization for chr22.

Notably, the maternal chromosome is 4 Mbp longer than the paternal chromosome.

Conclusion and next steps

IsoComp: Comparing Isoform composition between cohorts using high-quality long-read RNA-seq

While various tools exist for comparing transcripts, they often overlook the significance of the transcript sequence order. To address this gap, we have devised a prototype algorithm capable of comparing transcripts by considering both their coordinates and sequence compositions. Our ongoing objective is to optimize the algorithm for scalability, with a specific focus on employing memory-efficient techniques suitable for extensive projects, such as the 1000 Genomes Project. This optimization will involve refining clustering algorithms and adopting an alignment-free methodology to facilitate transcript comparisons across diverse samples. Presently, our algorithm generates a basic table indicating shared and unique transcripts. However, our ultimate future goal is to exploit this table to provide more comprehensive insights, including information on the spatial relationships between transcripts within each sample. This advancement aims to enhance the depth of analysis and contribute to a more nuanced understanding of transcriptomic data.

SpikeVar & TykeVar

SpikeVar and TykeVar successfully enabled the creation of simulated genomic data containing known mosaic variants with VAFs between 5% and 20%. To our knowledge, these are the first workflows that simulate mosaic variants for the benchmarking and quality control of mosaic variant callers. The strengths of our workflows include the rapid and reproducible creation of simulated genomic truth datasets with accompanying index VCF files containing mosaic variant locations and VAFs. In addition, the output files are widely compatible with mosaic variant callers and cover a variety of variant types, including SNVs, SVs, and indels. Both workflows require only basic packages and are therefore easily installed and implemented.

Benchmarking mosaic variant callers is essential in order to generate reliable data for evaluating disease associations of mosaic variants. Therefore, we plan to convert both TykeVar and SpikeVar into one-step tools for the generation of simulated data. We also want to give the user the option to define a global VAF as well as variant-specific VAFs. In a final step, we will compare our simulated data with data from physically mixed and sequenced samples.

The Pseudomonas Graph Genome Project (PGGP): Impact of bacterial diversity on alignment accuracy for antibiotic resistant organisms

Pangenomes have been used in the past to elucidate the core genomes of pathogens, to improve detection of horizontal gene transfer events, and to study their evolutionary trajectories in different environments. These applications greatly expand our understanding of bacterial and host-pathogen dynamics with practical applications to both medicine and agriculture.

Metagenomics and sequencing of clinical isolates are gaining traction for the identification of antimicrobial resistance profiles and for diagnosis. Pan-genomics can greatly benefit these clinical applications. Creating pangenomes provides additional insight into pathogen evolution and transmission. For example, including local isolates in pangenomes could inform outbreak investigation efforts and lead to improved infection prevention within hospital systems. Mapping reads directly to pangenomes is a recent advance that may improve the detection of polymorphisms related to antimicrobial resistance or virulence. Examining practical considerations and comparing against standard practices demonstrates both the promise and the drawbacks of alignment to pangenomes.

SalsaValentina: Verification of de-novo SVs from trios

SalsaValentina enabled the identification of putative de novo SVs, two of which were investigated in further detail. One was determined to occur in the T-cell receptor locus and is thus likely a somatic event, which may not be relevant for the use cases of de novo disease-associated SVs or variant-calling refinement. However, we were able to verify the deletion in HG002 using a local assembly, demonstrating the capability of the pipeline. In the future, results could be restricted to particular regions of interest in the genome, excluding known recombination regions. Furthermore, visualization of the candidate de novo SVs could aid in screening candidates in problematic regions or regions of interest. In addition, we observed and confirmed a previously reported de novo SV in HG002 at chr17:51417826–51417932. This variant was identified as part of a comprehensive benchmarking effort for HG002, demonstrating the ability of SalsaValentina to identify genuine events.36 One limitation of our local-assembly verification is that we used only reads that mapped near the putative SV. In the future, we recommend including unmapped reads in the assembly so that reads that failed to map can still be incorporated into the contigs.
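The trio-based candidate selection step can be sketched as a simple reciprocal-overlap filter. This is a hedged illustration only; SalsaValentina's actual matching logic and thresholds may differ, and the dictionary-based SV representation is ours:

```python
def overlaps(a, b, reciprocal=0.5):
    """True if two SVs of the same type on the same chromosome share at
    least `reciprocal` of each interval (reciprocal-overlap test)."""
    if a["chrom"] != b["chrom"] or a["svtype"] != b["svtype"]:
        return False
    shared = min(a["end"], b["end"]) - max(a["start"], b["start"])
    if shared <= 0:
        return False
    return (shared / (a["end"] - a["start"]) >= reciprocal and
            shared / (b["end"] - b["start"]) >= reciprocal)

def candidate_de_novo(child_svs, mother_svs, father_svs):
    """SVs present in the child but matching neither parent."""
    parental = mother_svs + father_svs
    return [sv for sv in child_svs
            if not any(overlaps(sv, p) for p in parental)]

# Toy trio: the chr1 deletion is inherited, the chr17 one is not.
child = [{"chrom": "chr17", "start": 51417826, "end": 51417932, "svtype": "DEL"},
         {"chrom": "chr1", "start": 1000, "end": 2000, "svtype": "DEL"}]
mother = [{"chrom": "chr1", "start": 1010, "end": 1990, "svtype": "DEL"}]
father = []
candidates = candidate_de_novo(child, mother, father)
# only the chr17 deletion survives as a de novo candidate
```

Candidates emerging from this filter would then go through the local-assembly verification described above.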

PhytoKmerCNV

We set out to fill a critical niche within plant genomics by building a pipeline for analyzing genomes that have not been extensively resequenced, particularly non-model plant systems with limited genomic resources. Our method, PhytoKmerCNV, shows great promise as a reference-free approach for genotyping copy number variation from whole-genome sequencing reads. Moreover, its ploidy-agnostic nature makes it adaptable to genomes with varying levels of ploidy. Such k-mer-based approaches, known for their sensitivity, also raise the intriguing, if risky, possibility of using low-pass sequencing, potentially opening new avenues for exploring CNV dynamics. The proposed tool depends on identifying reads that match a specific protein domain; a major risk of this approach is therefore the potential to miss reads that originate from sequences of interest but lack the captured protein domain.

Moving forward, several promising improvements can be made. Currently, the pipeline generates a k-mer-based estimate from the ratio of the sum of k-mer counts in the reads captured for NBS-LRR genes over the sum of k-mer counts in the total sample. One challenge this approach must overcome is the extremely large number of k-mers that return a count of 1. In future hackathons, the sum could be replaced with other summary statistics, such as the mean or median, and/or statistical methods that account for the inflated counts. Additionally, several other gene families could be studied (e.g., transcription factors with specific DNA-binding domains, cytochrome P450s, kinase families, heat shock proteins, pathogenesis-related proteins, MADS-box genes, ABC transporters, RNA-binding proteins, and/or late embryogenesis abundant proteins). Each of these has its own pros and cons and must be considered with respect to the type of biological data.
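The proposed switch from sums to medians, with singleton k-mers removed, can be illustrated with a self-contained Python sketch. The toy reads and function names are ours, not the pipeline's; a real run would count k-mers from FASTQ input with a dedicated counter such as Jellyfish or KMC:

```python
from collections import Counter
from statistics import median

def kmers(seq, k=21):
    """Yield all overlapping k-mers of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def count_kmers(reads, k=21):
    counts = Counter()
    for r in reads:
        counts.update(kmers(r, k))
    return counts

def cnv_estimate(captured_reads, all_reads, k=21, min_count=2):
    """Median k-mer count in domain-captured reads over the sample-wide
    median, dropping singleton k-mers (count == 1) that inflate sums."""
    cap = [c for c in count_kmers(captured_reads, k).values() if c >= min_count]
    tot = [c for c in count_kmers(all_reads, k).values() if c >= min_count]
    if not cap or not tot:
        return None
    return median(cap) / median(tot)

# Toy sample: single-copy background at depth 2, a captured gene family
# whose reads appear at depth 4, i.e. roughly two copies per background copy.
background = ["ACGTTGCAATCCGGA", "TTTACGGATCAGCTA", "GGTTCGAACAGGTAC"]
captured = ["CATGGACCTTGAACG"] * 4
est = cnv_estimate(captured, background * 2 + captured, k=5)
```

The median is far less sensitive than the sum to the long tail of error k-mers, which is exactly the inflated-count problem noted above.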

Upon further modification and refinement, we expect the pipeline to generate accurate CNV estimations and help researchers observe the range of variability of CNV across plant genomes. Some genomes may have higher copy numbers of specific NBS-LRR genes, while others may have fewer copies, reflecting the natural genetic diversity within the tomato species. Additionally, an optimized method would allow for further validation of the previously annotated NBS-LRR genes in genome assemblies, and potentially highlight novel gene copies or variants which were not initially annotated. Extending the application of PhytoKmerCNV to a broader range of plant species, especially those with unique genomic characteristics, will enhance its utility in diverse research contexts. Incorporating phenotypic data alongside the CNV analysis can uncover genotype-phenotype relationships, shedding light on the functional significance of CNVs, particularly in the context of disease resistance and other traits.

In the future, we believe it would be valuable to develop a user-friendly interface and detailed documentation to make the tool more accessible to researchers with varying levels of computational expertise. Finally, we encourage collaboration and feedback from the research community on this approach, with the hope of fostering improvements and adaptations of PhytoKmerCNV to meet evolving research needs.

SV-Genie

SV-Genie provides a generalized framework for evaluating the performance of SV-calling tools. Our analysis shows that SV-calling tools still have a long way to go: even the best-performing SV caller (dysgu46) has a modest recall rate and a high false-positive rate, confirming previous reports.46,79 Another issue is that some SV callers (cue80 and SVABA48) perform far worse with default settings than reported in the original publications. In addition, SVABA48 assigns BND as the SVTYPE for all SV calls but does not provide a script to convert its output into a standardized VCF with the conventional SVTYPE values (e.g., DEL, DUP, INS), while Parliament279 uses different chromosome naming conventions. Looking forward, we strongly recommend that all SV callers emit a minimum set of standardized fields, such as SVTYPE, END, and SVLEN, to simplify both end-user analysis and performance evaluation.
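The kind of post-hoc harmonization this evaluation requires can be sketched in a few lines of Python. This is a simplified illustration, not SV-Genie's code; a production tool would parse VCF records with a library such as pysam, and symbolic ALT alleles would need their SVLEN taken from the INFO field:

```python
def normalize_chrom(chrom):
    """Harmonize '1' vs 'chr1' naming to the 'chr'-prefixed convention."""
    return chrom if chrom.startswith("chr") else "chr" + chrom

def normalize_record(pos, ref, alt, info):
    """Fill in SVTYPE/END/SVLEN for callers that omit them or label every
    call BND. Symbolic ALTs (<DEL>, <DUP>, ...) carry the type in the
    allele itself; explicit alleles are classified by length difference."""
    svtype = info.get("SVTYPE")
    if svtype in (None, "BND"):
        if alt.startswith("<"):
            svtype = alt.strip("<>")                 # e.g. <DUP> -> DUP
        elif len(ref) != len(alt):
            svtype = "DEL" if len(ref) > len(alt) else "INS"
    svlen = info.get("SVLEN")
    if svlen is None and not alt.startswith("<"):
        svlen = len(alt) - len(ref)                  # explicit alleles only
    end = info.get("END")
    if end is None and svlen is not None:
        end = pos + abs(svlen) if svtype == "DEL" else pos
    return {"SVTYPE": svtype, "SVLEN": svlen, "END": end}

# A BND-labeled record with an explicit 100 bp deletion allele.
rec = normalize_record(pos=1000, ref="A" + "T" * 100, alt="A",
                       info={"SVTYPE": "BND"})
# rec: {'SVTYPE': 'DEL', 'SVLEN': -100, 'END': 1100}
```

Requiring callers to emit these fields directly would make such glue code, and per-caller special cases, unnecessary.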

Ethics and consent

All data and software used in this study are open source.

How to cite this article
Deb SK, Kalra D, Kubica J et al. The fifth international hackathon for developing computational cloud-based tools and resources for pan-structural variation and genomics [version 1; peer review: 2 approved with reservations]. F1000Research 2024, 13:708 (https://doi.org/10.12688/f1000research.148237.1)

Open Peer Review

Reviewer Report 23 Jul 2024
Istvan Albert, The Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, USA 
Approved with Reservations
First, let me state that I think the authors are making many valuable contributions and that each tool may be of notable value. That being said, each software appears to be in various states of completeness and usability. 

…

How to cite this report
Albert I. Reviewer Report For: The fifth international hackathon for developing computational cloud-based tools and resources for pan-structural variation and genomics [version 1; peer review: 2 approved with reservations]. F1000Research 2024, 13:708 (https://doi.org/10.5256/f1000research.162523.r301963)
Reviewer Report 13 Jul 2024
German Demidov, Institute of Medical Genetics and Applied Genomics, University of Tübingen, Tübingen, Germany 
Approved with Reservations
In the paper by Deb, Kalra, Kubica, Strickler, Truong, Zeng et al., the authors describe the outcomes of the 5th International Hackathon on the Development of Computational Tools for Pan-Structural Variants and Genomics. While publishing papers that cover educational events …

How to cite this report
Demidov G. Reviewer Report For: The fifth international hackathon for developing computational cloud-based tools and resources for pan-structural variation and genomics [version 1; peer review: 2 approved with reservations]. F1000Research 2024, 13:708 (https://doi.org/10.5256/f1000research.162523.r297030)