Sixth Annual BCM Hackathon on Structural Variation and Pangenomics

Farhang Jaryani; Bishnu Adhikar; Shaghayegh Beheshti; Sarah Fross; Jędrzej Kubica; Jen-Yu Wang; Aanuoluwa Adekoya; Daniel P. Agustinho; Oluwaseun Akinsulire; Francesco Andreace; Abolhassan Bahari; Christian Brueffer; Siyuan Cheng; Jonah Cullen; Kristen Curry; Ryan Doughty; Adam English; Neda Ghohabi Esfahani; Natali Gulbahce; Tina Han; Nha Van Huynh; Michal Izydorczyk; Minal Jamsandekar; Emrah Kacar; Arthur Shem Kasambula; Rupesh K. Kesharwani; Divya Kalra; Shwetha V Kumar; Iva Kotásková; Callum MacPhillamy; Sina Majidian; Mauricio Moldes; Abraham (Jon) Moller; Rajarshi Mondal; Eleni Mourouzidou; Michael Nute; Dmitrii Olisov; Anika Pallapothu; Meghana Ram; Marcus Chan Hua Rui; Philippe Sanio; Russel T. Santos; Michael Olufemi; Narges SangaraniPour; Moustafa Shokrof; Sam Stroupe; Gobikrishnan Subramaniam; Todd J. Treangen; Pankhuri Wanjari; Umran Yaman; Farha zain; Xinchang Zheng; Fritz J Sedlazeck; Ben Busby

doi:10.12688/f1000research.170665.1


Software Tool Article


[version 1; peer review: 1 approved with reservations]

Farhang Jaryani ^1,2, Bishnu Adhikar³, Shaghayegh Beheshti ⁴, Sarah Fross ⁵, Jędrzej Kubica ^6,7, Jen-Yu Wang⁸, Aanuoluwa Adekoya⁹, Daniel P. Agustinho¹⁰, Oluwaseun Akinsulire¹¹, Francesco Andreace¹², Abolhassan Bahari¹³, Christian Brueffer¹⁴, Siyuan Cheng¹⁰, Jonah Cullen¹⁵, Kristen Curry¹², Ryan Doughty¹⁶, Adam English¹⁷, Neda Ghohabi Esfahani¹⁸, Natali Gulbahce¹⁹, Tina Han²⁰, Nha Van Huynh²¹, Michal Izydorczyk¹⁰, Minal Jamsandekar⁴, Emrah Kacar²², Arthur Shem Kasambula²³, Rupesh K. Kesharwani¹⁰, Divya Kalra¹⁰, Shwetha V Kumar²⁴, Iva Kotásková²⁵, Callum MacPhillamy²⁶, Sina Majidian²⁷, Mauricio Moldes²⁸, Abraham (Jon) Moller²⁹, Rajarshi Mondal³⁰, Eleni Mourouzidou³¹, Michael Nute¹⁶, Dmitrii Olisov³², Anika Pallapothu³³, Meghana Ram³⁴, Marcus Chan Hua Rui³⁵, Philippe Sanio^10,36, Russel T. Santos³⁷, Michael Olufemi³⁸, Narges SangaraniPour³⁹, Moustafa Shokrof⁴⁰, Sam Stroupe⁵, Gobikrishnan Subramaniam⁴¹, Todd J. Treangen^16,42,43, Pankhuri Wanjari⁴⁴, Umran Yaman⁴⁵, Farha zain⁴⁶, Xinchang Zheng¹⁰, Fritz J Sedlazeck ^4,10,16, Ben Busby⁴⁷


PUBLISHED 07 Nov 2025

Author details

¹ Baylor College of Medicine Department of Pediatrics, Houston, Texas, 77030, USA
² Cancer and Hematology Center, Texas Children’s Hospital, Houston, TX, 77030, USA
³ Department of Biological Sciences, University of Alabama, Tuscaloosa, 35401, USA
⁴ Baylor College of Medicine Department of Molecular and Human Genetics, Houston, Texas, USA
⁵ Department of Veterinary Pathobiology, Texas A&M University College of Veterinary Medicine and Biomedical Sciences, College Station, 77840, USA
⁶ Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Warsaw, Poland
⁷ Univ. Grenoble Alpes, CNRS, UMR 5525, TIMC / MAGe, 38000, Grenoble, France
⁸ University of California-Irvine, Department of Ecology and Evolutionary Biology, Irvine, California, USA
⁹ The University of Tennessee Knoxville Department of Microbiology, Knoxville, Tennessee, USA
¹⁰ Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
¹¹ Department of Biological Sciences, Louisiana State University, Baton Rouge, LA, USA
¹² Institut Pasteur, Université Paris Cité, Sequence Bioinformatics unit, Paris, 75015, France
¹³ High Institute for Research and Education in Transfusion Medicine, Tehran, Tehran Province, Iran
¹⁴ Department of Clinical Sciences, Lund University, Lund, Sweden
¹⁵ Department of Veterinary Population Medicine, College of Veterinary Medicine, University of Minnesota, St. Paul, USA
¹⁶ Rice University Department of Computer Science, Houston, Texas, USA
¹⁷ Baylor College of Medicine, Houston, Texas, USA
¹⁸ Department of Bioengineering, Northeastern University, 360 Huntington Ave, Boston, MA, 02115, USA
¹⁹ CareDx, 8000 Marina Blvd, Brisbane, CA, 94005, USA
²⁰ Twist Bioscience, South San Francisco, CA, 94080, USA
²¹ The University of Alabama at Birmingham Division of Nephrology, Birmingham, Alabama, USA
²² Complex Trait Genomics Laboratory, Smurfit Institute of Genetics, Trinity College Dublin, Dublin, Ireland
²³ Incident Management Team, Ministry of Health, Uganda, Uganda
²⁴ Section of Epidemiology and Population Sciences, Baylor College of Medicine, Houston, USA
²⁵ DataSentics, Prague, Czech Republic
²⁶ Davies Livestock Research Centre, University of Adelaide, Roseworthy, SA, Australia
²⁷ Johns Hopkins University Department of Computer Science, Baltimore, Maryland, USA
²⁸ Centre for Genomic Regulation (CRG), C/ del Dr. Aiguader, 88, 08003, Barcelona, Spain
²⁹ Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging (NIA), National Institutes of Health (NIH), Bethesda, Maryland, 20892, USA
³⁰ Pondicherry University Department of Bioinformatics, Pondicherry, India
³¹ Department of Medicine, University of Crete, Crete, Greece
³² Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
³³ Novaltech, R&D Division, USA, USA
³⁴ Icahn School of Medicine at Mount Sinai Department of Medicine, New York, New York, USA
³⁵ Home Team Science and Technology Agency, Singapore, Singapore
³⁶ Graduate Program in Quantitative and Computational Biosciences, Baylor College of Medicine, Houston, TX, 77030, USA
³⁷ Museum of Natural History - University of the Philippines Los Baños, Los Baños, Philippines
³⁸ University of Massachusetts Lowell, Lowell, Massachusetts, USA
³⁹ Shahid Beheshti University of Medical Sciences, Tehran, Tehran Province, Iran
⁴⁰ Oxford Nanopore Technologies, Oxford, UK
⁴¹ The Patrick G Johnston Centre for Cancer Research, Queen’s University Belfast, Belfast, UK
⁴² Ken Kennedy Institute, Rice University, Houston, TX, USA
⁴³ Rice University Department of Bioengineering, Houston, Texas, USA
⁴⁴ University of Chicago Department of Pathology, Chicago, Illinois, USA
⁴⁵ UK Dementia Research Institute, London, England, UK
⁴⁶ Department of Biotechnology and Genetic Engineering, University of Ain-shams, Cairo, Egypt
⁴⁷ DNAnexus, Inc., Mountain View, CA, 94040, USA

Farhang Jaryani
Roles: Software, Writing – Original Draft Preparation, Writing – Review & Editing

Bishnu Adhikar
Roles: Software, Writing – Original Draft Preparation, Writing – Review & Editing

Shaghayegh Beheshti
Roles: Software, Writing – Original Draft Preparation, Writing – Review & Editing

Sarah Fross
Roles: Software, Writing – Original Draft Preparation, Writing – Review & Editing

Jędrzej Kubica
Roles: Software, Writing – Original Draft Preparation, Writing – Review & Editing

Jen-Yu Wang
Roles: Software, Writing – Original Draft Preparation, Writing – Review & Editing

Aanuoluwa Adekoya
Roles: Software

Daniel P. Agustinho
Roles: Software

Oluwaseun Akinsulire
Roles: Software

Francesco Andreace
Roles: Software

Abolhassan Bahari
Roles: Software

Christian Brueffer
Roles: Software

Siyuan Cheng
Roles: Software

Jonah Cullen
Roles: Software

Kristen Curry
Roles: Software

Ryan Doughty
Roles: Software

Adam English
Roles: Software

Neda Ghohabi Esfahani
Roles: Software

Natali Gulbahce
Roles: Software

Tina Han
Roles: Software

Nha Van Huynh
Roles: Software

Michal Izydorczyk
Roles: Software

Minal Jamsandekar
Roles: Software

Emrah Kacar
Roles: Software

Arthur Shem Kasambula
Roles: Software

Rupesh K. Kesharwani
Roles: Software

Divya Kalra
Roles: Software

Shwetha V Kumar
Roles: Software

Iva Kotásková
Roles: Software

Callum MacPhillamy
Roles: Software

Sina Majidian
Roles: Software

Mauricio Moldes
Roles: Software

Abraham (Jon) Moller
Roles: Software

Rajarshi Mondal
Roles: Software

Eleni Mourouzidou
Roles: Software

Michael Nute
Roles: Software

Dmitrii Olisov
Roles: Software

Anika Pallapothu
Roles: Software

Meghana Ram
Roles: Software

Marcus Chan Hua Rui
Roles: Software

Philippe Sanio
Roles: Software

Russel T. Santos
Roles: Software

Michael Olufemi
Roles: Software

Narges SangaraniPour
Roles: Software

Moustafa Shokrof
Roles: Software

Sam Stroupe
Roles: Software

Gobikrishnan Subramaniam
Roles: Software

Todd J. Treangen
Roles: Software, Supervision

Pankhuri Wanjari
Roles: Software

Umran Yaman
Roles: Software

Farha zain
Roles: Software

Xinchang Zheng
Roles: Software

Fritz J Sedlazeck
Roles: Software, Supervision, Writing – Original Draft Preparation, Writing – Review & Editing

Ben Busby
Roles: Software, Supervision, Writing – Original Draft Preparation, Writing – Review & Editing


This article is included in the Bioinformatics gateway.

This article is included in the Hackathons collection.

Abstract

Background

Structural variants (SVs) and metagenomics remain challenging areas in genomics, requiring new tools and collaborative solutions. Hackathons provide a rapid, team-based approach to prototyping and innovation.

Methods

In August 2024, 48 scientists from six continents convened at Baylor College of Medicine for the Sixth Structural Variant Codeathon. Participants worked in interdisciplinary teams over three days, using public datasets and cloud-based infrastructure to design and implement computational tools.

Results

Eight projects were developed, addressing topics such as tandem repeat annotation, structural variant discovery, benchmarking, pangenome visualization, and machine learning applications. Each project produced open-source software, with repositories openly available on GitHub and archived on Zenodo.

Conclusions

The hackathon fostered global collaboration and generated reproducible, community-driven tools. These outputs provide new resources for structural variation and metagenomics research and demonstrate the effectiveness of hackathons in advancing genomic science.

Keywords

Population frequency, Structural variants, Mosaicism, Cancer, LLM, Metagenome, tandem repeats, haplotype structure, ancestral recombination graphs

Corresponding authors: Farhang Jaryani, Bishnu Adhikar, Shaghayegh Beheshti, Sarah Fross, Jędrzej Kubica, Jen-Yu Wang, Fritz J Sedlazeck, Ben Busby

Competing interests: This article reflects the views of the author and should not be construed to represent FDA's views or policies. BB is a full-time employee of DNAnexus, Inc. FS is sponsored by Illumina, PacBio, and ONT.

Grant information: Shwetha V Kumar is supported by CPRIT grant #RP210037 (PI Aaron Thrift).
Fritz J Sedlazeck is supported by NIH grants 1UG3NS132105-01 and 1U01HG011758-01.

The research was supported by Cancer Prevention and Research Institute of Texas under grant number: #RP210037

Sarah Fross is supported by a training grant from the National Institutes of Health under Award Number 5T32GM135748-04.

Shaghayegh Beheshti is supported by a training grant from the National Institutes of Health under Award Number 5T32GM139534-04 and NHGRI U01 HG011758.

Ryan Doughty is supported by a training fellowship from the Gulf Coast Consortia, on the NLM Training Program in Biomedical Informatics & Data Science (T15LM007093).

Jędrzej Kubica is supported by the Ministry of Science and Higher Education (Poland) as a project under the program Excellence Initiative – Research University (2020–2026) (decision no.: IV.2.3./30/2024) and the France 2030 state funding managed by the National Research Agency with the reference "ANR-22-PEPRSN-0013".

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2025 Jaryani F et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Jaryani F, Adhikar B, Beheshti S et al. Sixth Annual BCM Hackathon on Structural Variation and Pangenomics [version 1; peer review: 1 approved with reservations]. F1000Research 2025, 14:1231 (https://doi.org/10.12688/f1000research.170665.1) First published: 07 Nov 2025, 14:1231 (https://doi.org/10.12688/f1000research.170665.1) Latest published: 07 Nov 2025, 14:1231 (https://doi.org/10.12688/f1000research.170665.1)

Introduction

Baylor College of Medicine hosted the sixth annual structural variant and pangenomics Hackathon on August 28th-30th, 2024. We reported the results of some of the previous hackathons as articles (Deb et al. 2024; Walker et al. 2022; Mc Cartney et al. 2021).

1. Tandem Repeats

Tandem repeats (TRs) are DNA sequences consisting of two or more bases repeated multiple times in a head-to-tail pattern along a chromosome (Levinson 2019). Typically found in non-coding regions, TRs play significant roles in genetic variation and are implicated in various diseases (Depienne and Mandel 2021). They also serve as powerful tools in DNA fingerprinting for forensic analysis (Butler 2006).

TR subtypes are classified by the length of the repeated motif: short tandem repeats (STRs) have motifs of 2 to 6 base pairs (Butler 2006), while variable number tandem repeats (VNTRs) have motifs of 7 to 100 base pairs (Bakhtiari et al. 2021). Additionally, TRs can be categorized by their genomic context or function, such as alpha satellite repeats in centromeres (McNulty and Sullivan 2018; A. English et al. 2023) or rDNA repeats (Kobayashi 2014). Despite their importance, tandem repeats are challenging to analyze.
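The motif-length ranges quoted above can be written as a small lookup. This is only an illustrative sketch: the function name `classify_tandem_repeat` and the "unclassified" label for motifs outside both ranges are assumptions made here, not part of tdb or of any cited classification scheme.

```python
def classify_tandem_repeat(motif: str) -> str:
    """Classify a repeat by motif length using the ranges quoted in the text.

    STR: motif of 2-6 bp; VNTR: motif of 7-100 bp.
    Anything else is labeled "unclassified" (an assumption of this sketch).
    """
    motif_length = len(motif)
    if 2 <= motif_length <= 6:
        return "STR"
    if 7 <= motif_length <= 100:
        return "VNTR"
    return "unclassified"


# Hypothetical motifs, used only to exercise the function.
for example_motif in ("CAG", "GGGGCC", "ACGTACGTAC"):
    print(example_motif, len(example_motif), classify_tandem_repeat(example_motif))
```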

Our project builds on the existing Tandem Repeat Database and Analysis Queries tool (tdb). This tool converts ‘REPL’ style VCFs from tandem repeat (TR) callers into a database stored in Parquet format, which is compressed, well structured, and easier to parse than VCF. A handful of ‘standard’ queries and analysis notebooks currently provide useful summaries of tandem repeat results. For the Sixth Annual Structural Variant and Pangenomics Hackathon, we aimed to add new queries.

2. Simulation of mosaic variants

Mosaic variants are genetic mutations that affect only a subset of an individual’s cells rather than all of them (Jiang et al. 2019). These mutations occur after fertilization and during early development, producing a mosaic pattern in which some cells carry the genetic change while others do not. Mosaic variants can affect multiple tissues and produce a wide range of phenotypes, depending on when and where the mutation occurs during development (Biesecker and Spinner 2013).

Mosaic variants help explain the genetic risk of adult diseases, so it is vital to understand their normal, non-pathogenic incidence and mutation rates (Costantino, Nicodemus, and Chun 2021). Doing so relies primarily on variant detection approaches across sequencing platforms. Some of these approaches are customized for specific conditions, which complicates the evaluation of their accuracy and false positive rates. We created a simulation framework that mimics mosaic mutations, including substitutions, indels, and structural variants, at varied variant allele frequencies (VAFs). Our project builds on last year's group project and ensures that it is suitable for long-read sequencing files. The identification of mosaic mutations is often based on the VAF, which reflects the proportion of sequencing reads that carry a particular variant. Detecting variants with low VAF can be difficult because they may be masked by sequencing errors or occur in only a small number of cells. This is particularly important when studying complex diseases or conditions in which subtle mosaicism might influence disease onset or progression.
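To make the VAF definition concrete, the sketch below computes a VAF from read counts and the number of supporting reads one would expect at a given coverage. The function names and the example numbers are assumptions for illustration; they are not taken from the project's code.

```python
def variant_allele_frequency(alt_reads: int, total_reads: int) -> float:
    """Return the fraction of reads at a site that carry the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= alt_reads <= total_reads:
        raise ValueError("alt_reads must be between 0 and total_reads")
    return alt_reads / total_reads


def expected_alt_reads(coverage: float, vaf: float) -> float:
    """Return the expected number of variant-supporting reads at a site."""
    if coverage < 0 or not 0.0 <= vaf <= 1.0:
        raise ValueError("coverage must be >= 0 and vaf must lie in [0, 1]")
    return coverage * vaf


# Hypothetical site: 3 of 60 reads support the variant.
print(variant_allele_frequency(3, 60))
# At a hypothetical 30x coverage, a 5% VAF variant is expected on fewer
# than two reads, which is hard to separate from sequencing errors.
print(expected_alt_reads(30, 0.05))
```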

3. AMRDiscovery: Analyzing antimicrobial resistance genes in NCBI sequence read archive

Antimicrobial resistance (AMR) is a growing global health concern, driven by the overuse and misuse of antibiotics (Sugden, Kelly, and Davies 2016). Detecting and monitoring AMR genes in various environments is crucial for understanding the spread of resistance and informing public health strategies. Compared with wet lab methods, matching metagenomes against AMR genes makes it easy to survey a wide range of samples at once. The field is popular, and multiple tools and databases have been published. However, some aspects of the underlying databases remain unexplored.

Finding AMR genes usually involves two data components: a database of known AMR genes and the sequence database to be searched. The search is performed with alignment tools, including DIAMOND (Buchfink, Xie, and Huson 2015), BLAST+, HMMER, or minimap2 (Li 2018). For example, AMRFinder and ARGminer are two approaches with specific purposes. These tools are usually accompanied by curated standard databases of AMR genes. For example, the Comprehensive Antibiotic Resistance Database (CARD) is a systematically maintained database that combines the Antibiotic Resistance Ontology (ARO) with well curated AMR gene sequences and mutations (Alcock et al. 2023). The database offers a methodical approach to categorizing and understanding resistance, using separate files for each model type, FASTA data, and ARO tags linked to GenBank accessions. It also offers cross-references for primary categories within the ARO, such as the AMR gene family, target drug class, and resistance mechanism.

For genomic and metagenomic sequences, the European Nucleotide Archive (ENA) and the National Center for Biotechnology Information (NCBI) GenBank are common choices. Historically, the quantity of publicly available genomic and metagenomic sequences has grown exponentially since the beginning of centralized cloud-based hosts for genetic sequencing data (Lathe et al. 2008). The Los Alamos Sequence Database was first established as a repository for annotated biological sequences in 1979, and then relocated to the National Center for Biotechnology Information (NCBI) and renamed GenBank in 1982 (“GenBank and WGS Statistics” 2024; “National Library of Medicine” 2024; Sayers et al. 2020). This database is now part of the International Nucleotide Sequence Database Collaboration (INSDC), a collaboration between NCBI, the European Molecular Biology Laboratory (EMBL), and the DNA Data Bank of Japan (DDBJ), and as of August 2024 contains roughly 3.68 terabases of sequencing data (“GenBank and WGS Statistics” 2024). In 2009, INSDC additionally launched the Sequence Read Archive (SRA) to host raw and unprocessed reads (Katz et al. 2022). As of December 2023, this massive dataset contains roughly 50 petabases of sequencing data from a range of eukaryotic and prokaryotic hosts, as well as environmental communities (Chikhi et al. 2024). Each set of sequences carries uploader-provided metadata describing the sequencing process (e.g., assay type, sequencing instrument, library layout) and the sampled environment (e.g., organism, sampling date, geographical location).

However, SRA, the richest of these resources, has been difficult to search because of its size. In 2024, a nearly comprehensive solution called Logan was published on bioRxiv. The Logan database consists of assembled contigs and unitigs derived from a freeze of the SRA, which reduces the size and redundancy of the raw reads (384 terabytes vs. 50 petabases). Logan permits efficient large-scale alignment-based search across all SRA sequences, spanning the Tree of Life, for the first time. These assemblies, in conjunction with Amazon Web Services (AWS) infrastructure, permit large-scale alignment to a set of query protein or nucleotide sequences, via DIAMOND (Buchfink, Reuter, and Drost 2021) or minimap2 (Li 2018) respectively, within a reasonable amount of time. Leveraging the vast quantity of unprocessed data on SRA, rather than smaller annotated databases, offers the potential to detect large-scale trends of gene flow around the globe across different organisms and environments.

In this work, we aligned the genes in CARD to the Logan database to identify and catalog AMR genes present in the dataset. This allowed us to survey the prevalence, mechanisms, distribution, and other important properties of antimicrobial resistance. This work will provide valuable insights into the distribution and prevalence of AMR genes across a vast range of environments and host organisms. This approach goes beyond previous attempts to find AMR in SRA subsets by taking advantage of the sheer size of the Logan database, which contains all accumulated information from SRA to date, and by using contigs for alignment, which should help avoid the contamination issues that affect raw reads.

4. Mobile elements across species

Mobile genetic elements (e.g., transposons) are capable of relocating within a genome through cut-and-paste and copy-and-paste mechanisms. Their movement can influence gene expression, exert mutagenic effects, and drive genome evolution. In humans, they are implicated in the origin of diseases (Chénais 2022). Conversely, they hold potential for use in genetic editing, particularly in the treatment of genetic disorders, thereby underscoring the importance of their identification and annotation within the genome. In fungi, transposons confer metal resistance and contribute to genome evolution. However, the identification and annotation of mobile elements present considerable challenges due to their structural diversity, which complicates genomic mapping. Additionally, their capacity for horizontal transfer between species further complicates the determination of their function.

Starfish (Gluck-Thaler and Vogan 2024) is a recently developed modular toolkit for de novo giant mobile element discovery and annotation in fungal genomes. In an effort to support the use of starfish for other species, this project aimed to bolster accessibility and usability of starfish (v1.0.0).

5. ONT metagenome simulator

Oxford Nanopore Technologies (ONT) sequencing is rapidly becoming widely used in metagenomic studies due to its cost, long reads, and significantly improved error rate (Agustinho et al. 2024). However, microbiome data are highly heterogeneous because of variation in experimental designs, which makes designing efficient computational software challenging. As long-read sequencing technology becomes popular in metagenomics, simulated datasets with known error rates can help evaluate existing and newer bioinformatic algorithms. There is a need for an easy-to-use tool that creates reasonably realistic standard truth ONT datasets for varying microbial environments. We built MIMIC, a metagenome simulator that creates simulated ONT sequencing data by replicating the taxonomic abundances of real-world microbiome samples. In addition to providing simulated sequencing data, MIMIC also offers a simple-to-use evaluation framework for comparing the results of existing taxonomic classification methods to the known truth data, allowing for easy benchmarking across a host of different environments and error profiles.
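One way to picture "replicating the taxonomic abundances" is to split a read budget across taxa in proportion to an abundance profile. The sketch below does this with a largest-remainder rounding rule. It is not MIMIC's implementation: the function `allocate_reads`, the rounding rule, and the example profile are all assumptions made for illustration.

```python
def allocate_reads(abundances: dict[str, float], total_reads: int) -> dict[str, int]:
    """Split total_reads across taxa in proportion to their abundances.

    Uses largest-remainder rounding so the counts always sum to total_reads.
    """
    if total_reads < 0:
        raise ValueError("total_reads must be non-negative")
    total_abundance = sum(abundances.values())
    if total_abundance <= 0:
        raise ValueError("abundances must sum to a positive value")

    # Exact (fractional) share of the read budget for each taxon.
    exact = {taxon: total_reads * a / total_abundance for taxon, a in abundances.items()}
    # Start from the floor of each share.
    counts = {taxon: int(share) for taxon, share in exact.items()}
    # Hand out the leftover reads to the largest fractional remainders.
    leftover = total_reads - sum(counts.values())
    by_remainder = sorted(exact, key=lambda taxon: exact[taxon] - counts[taxon], reverse=True)
    for taxon in by_remainder[:leftover]:
        counts[taxon] += 1
    return counts


# Hypothetical abundance profile, not taken from any real sample.
profile = {"taxon_A": 0.5, "taxon_B": 0.3, "taxon_C": 0.2}
print(allocate_reads(profile, 1000))
```

The resulting per-taxon read counts could then be passed to a read simulator of choice; that step is outside the scope of this sketch.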

6. Haploblock clusters

Haplotypes are defined as sets of genomic variants that are inherited together from a single parent. In theory, the human genome consists of multiple haplotype blocks shared among individuals from all populations; however, allele frequencies differ between any two populations (Shipilina et al. 2023). Haplotype phasing estimates haplotype inheritance from genotype or sequencing data and aims to capture which genomic variation is associated with particular complex traits and common diseases, such as cancer (Garg 2023; Sakamoto, Sereewattanawoot, and Suzuki 2019) or diabetes (Sankareswaran et al. 2024; Luo et al. 2024). By estimating haplotypes, we can infer inter- and intra-population genealogical relationships, thus enhancing our understanding of the relatedness among individuals in the population, as well as the implications of a given mutation (or variation) for health.

In recent years, global initiatives have been undertaken to determine the genomic variation that underlies phenotypic similarities across different populations, such as the International HapMap Project (“The International HapMap Project” 2003). Furthermore, the increasing number of biomedical databases, such as 1000 Genomes (“A Global Reference for Human Genetic Variation” 2015), Genome in a Bottle (Zook et al. 2016), or UK Biobank (“UK Biobank” 2024), provides access to large collections of genomic data that can advance the efficiency and accuracy of methods for variant phasing and genealogical analyses. However, accurately estimating haplotypes and interpreting their implications for disease mechanisms remain challenging due to the complexity of the data and the high computational cost. Previous approaches for haplotype analysis therefore made broad assumptions and rough approximations, which could lead to inaccuracies. In contrast, new approaches for inferring the association between genomic variation and complex traits, alongside large-scale computing infrastructure, offer an opportunity to efficiently and accurately derive genealogical relationships, ancestry, causality, and risk factors for shared phenotypic traits (Browning and Browning 2023; Hofmeister et al. 2023; Leitwein et al. 2020).

During the hackathon, we aimed to design and develop a bioinformatic analysis pipeline for computing similarity matrices of intra- and inter-population haplotype blocks that takes into account both rare and common genomic variants. Here we present a proof-of-concept bioinformatic workflow to obtain haplotype blocks and to determine correlations between sets of genomic variants and genealogical relationships. We planned to use existing methods for haplotype phasing, SHAPEIT5 (Hofmeister et al. 2023), and relatedness calculation, ARG-Needle (B. C. Zhang et al. 2023), to examine how sets of genomic variants are shared across populations. We used ARG-Needle to infer genealogical relationships for two haplotype blocks: one that overlaps the human leukocyte antigen HLA-A gene (chr6:29631001-30180001) and a random haplotype block (chr6:594001-655001). We then planned to use these evolutionary relationships, in the form of ancestral recombination graphs (ARGs), which offer a promising direction in evolutionary research (Griffiths and Marjoram 1997; Lewanski, Grundler, and Bradburd 2023), to estimate similarities between the haplotype blocks across populations.
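As a rough picture of what a haplotype similarity matrix looks like, the sketch below scores each pair of phased haplotypes within one block by the fraction of variant sites at which they carry the same allele. This naive allele-sharing score is a stand-in chosen for illustration; it is not the ARG-based measure described above, and the haplotype names and 0/1 allele strings are hypothetical.

```python
def haplotype_similarity(hap_a: str, hap_b: str) -> float:
    """Fraction of sites at which two haplotypes carry the same allele.

    Haplotypes are strings over {"0", "1"} (reference/alternate allele per site).
    """
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same sites")
    if not hap_a:
        raise ValueError("haplotypes must contain at least one site")
    matches = sum(a == b for a, b in zip(hap_a, hap_b))
    return matches / len(hap_a)


def similarity_matrix(haplotypes: dict[str, str]) -> list[list[float]]:
    """Pairwise similarity matrix, with rows and columns in dict order."""
    names = list(haplotypes)
    return [
        [haplotype_similarity(haplotypes[row], haplotypes[col]) for col in names]
        for row in names
    ]


# Hypothetical phased haplotypes over four variant sites in one block.
block = {"s1_hap1": "0101", "s1_hap2": "0111", "s2_hap1": "1010"}
for name, row in zip(block, similarity_matrix(block)):
    print(name, row)
```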

7. Somatic variants in cancer

Cancer is a highly heterogeneous microevolutionary state that arises from healthy cells through a series of point mutations and large DNA rearrangements. Sporadic mutagenesis gives rise to tumor subclones with distinct sets of genomic alterations, which can promote tumor growth, metastasis, or treatment resistance. Compared with single nucleotide variants, more complex events that contribute to intratumoral genetic heterogeneity remained poorly characterised until recent efforts in deep whole genome sequencing of tumors and the development of mutation callers. In this project we focus specifically on mosaic structural variants in cancer and have designed a tool for their functional annotation that identifies which genes and biological pathways they affect. Linking variants to the affected genes may help predict how a tumor might evolve over time and, by extension, which mutations are associated with aggressive tumor behavior or how a cancer might respond to different treatments. While simple, this tool should become a stepping stone for further studies on the contribution of rare variants to the emergence of treatment-resistant subclones and the recurrence of disease.
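The core of such functional annotation is finding which genes an SV overlaps. A minimal sketch of that step is shown below, assuming half-open coordinates and a linear scan over a small gene list. The `Interval` type, the gene names, and the coordinates are hypothetical and are not taken from the project's tool.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: str


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def genes_hit_by_sv(sv: Interval, genes: list[Interval]) -> list[str]:
    """Names of genes whose span overlaps the structural variant."""
    return [gene.name for gene in genes if overlaps(sv, gene)]


# Hypothetical gene annotation and SV, for illustration only.
genes = [
    Interval("chr1", 150, 300, "GENE_A"),
    Interval("chr1", 200, 400, "GENE_B"),
    Interval("chr2", 100, 200, "GENE_C"),
]
deletion = Interval("chr1", 100, 200, "sv_1")
print(genes_hit_by_sv(deletion, genes))
```

For genome-scale annotations, an interval tree or a sorted sweep would replace the linear scan; the overlap test itself stays the same.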

8. Rapid phenotypic labeling of variants

Structural variants (SVs) represent deviations from a reference genome sequence, typically spanning more than 50 base pairs (bp). These variations can have significant implications for understanding genetic diversity and the mechanisms underlying various phenotypes. Larger structural variants also occur in human genomes. In particular, human chromosomes can carry deleted, duplicated, inverted, or inserted segments, and/or segments translocated from other chromosomes (Figure 1).

Figure 1. Types of structural variants (“Human Genomic Variation” 2023) (Last updated: February 1, 2023).

For example, Charcot-Marie-Tooth disease type 1A (CMT1A), which results in nerve damage in the extremities, is caused by a duplication of the peripheral myelin protein 22 (PMP22) gene on human chromosome 17 (Lupski et al. 1991; Stavrou and Kleopa 2023). The condition affects at least 17 out of every 100,000 people worldwide as a result of the same SV on the PMP22 gene (Ma et al. 2023). With the affected region localized, preclinical studies in animal models aim to reverse the condition completely through gene silencing.

This project aims to develop a robust pipeline for detecting and cataloging identical SVs across different samples and databases, ultimately linking them to specific phenotypes. The primary goal of this study is to identify and analyze SVs in novel and known genes, as well as established population SVs, to uncover new biological processes and associations. By cross-referencing SVs with phenotypic data, this pipeline seeks to establish a more comprehensive understanding of genotype-phenotype correlations.
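Deciding whether two SV calls from different samples or databases are "identical" usually needs some tolerance in the breakpoints. One common heuristic, sketched below, requires the same SV type and a minimum reciprocal overlap. The 0.8 threshold, the `SV` type, and the coordinates are assumptions for illustration and do not describe the pipeline's actual matching rule.

```python
from typing import NamedTuple


class SV(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    svtype: str  # e.g. "DEL", "DUP", "INV"


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Smaller of the two fractions of each SV covered by the other."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        # No shared bases (also covers zero-length calls such as insertions,
        # which need a different matching rule than this sketch provides).
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def same_sv(a: SV, b: SV, min_overlap: float = 0.8) -> bool:
    """Treat two calls as the same SV if type matches and overlap is high."""
    return a.svtype == b.svtype and reciprocal_overlap(a, b) >= min_overlap


# Hypothetical calls from two samples.
call_1 = SV("chr17", 1000, 2000, "DUP")
call_2 = SV("chr17", 1050, 2050, "DUP")
print(reciprocal_overlap(call_1, call_2), same_sv(call_1, call_2))
```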

Methods

1. Tandem repeats

Data

We used a tandem repeat database (TDB) containing 105 samples of diverse ancestries from the Human Pangenome Reference Consortium (HPRC) (Liao et al. 2023; Dolzhenko et al. 2024). The population distribution of the 105 individuals in the TDB database includes 52 African ancestry (AFR), 56 American ancestry (AMR), 32 East Asian ancestry (EAS), 48 South Asian ancestry (SAS), and 8 unknown ancestry (UNK). The data encompass 937,122 tandem repeat (TR) loci spanning a total of 121,698,022 base pairs, which represents approximately 4% of the GRCh38 reference genome. Additionally, we used the Adotto TR catalog (v0.3) (“Project Adotto Tandem-Repeat Regions and Annotations” 2024).

Queries

We defined four queries for the hackathon project (Figure 2).

Figure 2. Workflow for the Tandem Repeat project and the analysis of queries.

During the hackathon, four queries were completed: GTF annotation, population structure and PCA, outlier length, and TR structure.

The first query was a GTF annotation. There is an established population structure notebook (https://github.com/ACEnglish/tdb/blob/develop/notebooks/PopulationStructure.ipynb) that identifies loci with >= 20 alleles and plots a clustermap of how similar the samples’ alleles are. The notebook includes clustering of the HPRC example data, which reconstructs the population structure. This query selects loci with at least 20 alleles and leverages the length polymorphism queries to obtain an informative set of loci. It also reports samples with their clusterID, which reveals more information for understanding population structure.

The second query studied population structure with PCA. Although an example notebook that performs PCA on a tdb already exists, this query can be expanded to perform PCA on methylation data and relate population structure to methylation.

The third query concerned length outliers. We used it to find TR alleles with anomalous lengths and to explore length outliers. This query will help incorporate other approaches for finding length outliers.

Finally, the fourth query concerned TR structure. Given multiple TR alleles over a locus, we can annotate the TR motifs on each sequence and perform a multiple sequence alignment (MSA). We can then consolidate the alignments into a ‘consensus’ structure of the repeats over the spans. This output should allow more detailed analysis of length outliers, because we would no longer look only at the length of the sequence over the locus but would have motifs and copy numbers aligned across alleles. A lightweight notebook that leverages abpoa and tr-solve to build some of this information is already available. However, we want to replace tr-solve for annotating motifs. TRF is an option, but it redundantly annotates spans, which would make deconvolution of the repeat structure over multiple sequences difficult.
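Two of the ideas above, selecting loci with at least 20 alleles and flagging anomalous allele lengths, can be sketched in plain Python. This does not use the tdb API or its table schema: the `(locus_id, sequence)` row layout is hypothetical, and the outlier rule (a modified z-score based on the median absolute deviation, with the conventional 3.5 cutoff) is a technique substituted here for illustration, not the method used in the tdb notebooks.

```python
from collections import defaultdict
from statistics import median


def loci_with_min_alleles(allele_rows, min_alleles=20):
    """Return sorted locus IDs that have at least min_alleles distinct alleles.

    allele_rows: iterable of (locus_id, allele_sequence) pairs (hypothetical layout).
    """
    alleles_by_locus = defaultdict(set)
    for locus_id, sequence in allele_rows:
        alleles_by_locus[locus_id].add(sequence)
    return sorted(
        locus for locus, alleles in alleles_by_locus.items() if len(alleles) >= min_alleles
    )


def length_outliers(lengths, threshold=3.5):
    """Return lengths whose modified z-score (MAD-based) exceeds the threshold."""
    lengths = list(lengths)
    center = median(lengths)
    mad = median(abs(value - center) for value in lengths)
    if mad == 0:
        # No spread estimate is available; this sketch reports no outliers.
        return []
    return [value for value in lengths if 0.6745 * abs(value - center) / mad > threshold]


# Hypothetical allele lengths (bp) at one locus.
print(length_outliers([30, 31, 29, 30, 32, 30, 28, 31, 300]))
```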

Implementation

We used tdb v0.2.0, which creates and analyzes genomic databases of tandem repeat sequences. It is available through the tdb GitHub releases.

Operation

We installed tdb by cloning its repository and installing it via Python. To process the data, we created a tdb-compatible file from a VCF and queried allele counts using the create and query commands. We merged tdb files using the merge command, which combines two databases with higher memory allocation. Additionally, we added extra files using the merge --into option. For larger datasets containing more than ten tdb files, we used the bigmerge command to query and manage tandem repeat databases effectively.

2. Simulation of mosaic variants

Our simulation framework models mosaic mutations at various variant allele frequency (VAF) rates, including substitutions, indels, and structural variants, using two tools: SpikeVar and TykeVar.

• SpikeVar automates the merging of two datasets at user-defined coverage or rates, verifies variant-calling mutations, and outputs a benchmarking-ready VCF file with accurate VAF annotations.
• TykeVar modifies reads within a single sample to simulate mosaic mutations while preserving haplotype structures. Altered read IDs are removed from the BAM file, aligned to the reference genome, and merged back. The final output consists of a modified BAM file and a VCF file with annotated mosaic variants.

As illustrated in Figure 3, the SpikeVar pipeline generates a BAM file containing mixed sequencing reads. These reads originate from two samples combined in user-defined ratios, simulating mosaic VAF. The resulting VCF file annotates confirmed mosaic variant locations, providing variant positions and supporting information. The TykeVarMerger refines these outputs by integrating modified reads into the dataset, resulting in a filtered BAM with the original read IDs and a VCF containing verified mosaic variant records.
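The relation between the mixing ratio and the simulated VAF can be illustrated with a short calculation. The sketch below is not part of SpikeVar: the function names are hypothetical, and it assumes that the mixing ratio is the fraction of reads contributed by the spiked-in sample at a site and that coverage is otherwise comparable between the two samples.

```python
def expected_vaf(spike_fraction, donor_vaf=0.5, base_vaf=0.0):
    """Expected VAF at one site after mixing reads from two samples.

    spike_fraction: fraction of reads in the mixed BAM that come from the
        spiked-in (donor) sample (assumption: uniform across the site).
    donor_vaf: VAF of the variant in the donor sample (0.5 for a
        heterozygous germline variant, 1.0 for a homozygous one).
    base_vaf: VAF of the same variant in the base sample (0.0 if absent).
    """
    if not 0.0 <= spike_fraction <= 1.0:
        raise ValueError("spike_fraction must be between 0 and 1")
    return spike_fraction * donor_vaf + (1.0 - spike_fraction) * base_vaf


def spike_fraction_for_vaf(target_vaf, donor_vaf=0.5):
    """Donor read fraction needed to reach target_vaf when the variant
    is absent from the base sample."""
    if not 0.0 < donor_vaf <= 1.0:
        raise ValueError("donor_vaf must be in (0, 1]")
    if not 0.0 <= target_vaf <= donor_vaf:
        raise ValueError("target_vaf cannot exceed donor_vaf")
    return target_vaf / donor_vaf


if __name__ == "__main__":
    # A heterozygous donor variant spiked in at 10% of reads.
    print(expected_vaf(0.10))
    # Donor fraction needed to simulate a 5% mosaic variant.
    print(spike_fraction_for_vaf(0.05))
```

Under these assumptions, a heterozygous donor variant mixed in at 10% of reads is expected at a VAF of 5%, which is why the re-genotyping step is needed to confirm the realized VAF in the mixed BAM.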

Figure 3. Overview of SpikeVar and TykeVar workflows.

(A) The SpikeVar pipeline simulates mosaic variant allele frequency (VAF) by spiking mutations from one sample into another, creating a mixed dataset for variant callers. (B) The TykeVar pipeline inserts mosaic mutations into single-sample reads to create a modified dataset with original mosaic variations for accurate variant detection.

This integrated framework ensures reproducibility, scalability, and compatibility across sequencing datasets, facilitating robust and accurate benchmarking of mosaic variant detection methods (Deb et al., 2024).

Implementation

Our simulation framework integrates two primary tools, SpikeVar and TykeVar, to model mosaic mutations.

• SpikeVar automates the merging of datasets, verifies variant-calling mutations, and generates VCF files with accurate VAF annotations. It employs scripts such as 2b_regenotyping_main.sh, 2b_SNV.sh, 2b_SV.sh, 2b_vf_short.sh, 2b_vf_long.sh, 2b_vaf_filtering.sh, and 2b_vaf_merge.sh for distinguishing SNVs and SVs, processing sequencing data, and generating merged VCF outputs (Figure 3(A)).
• TykeVar modifies reads in single-sample datasets, removing altered read IDs, aligning modified reads to the reference genome, and merging them back into the dataset. This results in BAM and VCF files with accurate truth sets for mosaic variants (Figure 3(B)).

Both tools work together to create reliable datasets for benchmarking mosaic variant detection.

Operation

The following are the minimal system requirements and an overview of the workflow for running the SpikeVar and TykeVar pipelines:

System Requirements:

• Operating System: Linux (Ubuntu 20.04 or later recommended)
• Processor: Multi-core CPU (Intel Xeon or equivalent recommended)
• Memory: Minimum 64 GB RAM (128 GB recommended for larger datasets)
• Storage: At least 1 TB of free disk space

Software Dependencies:

• Bash shell
• Python (version ≥3.8)
• SAMtools (version ≥1.10)
• BCFtools (version ≥1.10)
• BEDTools (version ≥2.30)
• Variant callers (e.g., Mutect2, FreeBayes)

Workflow overview

SpikeVar Workflow:

1. Start with 2b_regenotyping_main.sh.
2. Process SNVs (2b_SNV.sh) and SVs (2b_SV.sh).
3. Use 2b_vf_short.sh or 2b_vf_long.sh for short or long-read processing.
4. Apply VAF filtering (2b_vaf_filtering.sh).
5. Generate the final “Merged Re-genotyped VCF” (2b_vaf_merge.sh).

TykeVar Workflow:

1. Remove altered read IDs from the original BAM file.
2. Align modified reads to the reference genome.
3. Merge modified reads into the filtered BAM.
4. Generate final BAM and VCF files with mosaic variant truth sets.

These workflows ensure reproducibility, scalability, and compatibility across sequencing datasets, facilitating accurate benchmarking of mosaic variant detection tools.

3. AMRDiscover

The prokaryotic subset of the Logan database was downloaded on 25 August 2024. Additionally, the CARD (version-3.3.0) database containing curated sequences of known AMR genes was obtained. Sequences from the CARD database were aligned to the Logan unitigs/contigs by minimap2 (Li 2018) with default parameters and the following arguments: `--sam-hit-only` and `-a`. We focused on high-confidence alignments that suggest the presence of AMR genes. Then, the results were filtered and curated using the NM tag in the SAM format, considering matches of at least 100 bases and identity of 80 bases. We benefited from the metadata of SRA accessions including location and date of samples.
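The NM-based filtering step can be sketched as follows. This is an illustration and not the project's filter_parse_script.sh: the function names are hypothetical, the number of matched bases is taken as the alignment block length (CIGAR M/=/X/I/D operations) minus NM, and the identity cutoff is interpreted as 80% of the alignment block, which is an assumption about the thresholds stated above.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def alignment_stats(sam_line):
    """Return (matched_bases, identity) for one SAM record, or None if the
    record has no CIGAR string or no NM tag."""
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return None
    cigar = fields[5]
    nm = None
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            nm = int(tag[5:])
            break
    if cigar == "*" or nm is None:
        return None
    # Alignment block length: aligned columns plus inserted and deleted bases.
    block_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "M=XID")
    if block_len == 0:
        return None
    matches = block_len - nm  # NM counts mismatches and gap bases
    return matches, matches / block_len


def keep_alignment(sam_line, min_matches=100, min_identity=0.80):
    """Apply the two thresholds; both defaults are assumptions (see lead-in)."""
    if sam_line.startswith("@"):  # skip SAM header lines
        return False
    stats = alignment_stats(sam_line)
    if stats is None:
        return False
    matches, identity = stats
    return matches >= min_matches and identity >= min_identity
```

A typical use would be to stream the minimap2 output line by line and write out only the records for which `keep_alignment` returns True.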

We identified the number of alignment hits of AMR genes in the isolates per year from 2000 to 2024 in the United States. We visualized the results spatially using geopandas (v1.0.1) and mpl_toolkits from Matplotlib (v3.8). The workflow of the project is presented in Figure 4.

Figure 4. The workflow of the AMR discovery project.

The pipeline consists of steps for analyzing the whole SRA database using the LOGAN contigs and CARD AMR genes. The output results are reported as the number of alignment hits for a country in a certain year.

Data

The Logan unitigs are available on AWS (https://registry.opendata.aws/pasteur-logan/). The antibiotic resistance genes were downloaded from the CARD website (https://card.mcmaster.ca/download/).

Implementation

Our project comprised three parts: alignment, filtering and analysis. The alignment was done with minimap2 using default parameters. After alignment, the files were downloaded to a local machine with AMRdiscover.sh and filtered by filter_parse_script.sh. The old_alignment_parsing.sh script included the parallelization of filter_parse_script.sh. The analysis was diverse because each part was completed by different members. Geographic visualization was done with Python, and the scripts are stored on our GitHub page under “AMRdiscover/scripts/sql_Athena”. Other visualization scripts were written in Python 3.8 or R 4.2, for instance species_gene_counts_plots.ipynb and plotting_mechanism.R.

Operation

Our analysis was performed on Linux Ubuntu 20.04. Alignment and its processes required minimap2 and samtools. Key tools on Linux were “awk” and GNU “parallel”. We used Python 3.8 and the following packages: pandas, matplotlib, geopandas, numpy, mpl_toolkits.axes_grid. R packages were tidyverse, RColorBrewer and khroma.

4. Mobile elements across species

The current starship analysis (v1.0.0) requires executing seven individual bash scripts (https://github.com/egluckthaler/starfish). To simplify execution, starfishDiscovery provides a docker container.

Implementation

To use the Docker container, clone the starfish repository and, from the directory containing the Dockerfile, run the following command: `docker build -t ${docker_username}/starfish --platform linux/amd64 .` (the trailing `.` sets the build context to the current directory). This builds a Docker image that includes all the software needed to run starfish. To use the container for your analysis, run `docker run -it -v ${path/to/your/data}:${path/in/container} ${docker_username}/starfish`. The -it flags provide an interactive session with the container, like a normal shell session, and the -v flag mounts the supplied host directory at the given path inside the container. This is important to enable access to the results after the analysis is finished.

Operation

Our analysis was performed on Linux Ubuntu 20.04. The starfish workflow required Docker (version 20.10 or later) for containerization. Key tools within the Docker container included bash for script execution, Python 3.8 with the following packages: pandas, matplotlib, and numpy, and Snakemake (version 7.19) as the workflow runner. Additionally, the container relied on pre-installed bioinformatics tools necessary for starfish analysis. Input data and results were managed using the Docker -v flag for directory mounting.

5. ONT metagenome simulator

Long-read ONT sequencing is steadily gaining popularity in metagenomic studies. However, due to platform-specific challenges such as a high error rate and chimeric artefacts, customised bioinformatic tools are needed to effectively characterize microbial composition. We therefore designed an easy-to-use workflow to create simulated ONT reads from existing ONT metagenomic studies (Yang et al. 2017).

Implementation

The workflow implements two distinct steps: simulation and analysis. In the simulation step, the pipeline takes ONT reads from a real metagenome and taxonomically profiles the sample using Lemur and Magnet. More specifically, Lemur first generates relative abundance and taxonomic profiles using a marker gene database and the Expectation-Maximization (EM) algorithm (Sapoval et al. 2024). The profile is then fed into Magnet (“Mimic/README.md at Main · collaborativebioinformatics/Mimic” 2024), which downloads all of the reference genomes and performs competitive read-alignment in order to determine final presence/absence calls. Abundances from Lemur are then mapped to the present genomes called by Magnet to give a final set of species and abundances to use for simulation. The genomes and their abundances are then inputted into Nanosim, along with the number of desired reads to output. Nanosim outputs a simulated file in the FASTA format, as well as error profiles.
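The step that maps Lemur abundances onto the genomes called present by Magnet can be sketched as below. This is a minimal illustration and not the Mimic implementation: the function name and the dictionary/set inputs are hypothetical, and renormalizing the retained abundances so that they sum to 1 is an assumption about how the final profile is prepared for the simulator.

```python
def abundances_for_simulation(lemur_abundance, magnet_present):
    """Keep only taxa that Magnet calls present and rescale their
    Lemur abundances to sum to 1.

    lemur_abundance: dict mapping taxon ID -> relative abundance.
    magnet_present: set of taxon IDs with a 'present' call.
    """
    kept = {
        taxon: abundance
        for taxon, abundance in lemur_abundance.items()
        if taxon in magnet_present and abundance > 0
    }
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no taxa left after applying presence calls")
    # Renormalization is an assumption, not a documented Mimic behaviour.
    return {taxon: abundance / total for taxon, abundance in kept.items()}


if __name__ == "__main__":
    profile = {"taxon_A": 0.5, "taxon_B": 0.3, "taxon_C": 0.2}
    present = {"taxon_A", "taxon_C"}
    print(abundances_for_simulation(profile, present))
```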

The combination of Lemur and Magnet pipelines not only improves recall and precision, but it is easy to deploy as it requires limited computational resources. Apart from simulated reads, Nanosim also generates truth tables built from the simulated reads. These tables contain both taxon labels as well as relative abundance. After simulation, Kraken2 can be run on the simulated reads and the resulting relative abundances are evaluated against the truth table, resulting in precision and recall metrics.

Operation

Mimic is openly available for use at https://github.com/collaborativebioinformatics/Mimic. Mimic has been tested on Linux-based systems and can be run by following installation instructions provided in the repository. The pipeline is implemented in Python and follows the workflow described in Figure 5.

Figure 5. The workflow of the ONT Metagenome Simulator project.

MIMIC simulates Oxford Nanopore (ONT) reads from any existing metagenomic community by (1) taking in an ONT FASTQ file and analyzing it with Lemur and Magnet; (2) simulating reads with Nanosim based on the profile produced by Lemur and Magnet; and (3) running kraken2/sourmash for taxonomic classification. Truth tables are generated for the simulated data based on real microbiome samples and Lemur.

6. Haploblock clusters

Implementation

In the first step, we downloaded genomic data in VCF format from 1000Genomes (“A Global Reference for Human Genetic Variation” 2015). Initially, we planned to use VCF files for three populations (Dai Chinese (CDX), Puerto Rican from Puerto Rico (PUR) and British from England and Scotland (GBR)); however, for the purpose of the hackathon, we focused on one population, Chinese Dai in Xishuangbanna, China (CDX) (“Data Portal,” n.d.). The CDX data contained 109 individual samples whose VCF files had already been phased with SHAPEIT2 (Delaneau, Zagury, and Marchini 2013), which facilitated our hackathon effort and allowed us to move directly to the next step without phasing haplotypes. Since we planned to use ARG-Needle in the next step, which requires HAP files as input, we used Plink2 to convert the phased VCF files to HAP files (command: `plink2 --vcf phased.vcf --export hap --out new_filename_prefix`). We subsequently split these files into haplotype blocks, which we defined as parts of the genome between recombination hotspots using the b36 genetic map (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/technical/reference/genetic_map_b36.tar.gz). We acknowledge that we used the old b36 genetic map instead of a newer one; we were not able to run the software with the hg38 genetic map (https://genome.ucsc.edu/cgi-bin/hgTables) during the hackathon. Therefore, to test the proof-of-concept workflow, we proceeded with the old genetic map and obtained 2089 haplotype blocks for chromosome 6 of the CDX population. Furthermore, we planned to use ARG-Needle (https://github.com/palamaraLab/arg-needle-lib) (Zhang et al., 2023) to infer genealogical relationships between two haplotype blocks: a haplotype block that overlaps with the human leukocyte antigen HLA-A gene (chr6:29631001-30180001) and a random haplotype block (chr6:594001-655001).

For that purpose, we used the HAP files corresponding to the first haplotype block of chromosome 6 as input for ARG-Needle (`arg_needle --hap_gz CDX_chr6_HLA.hap.gz --map genetic_map_b36 --chromosome 6 --out CDX_chr6_HLA --mode sequence`). We noticed that the input data did not follow the specifications required by the software (e.g., the genetic map did not contain the required number of sites, even if lifted to the hg38 reference genome using the UCSC Genome Browser (Navarro Gonzalez et al. 2021)); however, we successfully obtained the output ARG for a small fraction of the data. As a result, we produced an ARGN file and used the tskit library (https://tskit.dev/tutorials/viz.html; https://github.com/tskit-dev/tskit) in Python to convert the file into a tskit.TreeSequence object for visualization and analysis.
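The definition of haplotype blocks as regions between recombination hotspots can be sketched in a few lines. The code below is an illustration only and not the script used during the hackathon: the function names are hypothetical, the genetic map is assumed to be available as (position, recombination rate in cM/Mb) pairs, and the rate threshold that defines a hotspot is an arbitrary placeholder.

```python
def haplotype_blocks(genetic_map, region_start, region_end, hotspot_rate=10.0):
    """Split [region_start, region_end) into blocks at hotspot positions.

    genetic_map: iterable of (position, rate_cM_per_Mb) pairs.
    hotspot_rate: rate at or above which a map position is treated as a
        hotspot (placeholder value, not taken from the article).
    Returns a list of half-open (start, end) intervals.
    """
    boundaries = sorted(
        {pos for pos, rate in genetic_map
         if rate >= hotspot_rate and region_start < pos < region_end}
    )
    edges = [region_start] + boundaries + [region_end]
    return [(edges[i], edges[i + 1]) for i in range(len(edges) - 1)]


def assign_block(position, blocks):
    """Return the block containing a variant position, or None."""
    for start, end in blocks:
        if start <= position < end:
            return (start, end)
    return None


if __name__ == "__main__":
    toy_map = [(100, 0.5), (200, 12.0), (300, 1.0), (400, 25.0)]
    blocks = haplotype_blocks(toy_map, 0, 500)
    print(blocks)
    print(assign_block(250, blocks))
```

In this simplified form, every map position above the threshold becomes a block boundary; adjacent high-rate positions would produce very short blocks, so a practical version would likely merge neighbouring hotspot positions first.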

Operation

During the hackathon, we developed a prototype bioinformatic workflow to calculate the similarities between haplotype blocks derived from population genomic data (Figure 6). The workflow designed and developed during the hackathon includes haplotype phasing, genealogical relationship inference and haplotype block similarity estimation.

Figure 6. Workflow of the Haploblock Clusters project.

We also noticed that the calculations required substantial computational resources; therefore, we performed all calculations on a DNAnexus Cloud Workstation (16 CPUs, 128 GB of memory, 600 GB of storage) (https://documentation.dnanexus.com/developer/cloud-workstation).

Considering the short timeframe of the hackathon (3 days), we focused our effort only on the proof-of-concept mentioned above. However, we expect that the workflow can be further extended to include the rest of the haplotype blocks of all chromosomes from the CDX population, as well as other populations from 1000Genomes and other large-scale datasets (e.g., GIAB, UK Biobank). In addition, we aimed to build an automated and efficient DNAnexus Workflow (https://documentation.dnanexus.com/developer/workflows) that would take VCF files as input and generate a similarity matrix to compare haplotype blocks.

7. Somatic variants in cancer

The MoVana pipeline (MOsaic structural Variants ANnotation in cAncer) is designed to select mosaic events based on their allele frequency (AF), annotate them with overlapping genes, and perform gene set enrichment analysis to infer functional impact (Figure 7).

Figure 7. Flowchart of the MoVana pipeline.

Data

Our workflow involved a publicly available dataset of SV calls from the International Cancer Genome Consortium (J. Zhang et al. 2019). The dataset contains over 71,000 reported deletions, duplications, inversions, and translocations, with the latter excluded for simplicity. To compensate for the fact that the dataset has no estimated AF values, we simulated a distribution of hypothetical AFs. Under a neutral evolutionary model most of the events in a tumor have low AF and belong to the so-called “neutral tail” of the distribution, while true clonal events cluster towards 0.5 AF of heterozygous variants (Hsieh et al. 2020).

Implementation

Our events of interest are subclonal events with lower AFs; therefore, the first step of the pipeline involves filtering the data to include only putatively mosaic events. BCFtools is used to filter out the entries above the user-specified threshold, set to 0.4 in this example. In the next step of the pipeline, known SV breakpoints and gene coordinates are used to find overlaps with bedtools and to annotate each event with the respective affected genes. An extra filtering step is required before the gene enrichment analysis, as the disruptive effect of various rearrangements on gene function depends on the type of SV. Thus, the workflow includes a filtering step based on the SV type, and its output can be submitted for gene set enrichment analysis or searched against known genes implicated in cancer.
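The three selection steps (AF threshold, gene overlap, SV type) can be sketched in plain Python. This illustrates the logic only and does not reproduce the BCFtools and bedtools commands used by MoVana: the function names and the dictionary-based records are hypothetical, and coordinates are assumed to be half-open intervals as in BED files.

```python
def select_mosaic(svs, max_af=0.4):
    """Keep events whose AF does not exceed the threshold (0.4 in the text)."""
    return [sv for sv in svs if sv["af"] <= max_af]


def overlapping_genes(sv, genes):
    """Names of genes whose interval overlaps the SV on the same chromosome."""
    return [
        gene["name"]
        for gene in genes
        if gene["chrom"] == sv["chrom"]
        and gene["start"] < sv["end"]
        and sv["start"] < gene["end"]
    ]


def genes_by_sv_type(svs, genes, sv_type, max_af=0.4):
    """Sorted gene list affected by putatively mosaic SVs of one type."""
    affected = set()
    for sv in select_mosaic(svs, max_af):
        if sv["type"] == sv_type:
            affected.update(overlapping_genes(sv, genes))
    return sorted(affected)


if __name__ == "__main__":
    # Toy records for illustration; not taken from the ICGC dataset.
    svs = [
        {"chrom": "chr1", "start": 100, "end": 500, "type": "DUP", "af": 0.1},
        {"chrom": "chr1", "start": 1000, "end": 2000, "type": "DUP", "af": 0.5},
        {"chrom": "chr2", "start": 100, "end": 500, "type": "DEL", "af": 0.2},
    ]
    genes = [
        {"name": "GENE_A", "chrom": "chr1", "start": 400, "end": 800},
        {"name": "GENE_B", "chrom": "chr1", "start": 1500, "end": 1800},
        {"name": "GENE_C", "chrom": "chr2", "start": 50, "end": 120},
    ]
    print(genes_by_sv_type(svs, genes, "DUP"))
```

The resulting gene list is the kind of input that would then be passed to the enrichment step.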

Operation

System requirements:

➔ Operating System: Linux (Ubuntu 22.04 or later recommended)
➔ Processor: Multi-core Intel, AMD or ARM CPU
➔ Memory: Minimum 16GB RAM (32GB recommended for larger datasets)
➔ Storage: At least 4GB of free disk space
➔ Software Dependencies:
- ♦ Bash shell
- ♦ Java (version 17 or later)
- ♦ Python (version ≥3.9)
- ♦ BCFtools (version ≥1.17)
- ♦ BEDTools (version ≥2.30)
- ♦ Cromwell (version 87 or later)

Workflow overview:

• MoVana workflow
To run the workflow, execute the MoVana_Workflow.wdl file located in the WDL directory using Cromwell, as mentioned on the GitHub page.
- ○ Input data preparation
  - ▪ Start with script_1.py
  - ▪ This will generate simulated AF values for the input dataset.
- ○ Mosaic event selection
  - ▪ Filter the SVs with bcftools_filter_VAF_2.sh to retain only the selected mosaic events in the input file.
- ○ Subset random samples
  - ▪ Subset 1000 random samples from the VCF file, by running script_3.py
- ○ Use bedtools_intersect_genes_4.sh to identify SVs that overlap with the genes in the sample.
- ○ get_genes_for_GSEA_5.sh outputs the specified SV type and lists the affected genes for gene set enrichment analysis (GSEA).
- ○ GSEA
  - ▪ Run script_GSE_6.py to perform GSEA on the affected gene list.

8. Rapid Phenotypic Labeling of Variants

Implementation

The project consisted of three parts that constitute a workflow shown in Figure 8:

• Population SV Detection: The pipeline will accurately identify common structural variants (SVs) across multiple datasets, ensuring consistency and reliability in detecting both known and novel variants.
• Phenotype Association: Each identified SV will be linked to phenotypic data such as ClinVar, allowing the correlation of specific genetic variations with particular traits or diseases.
• VCF File Output: The results will be condensed into an annotated variant call format (VCF) file, summarizing the detected SVs and their associated phenotypes. Users can then input a patient ID to retrieve potential phenotypic outcomes based on the identified SVs.

Figure 8. Workflow of SVeedy.

Operation

We gained access to a collection of VCFs created to find Tandem Repeats (TRs) (A. C. English et al. 2024) from 86 haplotypes accumulated from Garg et al. (2020), Ebert et al. (2021), Jarvis et al. (2022) and Wang et al. (2022). To effectively assess relatedness between SVs, we need to set a similarity percentage threshold. For example, if obesity is associated with a 100 bp SV compared to the population reference, we would want to determine whether 80 out of the 100 bp (80%) are the same. This is one of the goals of the Truvari (v4.2.2) software (A. C. English et al. 2022), whose authors explain that although similar SVs may be present in different samples, they can occur at different loci along the genome. They also caution that over-filtering with the collapse command may remove important regions of the SV. Additionally, we need to carefully define the range of each SV and consider whether to include, for example, a 5 bp buffer on either side of an SV to account for unique alleles. At this point, we could run SURVIVOR (v1.0.7) (Sedlazeck et al. 2017) to analyze the VCF data. Finally, we used OpenCRAVAT (v2.8.0) (Pagel et al. 2019) to annotate the VCF file using the ClinVar and gnomAD databases and the hg38 genome reference. We removed a problematic line of the VCF header (FILTER/COV) before running the collapsed SV VCF through OpenCRAVAT. We also added the cosmic, gnomad_gene, and clinvar_acmg annotators.
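The similarity threshold and positional buffer discussed above can be made concrete with a small sketch. It is not Truvari's matching algorithm: the function names are hypothetical, similarity is reduced to a simple size ratio, and the 0.8 threshold and 5 bp buffer are the example values from the text used as defaults.

```python
def size_similarity(len_a, len_b):
    """Ratio of the shorter to the longer SV length (1.0 = same size)."""
    if len_a <= 0 or len_b <= 0:
        raise ValueError("SV lengths must be positive")
    return min(len_a, len_b) / max(len_a, len_b)


def overlaps_with_buffer(start_a, end_a, start_b, end_b, buffer_bp=5):
    """True if the two intervals overlap or lie within buffer_bp of each other."""
    return start_a - buffer_bp <= end_b and start_b - buffer_bp <= end_a


def same_sv(sv_a, sv_b, min_similarity=0.8, buffer_bp=5):
    """Treat two (start, end) SVs as the same event if they are close in
    position and similar in size. Both thresholds are illustrative."""
    (start_a, end_a), (start_b, end_b) = sv_a, sv_b
    close = overlaps_with_buffer(start_a, end_a, start_b, end_b, buffer_bp)
    similar = size_similarity(end_a - start_a, end_b - start_b) >= min_similarity
    return close and similar


if __name__ == "__main__":
    # A 100 bp SV versus an 80 bp SV nested inside it: similarity 0.8.
    print(same_sv((1000, 1100), (1010, 1090)))
    # Two 100 bp SVs separated by a 3 bp gap, rescued by the 5 bp buffer.
    print(same_sv((1000, 1100), (1103, 1203)))
```

A size ratio alone does not compare the underlying sequences, so it should be read as a first filter rather than as evidence that two alleles are identical.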

After using OpenCRAVAT to annotate the collapsed SVs, individual structural variants were called to associate a patient with an ontology/diagnosis. At this point, all allele frequencies and structural variant occurrences could be analyzed in R.

System requirements: Laptop for data visualization (in R), HPC cluster for SV clustering and annotation

Operating system: Linux (HPC)

Processors: 32 CPUs (single allocated slurm node on HPC cluster)

Memory: 80 GB RAM (single allocated slurm node on HPC)

Software dependencies: Truvari v4.2.2, SURVIVOR v1.0.7, OpenCRAVAT v2.8.0, bcftools v1.9

Workflow overview:

1. Convert input SV files from bcf to vcf format if necessary using bcftools convert (-O v).
2. Filter out SVs under 50 bp in length using bcftools view (-i "SVLEN>50").
3. Bgzip compress and index filtered SV VCF using bgzip and tabix.
4. Collapse SVs with truvari keeping the most common variant in each case (-k common).
5. Remove VCF header line that prevents OpenCRAVAT from running using bcftools annotate (-x "FILTER/COV").
6. Annotate resulting VCF with gnomAD, ClinVar, and COSMIC databases given hg38 reference using OpenCRAVAT (-l hg38 -a gnomad gnomad_gene clinvar clinvar_acmg cosmic cosmic_gene -t text excel).

Results/Use cases/Operation

1. Tandem repeats

GTF annotation

We used the reference GTF annotation file from GENCODE (v46) to annotate the TR loci in our database. This annotation provides information on whether a particular TR region is located within an exon, a gene, or in intergenic space. By adding this information, we were able to analyze the varying strengths of TRs across different regions and to assess their impact on population structure prediction.

Length polymorphism score

To calculate the length polymorphism score, we first assessed the allele frequencies and allele counts of individual TRs for each ancestry group separately. The length polymorphism score is a per-locus measure of the proportion of distinct alleles by length relative to the total number of alleles measured at that locus. For the length polymorphism score we used the query `tdb query len_poly_score hprc_105.tdb > result.txt`.
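The definition above can be written out as a short function. This is a sketch of the stated definition and not the code behind the tdb query: the function name and input layout are hypothetical, and the score is returned as a proportion between 0 and 1 (whether tdb reports it on a different scale, such as a percentage, is not assumed here).

```python
def length_polymorphism_score(allele_lengths, allele_counts):
    """Distinct observed allele lengths divided by total alleles at a locus.

    allele_lengths: length (bp) of each allele sequence at the locus.
    allele_counts: number of times each of those alleles was observed.
    """
    if len(allele_lengths) != len(allele_counts):
        raise ValueError("lengths and counts must align one-to-one")
    total = sum(allele_counts)
    if total == 0:
        return 0.0
    # Alleles with different sequences but equal length count once.
    observed_lengths = {
        length
        for length, count in zip(allele_lengths, allele_counts)
        if count > 0
    }
    return len(observed_lengths) / total


if __name__ == "__main__":
    # Four alleles, two of which share a length; ten observations in total.
    print(length_polymorphism_score([10, 10, 12, 15], [3, 1, 4, 2]))
```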

Fixation index (Fst)

Fixation index (Fst) is a measure of genetic differentiation between populations, quantifying the proportion of genetic variance due to population structure (Meirmans and Hedrick 2011). Here we calculated the Fst of TR alleles across loci. We first ran a query to calculate allele counts by population using population_ac_by_length.py, which creates input_allele_counts.tsv. We then calculated Fst, using an equation from Sampson et al. (2011), with `python calculate_fst.py -o result.tsv input_allele_counts.tsv`.
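To show how per-population allele counts translate into a differentiation measure, the sketch below uses the textbook heterozygosity-based form Fst = (H_T - H_S) / H_T with equal population weights. It is a stand-in for illustration and is not claimed to be the equation from Sampson et al. (2011) or the logic of calculate_fst.py; the function name and input layout are hypothetical.

```python
def fst_from_counts(counts_by_pop):
    """Heterozygosity-based Fst for one multi-allelic locus.

    counts_by_pop: dict mapping population -> {allele (e.g. length): count}.
    Populations are weighted equally (an assumption of this sketch).
    Returns None if fewer than two populations have data.
    """
    alleles = set()
    for counts in counts_by_pop.values():
        alleles.update(counts)

    freqs = []
    for counts in counts_by_pop.values():
        n = sum(counts.values())
        if n > 0:
            freqs.append({a: counts.get(a, 0) / n for a in alleles})
    if len(freqs) < 2:
        return None

    # H_S: mean expected heterozygosity within populations.
    h_s = sum(1 - sum(p * p for p in f.values()) for f in freqs) / len(freqs)
    # H_T: expected heterozygosity from the mean allele frequencies.
    mean_p = {a: sum(f[a] for f in freqs) / len(freqs) for a in alleles}
    h_t = 1 - sum(p * p for p in mean_p.values())
    if h_t == 0:
        return 0.0  # locus is monomorphic across all populations
    return (h_t - h_s) / h_t


if __name__ == "__main__":
    # Two populations fixed for different allele lengths.
    print(fst_from_counts({"pop1": {10: 20}, "pop2": {12: 20}}))
```

Populations fixed for different alleles give the maximum value, while populations with identical allele frequencies give zero.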

Population informative TR loci

A baseline PCA of TR alleles across 105 samples showed reasonable clustering by superpopulation (Figure 9). In the next step, by filtering all TR loci based on our conditions (fixation index > 20 and length polymorphism score > 20), we identified 14 loci of interest for further investigation: 7 within genes, 6 in intergenic regions, and 1 in an exon. We performed PCAs on the loci in genes and intergenic regions to explore the role of TRs in these regions in predicting population structure.

Figure 9. Baseline PCA of all TR alleles across 105 samples.

2. Simulation of mosaic variants

We selected chromosome 22 from the HG002 BAM file to test the TykeVar pipeline due to its high coverage (130x), which ensures reliable detection of variants and provides an ideal dataset for simulating mosaic alterations. Using the TykeVar pipeline, we introduced artificial variants into the BAM file to mimic a range of variant scenarios relevant to mosaicism. Figure 10 (A) illustrates the process of incorporating insertion variants into the BAM file, while Figure 10 (B) showcases mosaic deletions introduced exclusively in the modified BAM file, which were absent in the original dataset. To assess the performance of TykeVar, we employed the Sniffles2 mosaic variant caller, achieving over 80% accuracy in detecting the artificially introduced mosaic reads. These findings confirm that the TykeVar pipeline effectively generates realistic mosaic variants and that our detection strategy is robust, validating the approach for simulating and identifying specific genomic alterations in modified BAM files.

Figure 10. Examples of simulated mosaic variant insertions and deletions.

(A) Injection of artificial variants into the HG002 chromosome 22 BAM file using the TykeVar pipeline.

(B) Visualization of deletions in the modified HG002 chromosome 22 BAM file after TykeVar pipeline processing.

3. AMRDiscover

We generated overall statistics, a temporal analysis and a visualization of the spatial distribution. Most of our analysis focused on the four most abundant pathogenic species: Pseudomonas aeruginosa, Acinetobacter baumannii, Klebsiella pneumoniae, and Escherichia coli.

Most organisms contribute few AMR genes to the database, while a few organisms contribute the bulk of AMR genes. In terms of the number of AMR genes per species, Pseudomonas aeruginosa, Acinetobacter baumannii, Klebsiella pneumoniae and Escherichia coli far outweighed other species, while some clinically significant pathogens, for instance Streptococcus pneumoniae, did not rank high. Most targets of AMR genes were penicillins (penams), carbapenems, monobactams, and cephalosporins, which are all beta-lactams, i.e., antibiotics that inhibit the synthesis of bacterial cell walls. Other significant targets were aminoglycosides (e.g., amikacin), tetracyclines (e.g., doxycycline & tetracycline), and peptide antibiotics (e.g., bacitracin). The most prevalent mechanism was “antibiotic inactivation”, while “antibiotic target alteration” and “antibiotic efflux” ranked second and third, respectively.

Our temporal analysis focused on the four top species. The occurrence of resistance increased over time in all countries and for all types of antibiotics. However, this needs to be normalized by the number of samples in each year, because more data are available for recent years than for early years. Looking at the mechanism of resistance, the prevalence of each mechanism fluctuated mildly in each species. “Antibiotic efflux” was the most common mechanism in three species, but in K. pneumoniae “antibiotic target alteration” was the most common (Figure 11).
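The normalization mentioned above amounts to dividing the hit counts by the number of samples available for each year. The sketch below shows this with invented toy numbers; the function name and input dictionaries are hypothetical and no values are taken from the analysis.

```python
def hits_per_sample(hits_by_year, samples_by_year):
    """Normalize AMR alignment hits by the number of samples per year.

    Years with no samples are skipped to avoid division by zero.
    """
    return {
        year: hits_by_year.get(year, 0) / n_samples
        for year, n_samples in sorted(samples_by_year.items())
        if n_samples > 0
    }


if __name__ == "__main__":
    # Toy values: raw hits rise a hundredfold, but so does sampling effort.
    hits = {2001: 10, 2020: 1000}
    samples = {2001: 5, 2020: 1000}
    print(hits_per_sample(hits, samples))
```

In this toy case the raw hit count grows from 10 to 1000, yet the rate per sample falls from 2.0 to 1.0, which illustrates why unnormalized trends can be misleading.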

Figure 11. A) The distribution of AMR genes corresponding to the top 20 species. B) The hit numbers of AMR genes in Pseudomonas aeruginosa. C) Trends of AMR mechanisms in four focal species. D) A screenshot of our interactive map.

4. Mobile elements across species

As a result of the hackathon, starfish is now available as a docker container and snakemake pipeline. The use of containerized environments (e.g., docker) and workflow management systems (e.g., snakemake) is crucial for ensuring the reliability and reproducibility of bioinformatics analyses. The docker container provides a consistent and isolated environment, encapsulating all the necessary software dependencies, libraries, and configurations needed to run analyses. The snakemake pipeline enables the automation and organization of the starfish analysis, ensuring that each step is executed in a specified sequence with minimal human intervention. It also enhances the reliability of analyses by allowing for error tracking, version control, and easy debugging. Furthermore, both Docker and Snakemake enable comprehensive documentation of the analysis pipeline, making it easier for others to understand, validate, and reuse the methods. Together, these two tools improve the accessibility and usability of starfish, to facilitate its application to non-fungal genomes.

5. ONT metagenome simulator

We ran the Mimic pipeline on the following SRA samples (SRR29660113 and SRR30413550). Metagenomes were simulated at 1k, 50k, and 100k reads for both ‘perfect’ reads and default error-prone reads, which in this case reflected a ~11% sequencing error rate (4% mismatch, 4% insertion, 3% deletion), which is high but not unrealistic for current ONT devices. For the 1k reads we evaluated the accuracy of Kraken2’s classification of each read at the genus level and above for SRR30414550. Here, the “FN” counts reflect reads that are not classified by Kraken2 at that taxonomic rank. For the ‘perfect’ reads, we get:

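| rank   | FN | TP  | FP  | TN | Prec  | Rec   |
|--------|----|-----|-----|----|-------|-------|
| genus  | 47 | 799 | 154 | 0  | 0.838 | 0.799 |
| family | 41 | 862 | 97  | 0  | 0.899 | 0.862 |
| order  | 39 | 868 | 93  | 0  | 0.903 | 0.868 |
| class  | 33 | 955 | 12  | 0  | 0.988 | 0.955 |
| phylum | 29 | 965 | 6   | 0  | 0.994 | 0.965 |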

While for the error-prone reads, we get:

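| rank   | FN  | TP  | FP  | TN | Prec  | Rec   |
|--------|-----|-----|-----|----|-------|-------|
| genus  | 227 | 566 | 206 | 1  | 0.733 | 0.567 |
| family | 219 | 589 | 191 | 1  | 0.755 | 0.590 |
| order  | 214 | 599 | 186 | 1  | 0.763 | 0.600 |
| class  | 176 | 792 | 31  | 1  | 0.962 | 0.793 |
| phylum | 171 | 798 | 30  | 1  | 0.964 | 0.799 |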

These preliminary results illustrate the potential impact of sequencing errors on the accuracy of Kraken2 taxonomic classifications. Kraken2 performs well at all taxonomic ranks for error-free reads. However, when sequencing errors are introduced, Kraken2’s precision and recall drop quickly, particularly at lower taxonomic ranks. It is important to note that Kraken2 was designed for short, accurate reads; all things considered, it demonstrates good flexibility when used on these long reads.
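The precision and recall columns in the tables above can be recomputed from the read counts. The reported recall values are consistent with TP / (TP + FP + FN), i.e., a misclassified read also counts as a miss for its true taxon; this definition is inferred from the reported numbers rather than stated in the text, and the function name is hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Per-read precision and recall, rounded to three decimals.

    precision = TP / (TP + FP)
    recall    = TP / (TP + FP + FN)   (definition inferred from the tables)
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fp + fn)
    return round(precision, 3), round(recall, 3)


if __name__ == "__main__":
    # Genus-level counts from the two tables (FN, TP, FP as reported).
    print("perfect reads:    ", precision_recall(tp=799, fp=154, fn=47))
    print("error-prone reads:", precision_recall(tp=566, fp=206, fn=227))
```

Running the function on the other ranks reproduces the remaining rows in the same way.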

Although we successfully demonstrated MIMIC’s ability to generate and evaluate simulated ONT reads, we were not able to deploy it in the DNAnexus environment. However, we were able to detect the reduced precision of Kraken2 on reads with simulated errors. The next step would be to benchmark our tool against existing metagenomic long-read simulators such as CAMISIM. Furthermore, adding gene gain/loss events to the reference genomes would yield simulated datasets with a known ‘ground truth’, which could be used either to evaluate existing tools or to build efficient pipelines that quantify these variations.

6. Haploblock clusters

During the hackathon, we used the ARG-Needle software to produce ancestral recombination graphs (ARGs) for the Chinese Dai in Xishuangbanna, China (CDX) population from the 1000Genomes Database. ARGs are collections of trees that contain nodes corresponding to individual genomes and their ancestors, and edges representing the evolutionary inheritance of genomic variants. We hypothesized that we could analyze intra- and interpopulation genomic variation in specific regions of interest (e.g., immunological genes, such as HLA-A) by comparing haplotype blocks that overlap with those regions; therefore, we developed a workflow for converting haplotype blocks into similarity matrices. We expected that such similarity matrices could be useful for studies that examine how recombination affects the genomic structure of a population, or how cis- and trans-effects impact rare variant penetrance.

We produced a proof-of-concept ARG for a small fraction of the haplotype block (chr6:136011-160001) in the Chinese Dai in Xishuangbanna, China (CDX) population from 1000Genomes. To analyze the result, we converted the ARGN file corresponding to the ARG into a tskit.TreeSequence object, and used tskit to summarize the results (Figure 12) and to visualize the first tree of the ARG (Figure 13).

Figure 12. Summary of a proof-of-concept ancestral recombination graph (ARG) inferred from one haplotype block (chr6:136011-160001) of the CDX population from 1000Genomes.

The ARG contains 84 individual trees with 15,981 nodes and 27,540 edges.

Figure 13. Visualization of the first tree from the proof-of-concept ancestral recombination graph (ARG) inferred from one haplotype block (chr6:136011-160001) of the CDX population from 1000Genomes.

Each tree of the ARG represents a fraction of the genome in the CDX population that shares common ancestry. The first tree that we analyzed shows how genomic variation associated with recombination events in one genomic region has been inherited within the population. Further analysis of the ARG, as well as a comparison with ARGs of haplotype blocks overlapping other regions of interest, could reveal individuals’ risk for certain diseases or inheritance patterns in polygenic diseases. However, such analyses were beyond the scope of this hackathon project.

Overall, this project explored the idea of analyzing haplotype blocks in a computationally efficient and inexpensive way.

We acknowledge that this undertaking was fraught with challenges inherent to the hackathon framework, such as a lack of time for data exploration and preprocessing, as well as technical difficulties with running the software (Busby et al. 2016). Nevertheless, we expect the result of our project to be useful for the scientific community and serve as a future reference for further projects.

7. Somatic variants in cancer

In our mock example, we applied the tool to a subset of reported structural variants (SVs) from the International Cancer Genome Consortium. After filtering for variant allele frequency (VAF), the dataset included approximately 1,000 deletions, duplications, and inversions. We found that over 900 genes were affected by duplications alone, many of which are involved in known cancer-related pathways (Figure 14).

Figure 14. Gene ontology terms associated with genes overlapping duplications in the mock dataset.

8. Rapid phenotypic labeling of variants

A previous similar study identified 11 SV loci associated with an increased risk for obesity, with an Odds Ratio exceeding 25% (Walters et al. 2013). This project aims to build upon such findings by extending the analysis to a broader set of SVs and phenotypes, facilitating the discovery of novel genetic contributors to complex traits.

Validation

We validated our pipeline on the Project Adotto assembly-based variant calls from the GIAB tandem repeat benchmark (https://zenodo.org/records/6975244), beginning with SV calls on chromosome 1 (insertions or deletions). The SV filtering steps reduced the call set from 194,098 SVs to 55,905 SVs (after removing those under 50 bp in length) and then to 29,026 SVs (after applying the truvari collapse function, keeping the most common allele in each cluster).
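The two filtering steps can be illustrated with a simplified Python sketch. This is not the truvari algorithm, which compares calls using richer criteria such as size and sequence similarity; here, as a stated simplification, calls of the same type are greedily clustered by position and the call with the highest allele count is kept. The `SV` record, the 500 bp clustering distance, and the toy calls are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    chrom: str
    pos: int
    svtype: str        # "INS" or "DEL"
    length: int        # absolute SV length in bp
    allele_count: int  # number of observed alleles carrying this call


def filter_by_length(svs, min_len=50):
    """Drop calls shorter than min_len bp."""
    return [sv for sv in svs if sv.length >= min_len]


def collapse(svs, max_dist=500):
    """Greedy positional clustering; keep the most common allele per cluster.

    max_dist is an illustrative threshold, not a truvari default.
    """
    kept, cluster = [], []
    for sv in sorted(svs, key=lambda s: (s.chrom, s.svtype, s.pos)):
        starts_new_cluster = cluster and (
            sv.chrom != cluster[-1].chrom
            or sv.svtype != cluster[-1].svtype
            or sv.pos - cluster[-1].pos > max_dist
        )
        if starts_new_cluster:
            kept.append(max(cluster, key=lambda s: s.allele_count))
            cluster = []
        cluster.append(sv)
    if cluster:
        kept.append(max(cluster, key=lambda s: s.allele_count))
    return kept


calls = [
    SV("chr1", 1000, "DEL", 30, 5),    # removed: shorter than 50 bp
    SV("chr1", 2000, "DEL", 120, 3),   # same cluster as the next call
    SV("chr1", 2100, "DEL", 118, 9),   # kept: most common allele in its cluster
    SV("chr1", 9000, "INS", 75, 1),    # kept: only call in its cluster
]
filtered = filter_by_length(calls)
collapsed = collapse(filtered)
```

In the toy input, four calls are reduced to three by the length filter and to two by the collapse step, mirroring the stepwise reduction reported for chromosome 1.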

Gene analysis of chromosome 1

In our analysis of chromosome 1 using the Adotto dataset, we identified the genes whose variants had the highest allele frequencies across different populations. These allele frequencies, including those for structural variants, were sourced from the gnomAD dataset. This analysis highlights genes that show significant variation in allele frequencies among American, Ashkenazi Jewish, East Asian, Finnish, Non-Finnish European, and Other populations.

For example, the gene NFASC, which is involved in neurodevelopmental disorders with central and peripheral motor dysfunction (MIM 609145), shows notable structural variants in individuals of East Asian ancestry. The prevalence of structural variants of NFASC in this population underscores the importance of understanding population-specific genetic variation, which can inform physicians and researchers about potential genetic risk factors and guide future studies. Our tool, together with continued research into these population-specific variants, can help advance personalized medicine and improve genetic counseling.

Annotating SVs with ClinVar annotations

We have successfully validated our pipeline by gathering all structural variants (SVs) from the Adotto database and combining them with ClinVar data. This analysis led to the identification of three structural variants classified as pathogenic in ClinVar, affecting a total of eight individuals across the dataset (Figure 15).

Figure 15. A) Pathological Categories for each Human Chromosome and distinct Structural Variants. B) Top 10 Genes with SVs by Allele Frequency in Chromosome 1.
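The ClinVar annotation step described above can be sketched in Python as a coordinate-overlap join. The record layout, the field names, the half-open interval convention, and the placeholder sample and condition names are all assumptions made for illustration; they do not reflect the actual SVeedy data model.

```python
def annotate_pathogenic(svs, clinvar):
    """Attach ClinVar conditions to SVs that overlap a pathogenic record.

    svs: list of dicts with id, chrom, start, end, carriers (sample names).
    clinvar: list of dicts with chrom, start, end, significance, condition.
    Intervals are treated as half-open [start, end).
    """
    annotated = []
    for sv in svs:
        conditions = sorted({
            rec["condition"]
            for rec in clinvar
            if rec["significance"] == "Pathogenic"
            and rec["chrom"] == sv["chrom"]
            and rec["start"] < sv["end"]
            and sv["start"] < rec["end"]
        })
        if conditions:
            annotated.append(
                {"id": sv["id"], "conditions": conditions, "carriers": sv["carriers"]}
            )
    return annotated


# Hypothetical toy records.
svs = [
    {"id": "sv1", "chrom": "chr3", "start": 100, "end": 400,
     "carriers": ["sample_1", "sample_2"]},
    {"id": "sv2", "chrom": "chr3", "start": 900, "end": 950,
     "carriers": ["sample_3"]},
]
clinvar = [
    {"chrom": "chr3", "start": 350, "end": 500,
     "significance": "Pathogenic", "condition": "condition_X"},
    {"chrom": "chr3", "start": 920, "end": 930,
     "significance": "Benign", "condition": "condition_Y"},
]
hits = annotate_pathogenic(svs, clinvar)
affected = sorted({sample for hit in hits for sample in hit["carriers"]})
```

Counting the entries in `hits` and the samples in `affected` gives the two summary numbers reported above: the number of pathogenic SVs and the number of individuals carrying them.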

To facilitate the use of this information, we have developed an additional tool that converts the data into a user-friendly PDF output. This PDF includes the sample name of each individual and the predicted diagnosis based on the known ClinVar phenotypes. It provides a comprehensive report detailing each variant, including all relevant information. We were able to find individual patients with specific ontologies from associated SVs and their genes. An example of this PDF report is shown in Figure 16.

Figure 16. An example output notifying patient HG00733 that they are at risk for multiple conditions as a result of a SV on chromosome 3.

The location of interest is then related to specific ontologies listed on the ClinVar database.

Conclusion and next steps

The concepts developed over the 2024 Baylor College of Medicine/DNAnexus hackathon described here represent novel work across multiple important fields of computational biology. These projects range from complex regions and variants of the human genome to comprehensive analysis methodologies for AMR across bacteria. Individually, they represent important milestones in their respective fields, pushing our capabilities to obtain novel insights into complex data sets and enabling a deeper understanding of important mechanisms. This was made possible by a multinational team of 48 scientists from around the world, who carried out this work in a FAIR-compliant manner.

1. Tandem repeats

A comprehensive single report of all of these measures would further assist researchers in prioritizing tandem repeats. This study will help to subset TRs and further expand TR-specific kinship analysis. This research could be expanded to relate population structure to methylation data, to compare TRs in genes and promoters with those in intergenic regions, and to compare sex chromosomes with autosomes. All code and scripts for the tandem repeat queries have been added to the tandem repeats GitHub repository.

2. Simulation of mosaic variants

The next steps of our project will focus on developing and releasing a Dockerized version of our mosaic variant detection framework, ensuring easy deployment across different environments and improving accessibility and reproducibility. We will further improve the detection of mosaic variants from short-read sequencing data, with a particular focus on identifying single nucleotide substitutions (SNS) and small insertions and deletions (indels) across a range of variant allele frequencies (VAFs). This enhancement aims to increase sensitivity and accuracy in detecting these subtle genomic alterations. Additionally, we will refine the re-genotyping process in the SpikeVar pipeline to improve the accuracy of the ground truth VCF file that it generates.

3. AMRDiscovery

Currently, our interactive heat maps provide easy visualization of AMR genes and their associated information on the world map. The approach is novel but requires further fine-tuning, including but not limited to a more user-friendly interface, subsetting options, and public accessibility. We observed some trends across years, but these would be clearer after normalization and data cleaning.

This is the first deep dive into the entire SRA dataset. With our alignment results and parsed metadata, we can further investigate associations between AMR genes and their environments or hosts. Phylogenetic analysis is another important aspect: we can study the relationships among similar AMR genes from different species and trace the origin of an AMR gene or of a specific resistant strain. Because long-read sequencing technologies are on the rise, our dataset also provides a good opportunity to examine how sequencing platforms influence the properties of Logan unitigs and their alignments.

This project will not only contribute to the understanding of AMR gene distribution but also provide participants with hands-on experience in handling large-scale genomic datasets and applying bioinformatics tools in a real-world context.

4. Mobile elements across species

A Docker container has been created for enhanced scalability and reproducibility. Future goals include adding a workflow (e.g., a Snakemake pipeline) and applying Starfish to non-fungal genomes (particularly mammalian). However, we anticipate challenges in acquiring appropriate annotation input files and in computational time when moving from small fungal to large mammalian genomes. An alternative approach could involve using different computational tools to identify transposons in eukaryotic organisms.

5. ONT metagenome simulator

We built MIMIC, a metagenome simulator that creates ONT reads based off of the taxonomic composition and error profile of real metagenomic samples. We showed that we can generate simulated samples that accurately reproduce the conditions of actual metagenomic samples, and that the commonly used taxonomic classifier Kraken2 performs poorly on error-prone long reads. Moving forward, MIMIC provides a simple framework to comprehensively evaluate long-read taxonomic profilers on any sample type, allowing researchers to test or develop tools for more precise real world applications.

6. Haploblock clusters

While this challenge was difficult for a hackathon, we were able to lay the groundwork for other teams to work on this particular problem. In fact, a team at the Nucleate Hackathon Challenge in Pittsburgh (October 2024; https://www.nucleate.xyz) made additional headway on this problem (https://github.com/ShijieTang/BioHack_Haplotype), moving away from explicit ancestral recombination graphs and toward brute-force analysis of local haplotype blocks. Ancestral recombination graphs remain a promising alternative for studying complex genealogical relationships (Lewanski, Grundler, and Bradburd 2023). This work will continue at the Carnegie Mellon University Libraries Hackathon in March 2025. Please check https://biohackathons.github.io for additional details.

7. Somatic variants in cancer

The MoVana pipeline is introduced as a specialized tool to focus on mosaic variants, as opposed to a comprehensive characterization of all events in a given call set. By implementing variant allele frequency (VAF)-based filtering, the pipeline enables the selection of putative subclonal mutations. The workflow then intersects these mutations with affected coding sequences and ultimately identifies all genes in the dataset that overlap with a specific SV type. This approach aids users in linking structural variants to their functional consequences, particularly by identifying pathways impacted by SVs through the gene set enrichment analysis implemented in MoVana.
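The VAF-based selection and gene-overlap steps can be illustrated with the following minimal Python sketch. The VAF window, the record layout, the half-open interval convention, and the gene names are hypothetical choices made for illustration; MoVana's actual thresholds and data structures are not specified here.

```python
def select_by_vaf(svs, min_vaf=0.0, max_vaf=0.3):
    """Keep SVs whose VAF falls in an assumed 'putative subclonal' window."""
    return [sv for sv in svs if min_vaf <= sv["vaf"] <= max_vaf]


def genes_hit(svs, genes, svtype):
    """Names of genes overlapped by at least one SV of the given type.

    Intervals are treated as half-open [start, end).
    """
    hits = set()
    for sv in svs:
        if sv["type"] != svtype:
            continue
        for gene in genes:
            if (
                gene["chrom"] == sv["chrom"]
                and gene["start"] < sv["end"]
                and sv["start"] < gene["end"]
            ):
                hits.add(gene["name"])
    return sorted(hits)


# Hypothetical toy call set and gene annotation.
svs = [
    {"chrom": "chr17", "start": 100, "end": 500, "type": "DUP", "vaf": 0.12},
    {"chrom": "chr17", "start": 400, "end": 900, "type": "DUP", "vaf": 0.48},
    {"chrom": "chr13", "start": 50, "end": 80, "type": "DEL", "vaf": 0.05},
]
genes = [
    {"name": "GENE_A", "chrom": "chr17", "start": 450, "end": 600},
    {"name": "GENE_B", "chrom": "chr17", "start": 800, "end": 1000},
    {"name": "GENE_C", "chrom": "chr13", "start": 60, "end": 70},
]
subclonal = select_by_vaf(svs)
dup_genes = genes_hit(subclonal, genes, "DUP")
del_genes = genes_hit(subclonal, genes, "DEL")
```

The per-type gene lists produced this way are the kind of input that a gene set enrichment analysis would take.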

The future direction of the project includes analyzing mosaic SVs in primary tumors against metastasis and relapse states and finding recurrent mosaic events implicated in treatment resistance. This can be achieved through the integration of multiple variant databases like dbVar, ClinVar, and OncoDB to enhance the accuracy and reliability of clinical outcomes given by the pipeline.

8. Rapid phenotypic labeling of variants

The development of this pipeline represents a significant advancement in the annotation and association of structural variants (SVs) with disorders. By combining gnomAD allele frequencies and ClinVar clinical data, our tool facilitates a more straightforward and efficient approach to detecting and analyzing SVs in patient sequences. The integration of phenotypic information with clinical and larger dataset sources enhances the tool’s utility in patient care, leading to more informed predictions and better clinical decision-making. SVeedy streamlines the interpretation of SV data and enhances the ease of accessing detailed diagnostic information, making it an invaluable resource for clinical research and patient care.

Future Directions: An improvement to SVeedy would include incorporating additional methodologies and databases within the workflow to enhance the accuracy and scope of SV detection and annotation. The tool could also be streamlined by organizing a Binder environment for global accessibility and improving the pipeline in terms of data formatting for efficient SV processing and reduced computational overhead. The streamlined design and future expansions aim to set a new standard for bioinformatics workflows in precision medicine.

Data and software availability

In this study we used the following data:

1000Genomes: Genomic data. Accession numbers: CDX, PUR, GBR. Data available from: https://www.internationalgenome.org

UCSC Genome Browser: Genomic data. Accession number: hg38. Data available from: https://genome.ucsc.edu/cgi-bin/hgTables

The Human Pangenome Reference Consortium (HPRC): Genomic data. Accession number: TBD database. Data available from: https://humanpangenome.org

The Project Adotto Tandem-Repeat Regions and Annotations (v0.3): Genomic data. Data available from: https://zenodo.org/records/8387564

The Logan Database: Genomic data. Accession number: prokaryotic subset. Data available from: https://github.com/IndexThePlanet/Logan; https://registry.opendata.aws/pasteur-logan/

The Comprehensive Antibiotic Resistance Database: Genomic data. Accession number: version-3.3.0. Data available from: https://card.mcmaster.ca/download

The International Cancer Genome Consortium: Genomic data. Accession number: SV calls. Data available from: https://www.icgc-argo.org/

Haplotype collection from (Garg et al. 2020; Ebert et al. 2021; Jarvis et al. 2022) and (Wang et al. 2022).

Software availability:

Tandem Repeats (tdb extensions)

• Source code available from: https://github.com/collaborativebioinformatics/tandemrepeats
• Archived source code at time of publication: https://doi.org/10.5281/zenodo.14531409
• License: MIT License

starfishDiscovery

• Source code available from: https://github.com/collaborativebioinformatics/starfishDiscovery
• Archived source code at time of publication: https://doi.org/10.5281/zenodo.14531378
• License: MIT License

SpikeVarTykeVar

• Source code available from: https://github.com/collaborativebioinformatics/SpikeVarTykeVar
• Archived source code at time of publication: https://doi.org/10.5281/zenodo.14531391
• License: MIT License

SVeedy

• Source code available from: https://github.com/collaborativebioinformatics/SVeedy
• Archived source code at time of publication: https://doi.org/10.5281/zenodo.14531388
• License: MIT License

MoVana

• Source code available from: https://github.com/collaborativebioinformatics/MoVana
• Archived source code at time of publication: https://doi.org/10.5281/zenodo.14531386
• License: MIT License

Mimic

• Source code available from: https://github.com/collaborativebioinformatics/Mimic
• Archived source code at time of publication: https://doi.org/10.5281/zenodo.14531364
• License: MIT License

LLM_SVs

• Source code available from: https://github.com/collaborativebioinformatics/LLM_SVs
• Archived source code at time of publication: https://doi.org/10.5281/zenodo.14531415
• License: MIT License

Haploblock_Clusters

• Source code available from: https://github.com/collaborativebioinformatics/Haploblock_Clusters
• Archived source code at time of publication: https://doi.org/10.5281/zenodo.14531370
• License: MIT License

Extended data

No extended data are associated with this article.

Acknowledgements

We would like to thank Baylor College of Medicine, Richard Gibbs, Chelette, DNAnexus, Rice University Department of Computer Science, ONT, PacBio, and GreGoR.

References

A Global Reference for Human Genetic Variation. Nature. 2015; 526(7571): 68–74.
Agustinho DP, Fu Y, Menon VK, et al.: Unveiling Microbial Diversity: Harnessing Long-Read Sequencing Technology. Nat. Methods. 2024; 21(6): 954–966.
Alcock BP, Huynh W, Chalil R, et al.: CARD 2023: Expanded Curation, Support for Machine Learning, and Resistome Prediction at the Comprehensive Antibiotic Resistance Database. Nucleic Acids Res. 2023; 51(D1): D690–D699.
Bakhtiari M, Park J, Ding Y-C, et al.: Variable Number Tandem Repeats Mediate the Expression of Proximal Genes. Nat. Commun. 2021; 12(1): 2075.
Biesecker LG, Spinner NB: A Genomic View of Mosaicism and Human Disease. Nat. Rev. Genet. 2013; 14(5): 307–320.
Browning BL, Browning SR: Statistical Phasing of 150,119 Sequenced Genomes in the UK Biobank. Am. J. Hum. Genet. 2023; 110(1): 161–165.
Buchfink B, Reuter K, Drost H-G: Sensitive Protein Alignments at Tree-of-Life Scale Using DIAMOND. Nat. Methods. 2021; 18(4): 366–368.
Buchfink B, Xie C, Huson DH: Fast and Sensitive Protein Alignment Using DIAMOND. Nat. Methods. 2015; 12(1): 59–60.
Busby B, Lesko M; August 2015 and January 2016 Hackathon participants et al.: Closing Gaps between Open Software and Public Data in a Hackathon Setting: User-Centered Software Prototyping. F1000Res. 2016; 5(May): 672.
Butler JM: Genetics and Genomics of Core Short Tandem Repeat Loci Used in Human Identity Testing. J. Forensic Sci. 2006; 51(2): 253–265.
Chénais B: Transposable Elements and Human Diseases: Mechanisms and Implication in the Response to Environmental Pollutants. Int. J. Mol. Sci. 2022; 23(5).
Chikhi R, Raffestin B, Korobeynikov A, et al.: Logan: Planetary-Scale Genome Assembly Surveys Life’s Diversity. bioRxiv. 2024.
Costantino I, Nicodemus J, Chun J: Genomic Mosaicism Formed by Somatic Variation in the Aging and Diseased Brain. Genes. 2021; 12(7): 1071.
Data Portal: n.d. Accessed December 20, 2024.
Deb SK, Kalra D, Kubica J, et al.: The Fifth International Hackathon for Developing Computational Cloud-Based Tools and Resources for Pan-Structural Variation and Genomics. F1000Res. 2024; 13(708): 708.
Delaneau O, Zagury J-F, Marchini J: Improved Whole-Chromosome Phasing for Disease and Population Genetic Studies. Nat. Methods. 2013; 10(1): 5–6.
Depienne C, Mandel J-L: 30 Years of Repeat Expansion Disorders: What Have We Learned and What Are the Remaining Challenges? Am. J. Hum. Genet. 2021; 108(5): 764–785.
Dolzhenko E, English A, Dashnow H, et al.: Characterization and Visualization of Tandem Repeats at Genome Scale. Nat. Biotechnol. 2024; 42(10): 1606–1614.
Ebert P, Audano PA, Zhu Q, et al.: Haplotype-Resolved Diverse Human Genomes and Integrated Analysis of Structural Variation. Science. 2021; 372.
English AC, Dolzhenko E, Jam HZ, et al.: Analysis and Benchmarking of Small and Large Genomic Variants across Tandem Repeats. Nat. Biotechnol. 2024. April, 1–12.
English AC, Menon VK, Gibbs RA, et al.: Truvari: Refined Structural Variant Comparison Preserves Allelic Diversity. Genome Biol. 2022; 23(1): 1–20.
English A, Dolzhenko E, Jam HZ, et al.: Benchmarking of Small and Large Variants across Tandem Repeats. bioRxiv. 2023.
Garg S: Towards Routine Chromosome-Scale Haplotype-Resolved Reconstruction in Cancer Genomics. Nat. Commun. 2023; 14(1): 1–11.
Garg S, Fungtammasan A, Carroll A, et al.: Chromosome-Scale, Haplotype-Resolved Assembly of Human Genomes. Nat. Biotechnol. 2020; 39(3): 309–312.
GenBank and WGS Statistics: 2024. December 10, 2024.
Gluck-Thaler E, Vogan AA: Systematic Identification of Cargo-Mobilizing Genetic Elements Reveals New Dimensions of Eukaryotic Diversity. Nucleic Acids Res. 2024; 52(10): 5496–5513.
Griffiths RC, Marjoram P: An Ancestral Recombination Graph. In Progress in Population Genetics and Human Evolution. Springer; 1997; 257–270.
Hofmeister RJ, Ribeiro DM, Rubinacci S, et al.: Accurate Rare Variant Phasing of Whole-Genome and Whole-Exome Sequencing Data in the UK Biobank. Nat. Genet. 2023; 55(7): 1243–1249.
Hsieh A, Morton SU, Willcox JAL, et al.: EM-Mosaic Detects Mosaic Point Mutations That Contribute to Congenital Heart Disease. Genome Med. 2020; 12(1): 1–18.
Human Genomic Variation: 2023.
Jarvis ED, Formenti G, Rhie A, et al.: Semi-Automated Assembly of High-Quality Diploid Human Reference Genomes. Nature. 2022; 611(7936): 519–531.
Jiang Q, Wang Y, Li Q, et al.: Sequence Characterization of RET in 117 Chinese Hirschsprung Disease Families Identifies a Large Burden of de Novo and Parental Mosaic Mutations. Orphanet J. Rare Dis. 2019; 14(1): 237.
Katz K, Shutov O, Lapoint R, et al.: The Sequence Read Archive: A Decade More of Explosive Growth. Nucleic Acids Res. 2022; 50(D1): D387–D390.
Kobayashi T: Ribosomal RNA Gene Repeats, Their Stability and Cellular Senescence. Proc. Jpn. Acad. Ser. B Phys. Biol. Sci. 2014; 90(4): 119–129.
Lathe WC, Williams JM, Mangan ME, et al.: Genomic Data Resources: Challenges and Promises. Nature Education. 2008; 1(3): 2.
Leitwein M, Duranton M, Rougemont Q, et al.: Using Haplotype Information for Conservation Genomics. Trends Ecol. Evol. 2020; 35(3): 245–258.
Levinson G: Rethinking Evolution: The Revolution That’s Hiding In Plain Sight. World Scientific; 2019.
Lewanski AL, Grundler MC, Bradburd GS: The Era of the ARG: An Empiricist’s Guide to Ancestral Recombination Graphs. 2023.
Liao W-W, Asri M, Ebler J, et al.: A Draft Human Pangenome Reference. Nature. 2023; 617(7960): 312–324.
Li H: Minimap2: Pairwise Alignment for Nucleotide Sequences. Bioinformatics. 2018; 34(18): 3094–3100.
Luo W-S, Qiang D-R, Zhu W-R, et al.: Haplotype Analysis on Association between C-Reactive Protein Gene and Susceptibility to Type 2 Diabetes Mellitus in Chinese Han Population. Acta Diabetol. 2024; 61(11): 1423–1432.
Lupski JR, de Oca-Luna RM, Slaugenhaupt S, et al.: DNA Duplication Associated with Charcot-Marie-Tooth Disease Type 1A. Cell. 1991; 66(2): 219–232.
Ma M, Li Y, Dai S, et al.: A Meta-Analysis on the Prevalence of Charcot-Marie-Tooth Disease and Related Inherited Peripheral Neuropathies. J. Neurol. 2023; 270(5): 2468–2482.
Mc Cartney AM, Mahmoud M, et al.: An International Virtual Hackathon to Build Tools for the Analysis of Structural Variants within Species Ranging from Coronaviruses to Vertebrates. F1000Res. 2021; 10(246): 246.
McNulty SM, Sullivan BA: Alpha Satellite DNA Biology: Finding Function in the Recesses of the Genome. Chromosome Research: An International Journal on the Molecular, Supramolecular and Evolutionary Aspects of Chromosome Biology. 2018; 26(3): 115–138.
Meirmans PG, Hedrick PW: Assessing Population Structure: F(ST) and Related Measures. Mol. Ecol. Resour. 2011; 11(1): 5–18.
Mimic/README.md at Main · collaborativebioinformatics/Mimic: GitHub. 2024.
National Library of Medicine: 2024, December.
Navarro Gonzalez J, Zweig AS, Speir ML, et al.: The UCSC Genome Browser Database: 2021 Update. Nucleic Acids Res. 2021; 49(D1): D1046–D1057.
Pagel KA, Kim R, Moad K, et al.: OpenCRAVAT, an Open Source Collaborative Platform for the Annotation of Human Genetic Variation. bioRxiv. 2019.
Project Adotto Tandem-Repeat Regions and Annotations: 2024.
Sakamoto Y, Sereewattanawoot S, Suzuki A: A New Era of Long-Read Sequencing for Cancer Genomics. J. Hum. Genet. 2019; 65(1): 3–10.
Sampson J, Kidd KK, Kidd JR, et al.: Selecting SNPs to Identify Ancestry. Ann. Hum. Genet. 2011; 75(4): 539–553.
Sankareswaran A, Kunte P, Fraser DP, et al.: Type 1 Diabetes Genetic Risk Score Classifies Diabetes Subtypes in Indians: Impact of HLA Diversity on the Lower Discriminative Ability. medRxiv. 2024.
Sapoval N, Liu Y, Curry KD, et al.: Lightweight Taxonomic Profiling of Long-Read Metagenomic Datasets with Lemur and Magnet. bioRxiv. 2024.
Sayers EW, Cavanaugh M, Clark K, et al.: GenBank. Nucleic Acids Res. 2020; 48(D1).
Sedlazeck FJ, Dhroso A, Bodian DL, et al.: Tools for Annotation and Comparison of Structural Variation. F1000Res. 2017; 6(October): 1795.
Shipilina D, Pal A, Stankowski S, et al.: On the Origin and Structure of Haplotype Blocks. Mol. Ecol. 2023; 32(6): 1441–1457.
Stavrou M, Kleopa KA: CMT1A Current Gene Therapy Approaches and Promising Biomarkers. Neural Regen. Res. 2023; 18(7): 1434–1440.
Sugden R, Kelly R, Davies S: Combatting Antimicrobial Resistance Globally. Nat. Microbiol. 2016; 1(10): 16187.
The International HapMap Project: The International HapMap Project. Nature. 2003; 426(6968): 789–796.
UK Biobank: 2024. December 10, 2024.
Walker K, Kalra D, Lowdon R, et al.: The Third International Hackathon for Applying Insights into Large-Scale Genomic Composition to Use Cases in a Wide Range of Organisms. F1000Res. 2022; 11(530): 530.
Walters RG, Coin LJM, Ruokonen A, et al.: Rare Genomic Structural Variants in Complex Disease: Lessons from the Replication of Associations with Obesity. PLOS ONE. 2013; 8(3): e58048.
Wang T, Antonacci-Fulton L, Howe K, et al.: The Human Pangenome Project: A Global Resource to Map Genomic Diversity. Nature. 2022; 604(7906): 437–446.
Yang C, Chu J, Warren RL, et al.: NanoSim: Nanopore Sequence Read Simulator Based on Statistical Characterization. GigaScience. 2017; 6(4): 1–6.
Zhang BC, Biddanda A, Gunnarsson ÁF, et al.: Biobank-Scale Inference of Ancestral Recombination Graphs Enables Genealogical Analysis of Complex Traits. Nat. Genet. 2023; 55(5): 768–776.
Zhang J, Bajari R, Andric D, et al.: The International Cancer Genome Consortium Data Portal. Nat. Biotechnol. 2019; 37(4): 367–369.
Zook JM, Catoe D, McDaniel J, et al.: Extensive Sequencing of Seven Human Genomes to Characterize Benchmark Reference Materials. Scientific Data. 2016; 3(1): 1–26.

VERSION 1 PUBLISHED 07 Nov 2025

Author details

¹ Baylor College of Medicine Department of Pediatrics, Houston, Texas, 77030, USA
² Cancer and Hematology Center, Texas Children’s Hospital, Houston, TX, 77030, USA
³ Department of Biological Sciences, University of Alabama, Tuscaloosa, 35401, USA
⁴ Baylor College of Medicine Department of Molecular and Human Genetics, Houston, Texas, USA
⁵ Department of Veterinary Pathobiology, Texas A&M University College of Veterinary Medicine and Biomedical Sciences, College Station, 77840, USA
⁶ Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Warsaw, Poland
⁷ Univ. Grenoble Alpes, CNRS, UMR 5525, TIMC / MAGe, 38000, Grenoble, France
⁸ University of California-Irvine, Department of Ecology and Evolutionary Biology, Irvine, California, USA
⁹ The University of Tennessee Knoxville Department of Microbiology, Knoxville, Tennessee, USA
¹⁰ Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA
¹¹ Department of Biological Sciences, Louisiana State University, Baton Rouge, LA, USA
¹² Institut Pasteur, Université Paris Cité, Sequence Bioinformatics unit, Paris, 75015, France
¹³ High Institute for Research and Education in Transfusion Medicine, Tehran, Tehran Province, Iran
¹⁴ Department of Clinical Sciences, Lund University, Lund, Sweden
¹⁵ Department of Veterinary Population Medicine, College of Veterinary Medicine, University of Minnesota, St. Paul, USA
¹⁶ Rice University Department of Computer Science, Houston, Texas, USA
¹⁷ Baylor College of Medicine, Houston, Texas, USA
¹⁸ Department of Bioengineering, Northeastern University, 360 Huntington Ave, Boston, MA, 02115, USA
¹⁹ CareDx, 8000 Marina Blvd, Brisbane, CA, 94005, USA
²⁰ Twist Bioscience, South San Francisco, CA, 94080, USA
²¹ The University of Alabama at Birmingham Division of Nephrology, Birmingham, Alabama, USA
²² Complex Trait Genomics Laboratory, Smurfit Institute of Genetics, Trinity College Dublin, Dublin, Ireland
²³ Incident Management Team, Ministry of Health, Uganda
²⁴ Section of Epidemiology and Population Sciences, Baylor College of Medicine, Houston, USA
²⁵ DataSentics, Prague, Czech Republic
²⁶ Davies Livestock Research Centre, University of Adelaide, Roseworthy, SA, Australia
²⁷ Johns Hopkins University Department of Computer Science, Baltimore, Maryland, USA
²⁸ Centre for Genomic Regulation (CRG), C/ del Dr. Aiguader, 88, 08003, Barcelona, Spain
²⁹ Center for Alzheimer’s and Related Dementias (CARD), National Institute on Aging (NIA), National Institutes of Health (NIH), Bethesda, Maryland, 20892, USA
³⁰ Pondicherry University Department of Bioinformatics, Pondicherry, India
³¹ Department of Medicine, University of Crete, Crete, Greece
³² Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
³³ Novaltech, R&D Division, USA
³⁴ Icahn School of Medicine at Mount Sinai Department of Medicine, New York, New York, USA
³⁵ Home Team Science and Technology Agency, Singapore, Singapore
³⁶ Graduate Program in Quantitative and Computational Biosciences, Baylor College of Medicine, Houston, TX, 77030, USA
³⁷ Museum of Natural History - University of the Philippines Los Baños, Los Baños, Philippines
³⁸ University of Massachusetts Lowell, Lowell, Massachusetts, USA
³⁹ Shahid Beheshti University of Medical Sciences, Tehran, Tehran Province, Iran
⁴⁰ Oxford Nanopore Technologies, Oxford, UK
⁴¹ The Patrick G Johnston Centre for Cancer Research, Queen’s University Belfast, Belfast, UK
⁴² Ken Kennedy Institute, Rice University, Houston, TX, USA
⁴³ Rice University Department of Bioengineering, Houston, Texas, USA
⁴⁴ University of Chicago Department of Pathology, Chicago, Illinois, USA
⁴⁵ UK Dementia Research Institute, London, England, UK
⁴⁶ Department of Biotechnology and Genetic Engineering, University of Ain-Shams, Cairo, Egypt
⁴⁷ DNAnexus, Inc., Mountain View, CA, 94040, USA

Farhang Jaryani
Roles: Software, Writing – Original Draft Preparation, Writing – Review & Editing

Bishnu Adhikar
Roles: Software, Writing – Original Draft Preparation, Writing – Review & Editing

Shaghayegh Beheshti
Roles: Software, Writing – Original Draft Preparation, Writing – Review & Editing

Sarah Fross
Roles: Software, Writing – Original Draft Preparation, Writing – Review & Editing

Jędrzej Kubica
Roles: Software, Writing – Original Draft Preparation, Writing – Review & Editing

Jen-Yu Wang
Roles: Software, Writing – Original Draft Preparation, Writing – Review & Editing

Aanuoluwa Adekoya
Roles: Software

Daniel P. Agustinho
Roles: Software

Oluwaseun Akinsulire
Roles: Software

Francesco Andreace
Roles: Software

Abolhassan Bahari
Roles: Software

Christian Brueffer
Roles: Software

Siyuan Cheng
Roles: Software

Jonah Cullen
Roles: Software

Kristen Curry
Roles: Software

Ryan Doughty
Roles: Software

Adam English
Roles: Software

Neda Ghohabi Esfahani
Roles: Software

Natali Gulbahce
Roles: Software

Tina Han
Roles: Software

Nha Van Huynh
Roles: Software

Michal Izydorczyk
Roles: Software

Minal Jamsandekar
Roles: Software

Emrah Kacar
Roles: Software

Arthur Shem Kasambula
Roles: Software

Rupesh K. Kesharwani
Roles: Software

Divya Kalra
Roles: Software

Shwetha V Kumar
Roles: Software

Iva Kotásková
Roles: Software

Callum MacPhillamy
Roles: Software

Sina Majidian
Roles: Software

Mauricio Moldes
Roles: Software

Abraham (Jon) Moller
Roles: Software

Rajarshi Mondal
Roles: Software

Eleni Mourouzidou
Roles: Software

Michael Nute
Roles: Software

Dmitrii Olisov
Roles: Software

Anika Pallapothu
Roles: Software

Meghana Ram
Roles: Software

Marcus Chan Hua Rui
Roles: Software

Philippe Sanio
Roles: Software

Russel T. Santos
Roles: Software

Michael Olufemi
Roles: Software

Narges SangaraniPour
Roles: Software

Moustafa Shokrof
Roles: Software

Sam Stroupe
Roles: Software

Gobikrishnan Subramaniam
Roles: Software

Todd J. Treangen
Roles: Software, Supervision

Pankhuri Wanjari
Roles: Software

Umran Yaman
Roles: Software

Farha zain
Roles: Software

Xinchang Zheng
Roles: Software

Fritz J Sedlazeck
Roles: Software, Supervision, Writing – Original Draft Preparation, Writing – Review & Editing

Ben Busby
Roles: Software, Supervision, Writing – Original Draft Preparation, Writing – Review & Editing

Competing interests

This article reflects the views of the author and should not be construed to represent FDA's views or policies. BB is a full-time employee of DNAnexus, Inc. FS is sponsored by Illumina, PacBio, and ONT.

Grant information

Shwetha V Kumar is supported by CPRIT grant #RP210037 (PI Aaron Thrift).
Fritz J Sedlazeck is supported by NIH grants 1UG3NS132105-01 and 1U01HG011758-01.

The research was supported by the Cancer Prevention and Research Institute of Texas under grant number #RP210037.

Sarah Fross is supported by a training grant from the National Institutes of Health under Award Number 5T32GM135748-04.

Shaghayegh Beheshti is supported by a training grant from the National Institutes of Health under Award Number 5T32GM139534-04 and NHGRI U01 HG011758.

Ryan Doughty is supported by a training fellowship from the Gulf Coast Consortia, on the NLM Training Program in Biomedical Informatics & Data Science (T15LM007093).

Jędrzej Kubica is supported by the Ministry of Science and Higher Education (Poland) as a project under the program Excellence Initiative – Research University (2020–2026) (decision no.: IV.2.3./30/2024) and the France 2030 state funding managed by the National Research Agency with the reference "ANR-22-PEPRSN-0013".

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Article Versions (1)

version 1

Published: 07 Nov 2025, 14:1231

https://doi.org/10.12688/f1000research.170665.1

Copyright

© 2025 Jaryani F et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite this article

Jaryani F, Adhikar B, Beheshti S et al. Sixth Annual BCM Hackathon on Structural Variation and Pangenomics [version 1; peer review: 1 approved with reservations]. F1000Research 2025, 14:1231 (https://doi.org/10.12688/f1000research.170665.1)

NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Reviewer Report 21 Nov 2025

Istvan Albert, The Pennsylvania State University, University Park, Pennsylvania, USA

Approved with Reservations

https://doi.org/10.5256/f1000research.188155.r431208

The paper reports on an international three-day hackathon where participants developed eight independent bioinformatics projects on topics such as tandem repeats, mosaic variants, antimicrobial resistance, mobile elements, and metagenome simulation.

While the article documents technical work and diverse tools, it reads more like a compilation of separate mini-papers than a cohesive report. Each section varies in depth, tone, and formatting, with redundant background material and excessive procedural detail.

It’s actually hard to tell how substantial the work really is, because the writing buries that information under implementation details. A clear summary of impact or outcomes is missing. The abstract’s conclusion reads more like a funding report acknowledgment than a scientific takeaway.

Fundamentally, my issue with this paper is that it is far too long and unnecessarily verbose and tiresome. Who is the target audience? It runs to 33 pages !!! for what is essentially a hackathon report.

If the authors were to feed the text into an LLM and ask it to condense, reorganize, and standardize the structure, the result would likely be vastly improved. As it stands, few readers will have the patience to wade through 33 pages filled with details like which exact version of bcftools was used.

On a more positive note, there were nearly fifty contributors from multiple countries working on structural variation, metagenomics, and computational genomics. This diversity of perspectives and datasets is commendable. These aspects justify publication within the “Hackathons” collection, as such reports often emphasize process and community building as much as technical novelty.

But I just wish this had been done in a readable and useful 4-5 pages rather than 33 pages.

Is the rationale for developing the new software tool clearly explained?
Yes

Is the description of the software tool technically sound?
Yes

Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Partly

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Partly

Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Partly

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Bioinformatics

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

References

[1] 1000 Genomes Project Consortium: A Global Reference for Human Genetic Variation. Nature. 2015; 526(7571): 68–74.

[2] Agustinho DP, Fu Y, Menon VK, et al.: Unveiling Microbial Diversity: Harnessing Long-Read Sequencing Technology. Nat. Methods. 2024; 21(6): 954–966.

[3] Alcock BP, Huynh W, Chalil R, et al.: CARD 2023: Expanded Curation, Support for Machine Learning, and Resistome Prediction at the Comprehensive Antibiotic Resistance Database. Nucleic Acids Res. 2023; 51(D1): D690–D699.

[4] Bakhtiari M, Park J, Ding Y-C, et al.: Variable Number Tandem Repeats Mediate the Expression of Proximal Genes. Nat. Commun. 2021; 12(1): 2075.

[5] Biesecker LG, Spinner NB: A Genomic View of Mosaicism and Human Disease. Nat. Rev. Genet. 2013; 14(5): 307–320.

[6] Browning BL, Browning SR: Statistical Phasing of 150,119 Sequenced Genomes in the UK Biobank. Am. J. Hum. Genet. 2023; 110(1): 161–165.

[7] Buchfink B, Reuter K, Drost H-G: Sensitive Protein Alignments at Tree-of-Life Scale Using DIAMOND. Nat. Methods. 2021; 18(4): 366–368.

[8] Buchfink B, Xie C, Huson DH: Fast and Sensitive Protein Alignment Using DIAMOND. Nat. Methods. 2015; 12(1): 59–60.

[9] Busby B, Lesko M; August 2015 and January 2016 Hackathon participants et al.: Closing Gaps between Open Software and Public Data in a Hackathon Setting: User-Centered Software Prototyping. F1000Res. 2016; 5(May): 672.

[10] Butler JM: Genetics and Genomics of Core Short Tandem Repeat Loci Used in Human Identity Testing. J. Forensic Sci. 2006; 51(2): 253–265.

[11] Chénais B: Transposable Elements and Human Diseases: Mechanisms and Implication in the Response to Environmental Pollutants. Int. J. Mol. Sci. 2022; 23(5).

[12] Chikhi R, Raffestin B, Korobeynikov A, et al.: Logan: Planetary-Scale Genome Assembly Surveys Life’s Diversity. bioRxiv. 2024.

[13] Costantino I, Nicodemus J, Chun J: Genomic Mosaicism Formed by Somatic Variation in the Aging and Diseased Brain. Genes. 2021; 12(7): 1071.

[14] Data Portal: n.d. Accessed December 20, 2024.

[15] Deb SK, Kalra D, Kubica J, et al.: The Fifth International Hackathon for Developing Computational Cloud-Based Tools and Resources for Pan-Structural Variation and Genomics. F1000Res. 2024; 13(708): 708.

[16] Delaneau O, Zagury J-F, Marchini J: Improved Whole-Chromosome Phasing for Disease and Population Genetic Studies. Nat. Methods. 2013; 10(1): 5–6.

[17] Depienne C, Mandel J-L: 30 Years of Repeat Expansion Disorders: What Have We Learned and What Are the Remaining Challenges? Am. J. Hum. Genet. 2021; 108(5): 764–785.

[18] Dolzhenko E, English A, Dashnow H, et al.: Characterization and Visualization of Tandem Repeats at Genome Scale. Nat. Biotechnol. 2024; 42(10): 1606–1614.

[19] Ebert P, Audano PA, Zhu Q, et al.: Haplotype-Resolved Diverse Human Genomes and Integrated Analysis of Structural Variation. Science. 2021; 372.

[20] English AC, Dolzhenko E, Jam HZ, et al.: Analysis and Benchmarking of Small and Large Genomic Variants across Tandem Repeats. Nat. Biotechnol. 2024. April, 1–12.

[21] English AC, Menon VK, Gibbs RA, et al.: Truvari: Refined Structural Variant Comparison Preserves Allelic Diversity. Genome Biol. 2022; 23(1): 1–20.

[22] English A, Dolzhenko E, Jam HZ, et al.: Benchmarking of Small and Large Variants across Tandem Repeats. bioRxiv. 2023.

[23] Garg S: Towards Routine Chromosome-Scale Haplotype-Resolved Reconstruction in Cancer Genomics. Nat. Commun. 2023; 14(1): 1–11.

[24] Garg S, Fungtammasan A, Carroll A, et al.: Chromosome-Scale, Haplotype-Resolved Assembly of Human Genomes. Nat. Biotechnol. 2020; 39(3): 309–312.

[25] GenBank and WGS Statistics: 2024. December 10, 2024.

[26] Gluck-Thaler E, Vogan AA: Systematic Identification of Cargo-Mobilizing Genetic Elements Reveals New Dimensions of Eukaryotic Diversity. Nucleic Acids Res. 2024; 52(10): 5496–5513.

[27] Griffiths RC, Marjoram P: An Ancestral Recombination Graph. In Progress in Population Genetics and Human Evolution. Springer; 1997; 257–270.

[28] Hofmeister RJ, Ribeiro DM, Rubinacci S, et al.: Accurate Rare Variant Phasing of Whole-Genome and Whole-Exome Sequencing Data in the UK Biobank. Nat. Genet. 2023; 55(7): 1243–1249.

[29] Hsieh A, Morton SU, Willcox JAL, et al.: EM-Mosaic Detects Mosaic Point Mutations That Contribute to Congenital Heart Disease. Genome Med. 2020; 12(1): 1–18.

[30] Human Genomic Variation: 2023.

[31] Jarvis ED, Formenti G, Rhie A, et al.: Semi-Automated Assembly of High-Quality Diploid Human Reference Genomes. Nature. 2022; 611(7936): 519–531.

[32] Jiang Q, Wang Y, Li Q, et al.: Sequence Characterization of RET in 117 Chinese Hirschsprung Disease Families Identifies a Large Burden of de Novo and Parental Mosaic Mutations. Orphanet J. Rare Dis. 2019; 14(1): 237.

[33] Katz K, Shutov O, Lapoint R, et al.: The Sequence Read Archive: A Decade More of Explosive Growth. Nucleic Acids Res. 2022; 50(D1): D387–D390.

[34] Kobayashi T: Ribosomal RNA Gene Repeats, Their Stability and Cellular Senescence. Proc. Jpn. Acad. Ser. B Phys. Biol. Sci. 2014; 90(4): 119–129.

[35] Lathe WC, Williams JM, Mangan ME, et al.: Genomic Data Resources: Challenges and Promises. Nature Education. 2008; 1(3): 2.

[36] Leitwein M, Duranton M, Rougemont Q, et al.: Using Haplotype Information for Conservation Genomics. Trends Ecol. Evol. 2020; 35(3): 245–258.

[37] Levinson G: Rethinking Evolution: The Revolution That’s Hiding In Plain Sight. World Scientific; 2019.

[38] Lewanski AL, Grundler MC, Bradburd GS: The Era of the ARG: An Empiricist’s Guide to Ancestral Recombination Graphs. 2023.

[39] Liao W-W, Asri M, Ebler J, et al.: A Draft Human Pangenome Reference. Nature. 2023; 617(7960): 312–324.

[40] Li H: Minimap2: Pairwise Alignment for Nucleotide Sequences. Bioinformatics. 2018; 34(18): 3094–3100.

[41] Luo W-S, Qiang D-R, Zhu W-R, et al.: Haplotype Analysis on Association between C-Reactive Protein Gene and Susceptibility to Type 2 Diabetes Mellitus in Chinese Han Population. Acta Diabetol. 2024; 61(11): 1423–1432.

[42] Lupski JR, Montes de Oca-Luna R, Slaugenhaupt S, et al.: DNA Duplication Associated with Charcot-Marie-Tooth Disease Type 1A. Cell. 1991; 66(2): 219–232.

[43] Ma M, Li Y, Dai S, et al.: A Meta-Analysis on the Prevalence of Charcot-Marie-Tooth Disease and Related Inherited Peripheral Neuropathies. J. Neurol. 2023; 270(5): 2468–2482.

[44] Mc Cartney AM, Mahmoud M, et al.: An International Virtual Hackathon to Build Tools for the Analysis of Structural Variants within Species Ranging from Coronaviruses to Vertebrates. F1000Res. 2021; 10(246): 246.

[45] McNulty SM, Sullivan BA: Alpha Satellite DNA Biology: Finding Function in the Recesses of the Genome. Chromosome Research: An International Journal on the Molecular, Supramolecular and Evolutionary Aspects of Chromosome Biology. 2018; 26(3): 115–138.

[46] Meirmans PG, Hedrick PW: Assessing Population Structure: F(ST) and Related Measures. Mol. Ecol. Resour. 2011; 11(1): 5–18.

[47] Mimic/README.md at Main · collaborativebioinformatics/Mimic: GitHub. 2024.

[48] National Library of Medicine: 2024, December.

[49] Navarro Gonzalez J, Zweig AS, Speir ML, et al.: The UCSC Genome Browser Database: 2021 Update. Nucleic Acids Res. 2021; 49(D1): D1046–D1057.

[50] Pagel KA, Kim R, Moad K, et al.: OpenCRAVAT, an Open Source Collaborative Platform for the Annotation of Human Genetic Variation. bioRxiv. 2019.

[51] Project Adotto Tandem-Repeat Regions and Annotations: 2024.

[52] Sakamoto Y, Sereewattanawoot S, Suzuki A: A New Era of Long-Read Sequencing for Cancer Genomics. J. Hum. Genet. 2019; 65(1): 3–10.

[53] Sampson J, Kidd KK, Kidd JR, et al.: Selecting SNPs to Identify Ancestry. Ann. Hum. Genet. 2011; 75(4): 539–553.

[54] Sankareswaran A, Kunte P, Fraser DP, et al.: Type 1 Diabetes Genetic Risk Score Classifies Diabetes Subtypes in Indians: Impact of HLA Diversity on the Lower Discriminative Ability. medRxiv. 2024.

[55] Sapoval N, Liu Y, Curry KD, et al.: Lightweight Taxonomic Profiling of Long-Read Metagenomic Datasets with Lemur and Magnet. bioRxiv. 2024.

[56] Sayers EW, Cavanaugh M, Clark K, et al.: GenBank. Nucleic Acids Res. 2020; 48(D1).

[57] Sedlazeck FJ, Dhroso A, Bodian DL, et al.: Tools for Annotation and Comparison of Structural Variation. F1000Res. 2017; 6(October): 1795.

[58] Shipilina D, Pal A, Stankowski S, et al.: On the Origin and Structure of Haplotype Blocks. Mol. Ecol. 2023; 32(6): 1441–1457.

[59] Stavrou M, Kleopa KA: CMT1A Current Gene Therapy Approaches and Promising Biomarkers. Neural Regen. Res. 2023; 18(7): 1434–1440.

[60] Sugden R, Kelly R, Davies S: Combatting Antimicrobial Resistance Globally. Nat. Microbiol. 2016; 1(10): 16187.

[61] The International HapMap Consortium: The International HapMap Project. Nature. 2003; 426(6968): 789–796.

[62] UK Biobank: 2024. December 10, 2024.

[63] Walker K, Kalra D, Lowdon R, et al.: The Third International Hackathon for Applying Insights into Large-Scale Genomic Composition to Use Cases in a Wide Range of Organisms. F1000Res. 2022; 11(530): 530.

[64] Walters RG, Coin LJM, Ruokonen A, et al.: Rare Genomic Structural Variants in Complex Disease: Lessons from the Replication of Associations with Obesity. PLOS ONE. 2013; 8(3): e58048.

[65] Wang T, Antonacci-Fulton L, Howe K, et al.: The Human Pangenome Project: A Global Resource to Map Genomic Diversity. Nature. 2022; 604(7906): 437–446.

[66] Yang C, Chu J, Warren RL, et al.: NanoSim: Nanopore Sequence Read Simulator Based on Statistical Characterization. GigaScience. 2017; 6(4): 1–6.

[67] Zhang BC, Biddanda A, Gunnarsson ÁF, et al.: Biobank-Scale Inference of Ancestral Recombination Graphs Enables Genealogical Analysis of Complex Traits. Nat. Genet. 2023; 55(5): 768–776.

[68] Zhang J, Bajari R, Andric D, et al.: The International Cancer Genome Consortium Data Portal. Nat. Biotechnol. 2019; 37(4): 367–369.

[69] Zook JM, Catoe D, McDaniel J, et al.: Extensive Sequencing of Seven Human Genomes to Characterize Benchmark Reference Materials. Scientific Data. 2016; 3(1): 1–26.
