The genome sequence of Tethysbaena scabra (Pretus, 1991), the first known in the peracarid crustacean order Thermosbaenacea.

Joan Pons; Karen D. Schöninger-Almaraz; Laura Triginer-Llabrés; Carlos Juan; Damià Jaume; José A. Jurado-Rivera

doi:10.12688/f1000research.161461.3

Genome Note

Revised

[version 3; peer review: 2 approved]

Joan Pons¹, Karen D. Schöninger-Almaraz², Laura Triginer-Llabrés², Carlos Juan¹,³, Damià Jaume¹, José A. Jurado-Rivera³

PUBLISHED 19 Sep 2025

¹ Animal and Microbial Biodiversity, Institut Mediterrani d'Estudis Avancats, Esporles, Illes Balears, 07190, Spain
² Centre Balear de Biodiversitat, Departament de Biologia, Universitat de les Illes Balears, Palma, Balearic Islands, 07122, Spain
³ Biologia, Universitat de les Illes Balears, Palma, Balearic Islands, 07122, Spain

Joan Pons
Roles: Conceptualization, Data Curation, Formal Analysis, Funding Acquisition, Writing – Original Draft Preparation, Writing – Review & Editing

Karen D. Schöninger-Almaraz
Roles: Data Curation, Formal Analysis, Writing – Original Draft Preparation, Writing – Review & Editing

Laura Triginer-Llabrés
Roles: Data Curation, Formal Analysis, Writing – Original Draft Preparation, Writing – Review & Editing

Carlos Juan
Roles: Conceptualization, Writing – Review & Editing

Damià Jaume
Roles: Conceptualization, Resources, Writing – Review & Editing

José A. Jurado-Rivera
Roles: Conceptualization, Funding Acquisition, Writing – Review & Editing

This article is included in the Genomics and Genetics gateway.

Abstract

We present a genome assembly of Tethysbaena scabra (Arthropoda; Crustacea; Malacostraca; Eumalacostraca; Peracarida; Thermosbaenacea; Monodellidae), a species endemic to Mallorca, Spain. The assembly spans 1.18 gigabases and is scaffolded into 17 chromosomes, plus a mitochondrial genome 16.5 kilobases in length.

Keywords

Thermosbaenacea, anchialine environment, stygobiont species, Tethysbaena scabra

Corresponding author: Joan Pons

Competing interests: No competing interests were disclosed.

Grant information: This work has been partially sponsored and promoted by the Institut d'Estudis Catalans (Catalan Biogenome Project grant PRO2021-S02-Jurado). The Catalan Biogenome Project is an EBP-affiliated project network whose objective is to sequence the genomes of more than 40,000 eukaryotic species living in the Catalan Linguistic Area (such as the Balearic Islands). Additional funding was provided by the Govern de les Illes Balears - Conselleria d’Educació i Universitats and by the European Union - Next Generation EU (BIO2022/013A). KDSA and LTL’s work has been partially funded and promoted by the Comunitat Autònoma de les Illes Balears through the Conselleria d'Educació i Universitats and by the European Union - Next Generation EU/PRTR-C17.I1 (SINCO2022/6717). Nevertheless, the views and opinions expressed are solely those of the authors and do not necessarily reflect those of the Conselleria d’Educació i Universitats, the European Union or the European Commission. Therefore, none of these organizations shall be held liable. This study has been funded by GOIB/Conselleria d'Educació i Universitats through the project "SINCO2022/18146" and co-funded by the European Union.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2025 Pons J et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Pons J, Schöninger-Almaraz KD, Triginer-Llabrés L et al. The genome sequence of Tethysbaena scabra (Pretus, 1991), the first known in the peracarid crustacean order Thermosbaenacea. [version 3; peer review: 2 approved]. F1000Research 2025, 14:293 (https://doi.org/10.12688/f1000research.161461.3) First published: 14 Mar 2025, 14:293 (https://doi.org/10.12688/f1000research.161461.1) Latest published: 19 Sep 2025, 14:293 (https://doi.org/10.12688/f1000research.161461.3)

Amendments from Version 2

We corrected the BioSample accession number, reworded a sentence to clarify the number of scaffolds, and replaced Figure 4 so that it shows the results after the second round of filtering in Blobtools; the previous version showed the results of the first round.

See the authors' detailed response to the review by Martin Schwentner
See the authors' detailed response to the review by Pascal Angst

Introduction

Tethysbaena scabra (Pretus, 1991) (NCBI:txid203899) is a thermosbaenacean (Crustacea; Multicrustacea; Malacostraca; Eumalacostraca; Peracarida; Thermosbaenacea; Monodellidae), a relict group of peracarid crustaceans characterized by the display in gravid females of a dorsal brood pouch formed by a posterior extension of the carapace (Figure 1). This species measures 2–3 mm in length and is completely eyeless and depigmented, inhabiting subterranean waters of raised salinity in caves and wells located near the marine coast. It is endemic to the Mediterranean islands of Mallorca and Menorca (Balearic Archipelago). Its feeding habits correspond to those of a particle collector, thriving primarily in the pycnoclines that develop within the water column of anchialine caves, where organic debris, bacteria, and fungi accumulate. There is no available information on genome size and chromosome number in thermosbaenaceans. The closest taxa with known information on genome size (https://www.genomesize.com, 1C values in pg) are within the peracarid groups Isopoda (1.70-8.60); Amphipoda (0.52-64.62); and Mysida (10.81-12.00).

Figure 1. Photograph of a Tethysbaena scabra (qmTetScab1) specimen.

The genome sequence from T. scabra will help to study adaptation to underground environments, particularly anchialine ones, that are characterized by oligotrophy, darkness and salinity. The genome of T. scabra was sequenced under the umbrella of the Catalan Initiative for the Earth BioGenome Project (CBP). Here we present a chromosome-level genome assembly for T. scabra from Mallorca, Spain, which represents the first reference genome for the order Thermosbaenacea.

Methods

Specimens were collected in late spring 2022 with a modified plankton net from the bottom of a well in an old windmill at Es Pil·larí, Palma, Mallorca, Spain (39.533831, 2.747581). Specimens were sorted under a stereomicroscope (Figure 2). Several batches of 20 specimens were each placed in a cryovial, snap-frozen in liquid nitrogen, and subsequently shipped on dry ice to the sequencing facilities. Specimens were collected and identified by Damià Jaume. Extraction of high-molecular-weight DNA, construction of Pacific Biosciences HiFi circular consensus sequencing libraries, and sequencing on a Pacific Biosciences Sequel II (HiFi) instrument were performed by the Delaware Biotechnology Institute, University of Delaware (DE, USA), using a pool of 20 specimens (Accession number: SAMEA113414145, qmTetScab1). Hi-C data were generated from another pool of 20 individuals from the same collection site (Accession number: SAMEA118091338) using an Omni-C DNA library preparation and sequenced 2 × 150 bp on an Illumina NovaSeq 6000 S4 instrument at the Centre Nacional d’Anàlisi Genòmica (CNAG), Barcelona, Spain.

Figure 2. Photograph of Tethysbaena scabra specimens under magnification.

The genome size was estimated using GenomeScope2 (Vurture et al., 2017), and diploidy was confirmed with Smudgeplot (Ranallo-Benavidez et al., 2020). Assembly was conducted using hifiasm (Cheng et al., 2021) with n_hap=40 (considering diploidy and 20 individuals). The large number of haplotypic duplications, presumably caused by the high number of specimens pooled for DNA extraction, was removed with purge_dups (Guan et al., 2020), reducing the assembly from 2208 to 1272 contigs. Genomic DNA was extracted from whole individuals smaller than 5 mm that were not externally cleaned, so the extract could also contain DNA from microbial and other eukaryotic contaminants. Hence, contigs from contaminant species were removed from the assembly using two bioinformatic tools, Foreign Contamination Screen (FCS, Astashyn et al., 2024) and Whokaryote (Pronk and Medema, 2022), leaving 993 contigs. The former aligns assemblies, preprocessed to mask repetitive and low-complexity regions, to a curated reference database. The pipeline segments scaffolds into 100-kb subsequences and employs hashed k-mers as alignment seeds. Sequences assigned to taxonomic groups distinct from the query organism (NCBI:txid203899) were then excluded. The latter is a computational tool that differentiates eukaryotic from prokaryotic contigs based on fundamental differences in gene structure between the two taxonomic domains. It uses a Random Forest approach in combination with Tiara predictions, which incorporate k-mer frequency distributions as classification features. The assembly was scaffolded with Hi-C data (Rao et al., 2014) using YaHS (Zhou et al., 2023), yielding 821 scaffolds. The assembly was then checked for contamination with two rounds of Blobtools, to ensure complete decontamination, leaving 59 scaffolds.
FCS and Whokaryote removed very few sequences compared to BlobToolKit. FCS relies on a reference from a close taxon, which is not available for Thermosbaenacea, and Whokaryote relies on gene structure and domains, whereas BlobToolKit filtering is based on several features (GC content, coverage, BUSCO reference, etc.). The contact map was curated using Pretext (Harry, 2022), which suggested connections between scaffolds and reduced the final assembly from 59 to 17 scaffolds, while retaining 299 gaps of unknown size (each represented as 100 consecutive Ns in the FASTA file). Putative sex chromosomes have not been identified, likely because the genomic material came from a pool of 20 individuals of unknown sex and the Hi-C data were derived from a separate pool of specimens. Additionally, the coverage obtained was not sufficient to infer sex-linked chromosomes. The genome was analysed within the BlobToolKit environment and BUSCO scores were generated (Challis et al., 2020). Table 1 lists the software tool versions used, where appropriate. To assess the assembly, k-mer completeness and consensus quality values (QV) were calculated using Meryl and Merqury (Rhie et al., 2020).
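Because gaps of unknown size are encoded as runs of exactly 100 Ns, the number of standardized gaps can be recounted directly from the assembly FASTA. The following is a minimal sketch of such a count, not the authors' procedure; the function names and the toy record are hypothetical, and a real check would read the deposited multi-FASTA file instead of the in-memory example.

```python
import re
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def count_standard_gaps(sequence: str, gap_length: int = 100) -> int:
    """Count runs of N whose length is exactly gap_length."""
    runs = re.finditer(r"[Nn]+", sequence)
    return sum(1 for run in runs if len(run.group()) == gap_length)


# Hypothetical toy record: two 100-N gaps (one spanning a line break
# boundary is avoided here for clarity) and one shorter 5-N run.
toy_fasta = [
    ">toy_scaffold_1",
    "ACGT" + "N" * 100 + "GGCC",
    "N" * 100 + "TTAA" + "N" * 5 + "A",
]

gap_counts = {name: count_standard_gaps(seq) for name, seq in read_fasta(toy_fasta)}
print(gap_counts)
```

Summing the per-scaffold counts over the whole assembly gives a figure that can be compared with the gap count reported in Table 2.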

Table 1. Software tools: versions and sources.

Software tool	Version	Source
Blastn	2.12.0+	https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html
BlobToolKit	4.3.5	https://github.com/blobtoolkit/blobtoolkit
BUSCO	5.5.0	https://gitlab.com/ezlab/busco/-/archive/5.5.0/busco-5.5.0.zip
FCS	0.5.3	https://github.com/ncbi/fcs
GenomeScope2	2.0	https://github.com/tbenavi1/genomescope2.0
Hifiasm	0.20.0-r639	https://github.com/chhylp123/hifiasm
Merqury	1.3	https://github.com/marbl/merqury
Meryl	1.4.1	https://github.com/marbl/meryl
PretextMap	0.1.9	https://github.com/sanger-tol/PretextMap
RepeatMasker	4.1.7	https://github.com/Dfam-consortium/RepeatMasker
RepeatOBserver	1.0	https://github.com/celphin/RepeatOBserverV1
Purge_dups	1.2.5	https://github.com/dfguan/purge_dups
Smudgeplot	0.3.0	https://github.dev/KamilSJaron/smudgeplot
Whokaryote	1.1.2	https://github.com/LottePronk/whokaryote
YaHS	1.2	https://github.com/c-zhou/yahs

Assembly of the mitochondrial genome with MitoHiFi (Uliano-Silva et al., 2023) failed, likely because genome databanks lack a mitogenome sequence from a sufficiently close taxon. For this reason, contigs were compared, using BLASTn with relaxed parameters, against a database built from the mitogenome sequences of several peracarid species. The 30-kb contig with a positive match was circularized in MitoMaker (Schomaker-Bastos and Prosdocimi, 2018) and annotated in Mitos2 (Donath et al., 2019).

Repeat annotation was performed using RepeatMasker (Smit et al., 2013–2015) and RepeatOBserver (Elphinstone et al., 2025). The former identifies low-complexity DNA regions as well as interspersed repeats. In contrast, RepeatOBserver describes tandem repeats and clusters of transposons in a chromosome-level assembly, based on repeat patterns. It also returns a predicted centromere location for each chromosome.

Results

The genome sequence was obtained from a DNA pool of 20 specimens of T. scabra for HiFi data, plus another pool of 20 specimens for Hi-C data, all collected in a well in Es Pil·larí, Palma, Mallorca, Spain. Two Pacific Biosciences sequencing cells yielded a total of 63.5 gigabases of high-fidelity (HiFi) long reads with an N50 of 13,270 bp, achieving a coverage of 53.8X. Primary contig assemblies were then scaffolded using 73.9 Gb of paired-end Illumina reads derived from Hi-C chromosome conformation capture data. Manual curation corrected 39 misassemblies, including missing joins and misjoins, resulting in a 0.28% reduction in the total assembly length, a 61.02% decrease in scaffold count, and an 89.99% increase in scaffold N50. The final genome assembly spans 1.18 Gb across 23 scaffolds, with a scaffold N50 of 74.6 Mb (Figure 3, Table 2). GC-coverage (Figure 4) and cumulative sequence plots (Figure 5) from BlobToolKit showed minimal parameter variation with few outliers, and only a very small fraction of sequences failed to match arthropod sequences deposited in databases. Most of the assembly sequence (99.2%) was assigned to the final chromosomes. The final assembly sequence confirmed by Hi-C data was assigned to 17 chromosome-level scaffolds, which are numbered in the order in which they appear in the PretextMap (Figure 6; Table 3). The assembly has a BUSCO v5.5.0 (Manni et al., 2021; Simão et al., 2015) completeness of 94.7% (single 93.7%, duplicated 0.7%) using the arthropoda_odb10 reference set. The mitochondrial genome contig can be found within the multi-FASTA file of the genome submission.
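Two of the figures in this paragraph can be reproduced with simple arithmetic. The sketch below assumes that coverage is the HiFi yield divided by the assembly span, and that the decrease in scaffold count refers to the change from 59 scaffolds before curation to 23 afterwards; the function names are hypothetical.

```python
def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Mean sequencing depth: bases sequenced divided by genome size."""
    return total_bases / genome_size


def percent_change(before: float, after: float) -> float:
    """Signed percentage change from 'before' to 'after'."""
    return 100.0 * (after - before) / before


# 63.5 Gb of HiFi reads over a 1.18 Gb assembly.
hifi_coverage = fold_coverage(63.5e9, 1.18e9)

# 59 scaffolds before manual curation, 23 after (assumed reading of the text).
scaffold_change = percent_change(59, 23)

print(f"HiFi coverage: {hifi_coverage:.1f}X")
print(f"Scaffold count change: {scaffold_change:.2f}%")
```

Under these assumptions the results round to 53.8X and a 61.02% decrease, matching the values reported above.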

Figure 3. Snailplot of the genome assembly of Tethysbaena scabra, qmTetScab1.

This snailplot generated by BlobToolKit displays several metrics, including the longest scaffold, N50, and BUSCO gene completeness, among others. The main plot is segmented into 50 bins, ordered by size around the circumference, with each bin representing 2% of the 1.18 Gbp assembly. Scaffold length distribution is shown in dark grey, with the plot radius scaled to the length of the longest scaffold in the assembly (104 Mbp). Orange and light-orange arcs indicate the N50 and N90 scaffold lengths (74.6 Mbp and 55.4 Mbp, respectively). A pale grey spiral illustrates the cumulative scaffold count on a log scale, with white scale lines marking successive orders of magnitude. The blue and pale-blue areas along the plot's outer edge depict the GC, AT, and N content distribution across these bins. A summary of the BUSCO results appears in the figure’s top right corner.

Table 2. Genome data for Tethysbaena scabra, qmTetScab1.1.

Assembly metric benchmarks are adapted from the 6.C.Q40 standard of the Earth BioGenome Project (Lawniczak et al., 2022). BUSCO scores are based on the arthropoda_odb10 BUSCO set using v5.5.0. C = complete, [S = single copy, D = duplicated], F = fragmented, M = missing, n = number of orthologues in comparison.

Project accession data
Assembly name	Tethysbaena scabra
Assembly accession	GCA_964277195
Accession of alternate haplotype	-
Span (Mb)	1200
Number of contigs	322
Contig N50 length (Mb)	6.1
Number of scaffolds	23
Scaffold N50 length (Mb)	74.5
Longest scaffold (Mb)	104.45
Gaps (bp)	299 standardized 100 bp gaps
Assembly metrics		Benchmark
Consensus quality (QV)	50.41	≥40
K-mer completeness	92.5	≥90
BUSCO	C:93.7% [S:93.0%, D:0.7%], F:3.0%, M:3.4%, n:1,013	C ≥90%, D <5%
Percentage of assembly mapped to chromosomes	99.2%	≥90%
Organelles	MT	Complete single alleles
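The entries in Table 2 can be cross-checked against each other. The sketch below is illustrative only: it parses a BUSCO summary string (written here in a normalized form with explicit percent signs, which is an assumption about formatting), verifies that complete equals single plus duplicated and that the categories sum to about 100%, and checks that the number of gaps equals contigs minus scaffolds, since each gap joins two contigs within a scaffold. The helper names are hypothetical.

```python
import re
from typing import Dict


def parse_busco(summary: str) -> Dict[str, float]:
    """Extract the C, S, D, F and M percentages from a BUSCO summary string."""
    return {key: float(value)
            for key, value in re.findall(r"([CSDFM]):(\d+(?:\.\d+)?)%", summary)}


def expected_gaps(n_contigs: int, n_scaffolds: int) -> int:
    """Each scaffold built from k contigs contains k - 1 gaps."""
    return n_contigs - n_scaffolds


# Values taken from Table 2; the string layout is normalized for parsing.
busco = parse_busco("C:93.7%[S:93.0%,D:0.7%],F:3.0%,M:3.4%,n:1013")

# Complete = single + duplicated (within rounding).
complete_matches = abs(busco["C"] - (busco["S"] + busco["D"])) < 0.05
# Complete + fragmented + missing should be about 100% (one-decimal rounding).
sums_to_100 = abs(busco["C"] + busco["F"] + busco["M"] - 100.0) <= 0.15

gaps = expected_gaps(n_contigs=322, n_scaffolds=23)
print(complete_matches, sums_to_100, gaps)
```

With 322 contigs in 23 scaffolds, the expected number of gaps is 299, in agreement with the gap count listed in the table.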

Figure 4. Genome assembly of Tethysbaena scabra, qmTetScab1.1: BlobToolKit GC-coverage plot.

Scaffolds are shown by phylum. Circles are sized in proportion to scaffold length. Histograms show the distribution of scaffold length sum along each axis.

Figure 5. Genome assembly of Tethysbaena scabra: BlobToolKit cumulative sequence plot, qmTetScab1.1.

The gray line represents the cumulative length of all scaffolds, while the colored lines indicate the cumulative lengths of scaffolds assigned to each individual phylum.

Figure 6. Genome assembly of Tethysbaena scabra, qmTetScab1: Hi-C contact map of assembly, visualised using PretextMap.

Chromosomes are shown as they appear in PretextMap, not by size order.

Table 3. Chromosomal pseudomolecules in the genome assembly of Tethysbaena scabra.

https://www.ebi.ac.uk/ena/browser/view/GCA_964277195.1?show=chromosomes.

Accession	Name	Length (Mb)	GC%
OZ195310	tros_1	83.11	33.29
OZ195311	tros_2	104.46	33.18
OZ195312	tros_3	85.72	33.29
OZ195313	tros_4	82.79	33.44
OZ195314	tros_5	87.20	33.33
OZ195315	tros_6	74.56	33.45
OZ195316	tros_7	74.67	33.31
OZ195317	tros_8	72.98	33.51
OZ195318	tros_9	61.70	33.58
OZ195319	tros_10	49.14	33.44
OZ195320	tros_11	56.67	33.72
OZ195321	tros_12	70.05	33.68
OZ195322	tros_13	55.35	33.43
OZ195323	tros_14	59.37	33.46
OZ195324	tros_15	57.12	33.76
OZ195325	tros_16	55.68	33.69
OZ195326	tros_17	45.10	33.69
OZ195327	MT	0.016	32.04
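The scaffold N50 and N90 can be recomputed from the chromosome lengths in Table 3. The sketch below uses only the 17 chromosomal pseudomolecules, which is an assumption: the small unplaced scaffolds (less than 1% of the span) and the mitochondrial genome are ignored, so results may differ marginally from values computed on the full assembly. The function name is hypothetical.

```python
from typing import Sequence


def nx(lengths: Sequence[float], fraction: float) -> float:
    """Length of the shortest sequence among the longest ones that
    together cover at least `fraction` of the total length."""
    total = sum(lengths)
    running = 0.0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    raise ValueError("lengths must not be empty")


# Chromosome lengths in Mb, tros_1 to tros_17, copied from Table 3.
chromosome_mb = [
    83.11, 104.46, 85.72, 82.79, 87.20, 74.56, 74.67, 72.98, 61.70,
    49.14, 56.67, 70.05, 55.35, 59.37, 57.12, 55.68, 45.10,
]

total_mb = sum(chromosome_mb)
n50 = nx(chromosome_mb, 0.5)
n90 = nx(chromosome_mb, 0.9)
print(f"total = {total_mb:.2f} Mb, N50 = {n50} Mb, N90 = {n90} Mb")
```

Under this assumption the N50 is 74.56 Mb and the N90 is 55.35 Mb, consistent with the 74.6 Mbp and 55.4 Mbp shown in the snailplot.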

The genome annotation was assessed using BUSCO, which returned C:93.1% [S:73.2%, D:19.9%], F:2.2%, M:4.7%; the annotation comprises 27,004 transcripts and 22,834 genes. rnaQUAST reported an average alignment length of 1,248.6 bp. Repetitive regions are summarized in Table 4.

Table 4. Summary of the repetitive elements found by RepeatMasker in the genome of Tethysbaena scabra, qmTetScab1.1.

		Number of elements	Length occupied	%
SINEs:		3,285	217,586 bp	0.02%
	ALUs	7	499 bp	0.00%
	MIRs	381	30,645 bp	0.00%
LINEs:		100,876	97,666,309 bp	8.30%
	LINE1	3,138	378,718 bp	0.03%
	LINE2	47,591	44,895,798 bp	3.81%
	L3/CR1	49,210	52,023,230 bp	4.42%
LTR elements:		1,726	541,618 bp	0.05%
	ERVL	80	10,534 bp	0.00%
	ERVL-MaLRs	118	12,225 bp	0.00%
	ERV_classI	1,224	337,524 bp	0.03%
	ERV_classII	46	4,692 bp	0.00%
DNA elements:		39,909	19,071,121 bp	1.62%
	hAT-Charlie	20,903	9,453,122 bp	0.80%
	TcMar-Tigger	3,285	1,466,820 bp	0.12%
Unclassified		20	3,649 bp	0.00%
Total interspersed:			117,500,283 bp	9.98%
Small RNA:		1,757	176,391 bp	0.01%
Satellites:		94	13,096 bp	0.00%
Simple repeats:		552,457	26,333,387 bp	2.24%
Low complexity:		71,177	3,444,953 bp	0.29%
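The interspersed-repeat total in Table 4 can be checked by summing the five top-level classes. The sketch below also back-calculates the sequence length implied by the reported 9.98%; that implied length is only approximate because the percentage is rounded to two decimals, and the variable names are hypothetical.

```python
# Base pairs occupied by each top-level interspersed repeat class (Table 4).
interspersed_bp = {
    "SINEs": 217_586,
    "LINEs": 97_666_309,
    "LTR elements": 541_618,
    "DNA elements": 19_071_121,
    "Unclassified": 3_649,
}

reported_total_bp = 117_500_283
reported_percent = 9.98

# The class totals should add up to the reported interspersed total.
summed_bp = sum(interspersed_bp.values())

# Sequence length implied by the reported percentage (approximate,
# since the percentage is rounded).
implied_length_bp = reported_total_bp / (reported_percent / 100.0)

print(summed_bp == reported_total_bp)
print(f"implied masked-sequence length: {implied_length_bp / 1e9:.3f} Gb")
```

The class totals add up exactly to the reported figure, and the implied length of roughly 1.18 Gb agrees with the assembly span.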

Ethics and consent

Ethical approval and consent were not required.

Author contributions

Conceptualization (JP, CJ, DJ, JAJR), Data Curation (KDSA, LTL, JP), Formal Analysis (LTL, KDSA, JP), Funding Acquisition (JAJR, JP), Resources (DJ), Writing – Original Draft Preparation (LTL, KDSA, JP), and Writing – Review & Editing (all).

Data and software availability

The Tethysbaena scabra genome project is integrated into the Catalan Initiative for the Earth BioGenome Project (CBP), and all raw data and the assembly were deposited in the European Nucleotide Archive: Tethysbaena scabra. Accession number PRJEB61927; https://identifiers.org/ena.embl/PRJEB61927. Raw data and assembly accession identifiers are reported in Table 3.

Acknowledgements

We are thankful to the bioinformaticians Jessica Gómez-Garrido and Tyler Alioto (Centre Nacional d’Anàlisi Genòmica, CNAG) and Emilio Righi (Centre for Genomic Regulation, CRG), all in Barcelona (Spain), for their invaluable assistance.

References

Astashyn A, Tvedte ES, Sweeney D, et al.: Rapid and sensitive detection of genome contamination at scale with FCS-GX. Genome Biol. 2024; 25(1): 60.
Challis R, Richards E, Rajan J, et al.: BlobToolKit–interactive quality assessment of genome assemblies. G3: Genes, Genomes, Genetics. 2020; 10(4): 1361–1374.
Cheng H, Concepcion GT, Feng X, et al.: Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods. 2021; 18(2): 170–175.
Donath A, Jühling F, Al-Arab M, et al.: Improved annotation of protein-coding genes boundaries in metazoan mitochondrial genomes. Nucleic Acids Res. 2019; 47(20): 10543–10552.
Elphinstone C, Elphinstone R, Todesco M, et al.: RepeatOBserver: Tandem Repeat Visualisation and Putative Centromere Detection. Mol. Ecol. Resour. 2025; e14084.
Guan D, McCarthy SA, Wood J, et al.: Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics. 2020; 36(9): 2896–2898.
Harry E: PretextView (Paired REad TEXTure Viewer): A desktop application for viewing pretext contact maps. 2022.
Lawniczak MK, Durbin R, Flicek P, et al.: Standards recommendations for the earth BioGenome project. Proc. Natl. Acad. Sci. 2022; 119(4): e2115639118.
Manni M, Berkeley MR, Seppey M, et al.: BUSCO: assessing genomic data quality and beyond. Curr. Protoc. 2021; 1: e323.
Pronk LJ, Medema MH: Whokaryote: distinguishing eukaryotic and prokaryotic contigs in metagenomes based on gene structure. Microb. Genom. 2022; 8: 000823.
Ranallo-Benavidez TR, Jaron KS, Schatz MC: GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat. Commun. 2020; 11(1): 1432.
Rao SS, Huntley MH, Durand NC, et al.: A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 2014; 159(7): 1665–1680.
Rhie A, Walenz BP, Koren S, et al.: Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 2020; 21: 227–245.
Schomaker-Bastos A, Prosdocimi F: mitoMaker: a pipeline for automatic assembly and annotation of animal mitochondria using raw NGS data. Preprints. 2018.
Simão FA, Waterhouse RM, Ioannidis P, et al.: BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015; 31(19): 3210–3212.
Smit AFA, Hubley R, Green P: RepeatMasker Open-4.0 [Software]. 2013–2015.
Uliano-Silva M, Ferreira JGR, Krasheninnikova K, et al.: MitoHiFi: a python pipeline for mitochondrial genome assembly from PacBio high fidelity reads. BMC Bioinformatics. 2023; 24(1): 288.
Vurture GW, Sedlazeck FJ, Nattestad M, et al.: GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics. 2017; 33(14): 2202–2204.
Zhou C, McCarthy SA, Durbin R: YaHS: yet another Hi-C scaffolding tool. Bioinformatics. 2023; 39(1): btac808.

Article Versions (3)

version 3

Revised

Published: 19 Sep 2025, 14:293

https://doi.org/10.12688/f1000research.161461.3

version 2

Revised

Published: 04 Jul 2025, 14:293

https://doi.org/10.12688/f1000research.161461.2

version 1

Published: 14 Mar 2025, 14:293

https://doi.org/10.12688/f1000research.161461.1

Open Peer Review

Version 3 (published 19 Sep 2025; revised)

Reviewer Report 29 Sep 2025

Martin Schwentner, Naturhistorisches Museum Vienna (Austria), Vienna, Austria

Approved

https://doi.org/10.5256/f1000research.188221.r415498

I think the manuscript is ready for acceptance, the authors ...

Version 2 (published 04 Jul 2025; revised)

Reviewer Report 04 Sep 2025

Martin Schwentner, Naturhistorisches Museum Vienna (Austria), Vienna, Austria

Approved with Reservations

https://doi.org/10.5256/f1000research.183424.r406386

The authors present the first genome of the thermosbaenacean, which will be an important resource for future research. The overall manuscript is well written and structured and the methods and results are appropriate and well presented.
I have a few comments that will help to clarify some issues.

1. The authors described in the Amendments that they now report 299 standardized 100 bp gaps, but I could not find this information in the actual manuscript.

2. I tried to download the genome. Maybe I have missed it, but I could not find the whole genome (I was only able to download the first scaffold) and could not find a gene file or similar. Please make sure that all data is available

3. The contamination check with Blobtools had a strong impact as it removed ~770 of the 820 scaffolds. Due to its impact, it would be very important and helpful if the actual values used were described. Currently only the metrics are named (GC, coverage, BUSCO), but not the specific values or ranges employed.

4. I find it a bit difficult to follow the reported numbers of scaffolds and contigs, and if I understand the text correctly, the final number of contigs and scaffolds does not quite match those from the beginning. 993 contigs were assembled into 821 scaffolds (thus most scaffolds include only one contig). Only 59 scaffolds were retained after Blobtools filtering and the final set is 322 contigs in 23 scaffolds. That should not be possible; even the 59 scaffolds after Blobtools should not hold more than ~200 contigs. Maybe I did not fully understand the numbers; the authors should make this procedure clearer.

Are the rationale for sequencing the genome and the species significance clearly described?

Yes
Are the protocols appropriate and is the work technically sound?

Yes
Are sufficient details of the sequencing and extraction, software used, and materials provided to allow replication by others?

Yes
Are the datasets clearly presented in a usable and accessible format, and the assembly and annotation available in an appropriate subject-specific repository?

Partly

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: genomics, systematics, evolutionary research, crustacea

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Author Response 12 Sep 2025

Joan Pons, Animal and Microbial Biodiversity, Institut Mediterrani d'Estudis Avancats, Esporles, 07190, Spain

Here are the responses to each question:

1. The authors described in the Amendments that they now report 299 standardized 100 bp gaps, but I could not find this information in the actual manuscript.

The information about the 299 standardized 100 bp gaps was included in Table 2. The new version also includes that information in the main text: “The contact map was curated using Pretext (Harry, 2022), which suggested connections between scaffolds and reduced the final assembly from 59 to 17 scaffolds, while retaining 229 scaffolds of unknown size (represented as 100 consecutive Ns in the FASTA file).”.

2. I tried to download the genome. Maybe I have missed it, but I could not find the whole genome (I was only able to download the first scaffold) and could not find a gene file or similar. Please make sure that all data is available

The link in the manuscript, https://identifiers.org/ena.embl/PRJEB61927, resolves to https://www.ebi.ac.uk/ena/browser/view/PRJEB61927 in ENA, where the user can go to the right-hand panel, click on Related ENA Records, and find links to the results of the project:
Result	Count
Experiment (PacBio, Hi-C, and RNA-seq reads)	3
Assembly	1
Genome assembly contig set	1
Sequence (by chromosome)	18
Run (PacBio, Hi-C, and RNA-seq reads)	3
The raw data (PacBio, Hi-C, and RNA-seq reads) can also be downloaded from NCBI under BioProject PRJEB61927.

3. The contamination check with Blobtools had a strong impact as it removed ~770 of the 820 scaffolds. Due to its impact, it would be very important and helpful if the actual value used were described. Currently only the metrices are named (GC, coverage, BUSCO), but not the specific values or ranges employed.

GC, length, and coverage after cut-off in Blobtools are indicated in Figure 4. Here we summarize the cut-off filter values implemented in the two rounds of Blobtools:
First round in Blobtoolkit:
GC_MIN = 0.320
GC_MAX = 0.351
SORTED_ALIGNMENT_COVERAGE_MIN = 36.320
SORTED_ALIGNMENT_COVERAGE_MAX = 680.661
LENGTH_MIN = 10000

Second round in Blobtoolkit:
GC_MIN = 0.330
GC_MAX = 0.340
SORTED_ALIGNMENT_COVERAGE_MIN = 50.000
SORTED_ALIGNMENT_COVERAGE_MAX = 170.000
LENGTH_MIN = 10000

We replaced the results shown in Figure 4, which correspond to the first round of filtering in BlobTools, with those obtained after the second round.
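
The two rounds of cut-offs listed above can be expressed as a small filter. This is a sketch only: the threshold values are copied from the response, but the record layout (`gc`, `coverage`, `length`), the inclusive bounds, and the toy scaffolds are assumptions for illustration and do not reproduce BlobToolKit's own interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cutoffs:
    gc_min: float
    gc_max: float
    cov_min: float
    cov_max: float
    length_min: int


# Values as listed in the author response.
FIRST_ROUND = Cutoffs(0.320, 0.351, 36.320, 680.661, 10000)
SECOND_ROUND = Cutoffs(0.330, 0.340, 50.000, 170.000, 10000)


def passes(record, cutoffs):
    """True if a scaffold lies inside every range (bounds assumed inclusive)."""
    return (
        cutoffs.gc_min <= record["gc"] <= cutoffs.gc_max
        and cutoffs.cov_min <= record["coverage"] <= cutoffs.cov_max
        and record["length"] >= cutoffs.length_min
    )


def filter_scaffolds(records, rounds=(FIRST_ROUND, SECOND_ROUND)):
    """Apply each round in turn and return the scaffolds that survive all of them."""
    kept = list(records)
    for cutoffs in rounds:
        kept = [r for r in kept if passes(r, cutoffs)]
    return kept


# Hypothetical scaffolds, invented purely to exercise the filter.
toy = [
    {"name": "kept_both", "gc": 0.335, "coverage": 100.0, "length": 50000},
    {"name": "first_only", "gc": 0.345, "coverage": 300.0, "length": 50000},
    {"name": "high_gc", "gc": 0.500, "coverage": 100.0, "length": 50000},
    {"name": "too_short", "gc": 0.335, "coverage": 100.0, "length": 5000},
]

after_first = filter_scaffolds(toy, rounds=(FIRST_ROUND,))
after_second = filter_scaffolds(toy)
print([r["name"] for r in after_first], [r["name"] for r in after_second])
```

Because the second round is strictly narrower than the first in GC and coverage, applying both rounds in sequence gives the same result as applying the second round alone; keeping them separate simply mirrors the two-step procedure described.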

4. I find it a bit difficult to follow the reported numbers of scaffolds and contigs, and if I understand the text correctly, the final number of contigs and scaffolds does not quite match those from the beginning. 993 contigs were assembled into 821 scaffolds (thus most scaffolds include only one contig). Only 59 scaffolds were retained after Blobtools filtering and the final set is 322 contigs in 23 scaffolds. That should not be possible, even the 59 scaffolds after Blobtools should not hold more than ~200 contigs. Maybe I did not fully understand the numbers, the authors should make this procedure clearer

Our previous explanation was confusing, so we provide a new wording in the main text: “The contact map was curated using Pretext (Harry, 2022), which suggested connections between scaffolds and reduced the final assembly from 59 to 17 scaffolds, while retaining 229 scaffolds of unknown size (represented as 100 consecutive Ns in the FASTA file).”
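
One way to relate the counts discussed in this exchange, offered as a reading of the quoted figures rather than as the authors' own statement: if every gap is an internal run of Ns, then contigs = scaffolds + gaps, so the 322 contigs in 23 scaffolds cited by the reviewer would correspond to 299 gaps. The sketch below counts N runs in FASTA text to check this relation on any assembly; the toy sequences are invented for illustration.

```python
import re

GAP = re.compile(r"[Nn]+")


def parse_fasta(text):
    """Return {sequence name: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def gap_lengths(sequence):
    """Lengths of every run of N in one scaffold."""
    return [m.end() - m.start() for m in GAP.finditer(sequence)]


def summarize(records):
    """Count scaffolds, gaps, and implied contigs.

    Assumes all N runs are internal; a leading or trailing run would
    not split a scaffold and would make the contig count too high.
    """
    gaps = [g for seq in records.values() for g in gap_lengths(seq)]
    return {
        "scaffolds": len(records),
        "gaps": len(gaps),
        "contigs": len(records) + len(gaps),
        "gap_sizes": sorted(set(gaps)),
    }


# Hypothetical two-scaffold assembly with one standardized 100 bp gap.
toy_fasta = ">scaffold_1\nACGT" + "N" * 100 + "ACGT\n>scaffold_2\nACGTACGT\n"
summary = summarize(parse_fasta(toy_fasta))
print(summary)
```

Running `summarize` on the deposited FASTA would show directly how many gaps it contains and whether they are all 100 bp long.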
Competing Interests: No competing interests were disclosed.

Reviewer Report 22 Aug 2025

Pascal Angst, University of Basel, Basel, Switzerland

Approved

https://doi.org/10.5256/f1000research.183424.r396911

The revised manuscript effectively addresses the comments I made as a reviewer. I appreciate the improvements made and the attention given to the issues I raised. Most of the revisions are clear and well implemented. However, I would be interested ...

The accession added, “SAMEA11313135”, is titled “COVID-HUB-PL deep NGS sequencing of SARS-CoV-2 genomes” in the ENA. Should this be “SAMEA113414145”? – For “SAMEA118091338”, the associated Hi-C data seems to be unavailable on the ENA website.
“In also returns a predicted centromere location for each chromosome.” “It also […]”? Further, did you find any centromeres or centromere-characteristic repeats?
“The genome annotation was assessed using BUSCO obtaining: C:93,1% [S:73,2%, D:19,9%], F:2,2%, M:4,7%, also 27004 transcripts and 22834 genes. RNAQuast has been performed to check the average alignment length, being 1248,6bp.” This is the first time that transcripts, genes, and RNA are mentioned. Is this related to the RNA-seq in PRJEB61927? Was gene annotation performed? If so, how was it done, and where can the annotations be found?

Competing Interests: No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Author Response 11 Sep 2025

Joan Pons, Animal and Microbial Biodiversity, Institut Mediterrani d'Estudis Avancats, Esporles, 07190, Spain

We reply to each question:

1) The accession added, “SAMEA11313135”, is titled “COVID-HUB-PL deep NGS sequencing of SARS-CoV-2 genomes” in the ENA. Should this be “SAMEA113414145”? – For “SAMEA118091338”, the associated Hi-C data seems to be unavailable on the ENA website.

The biosample used to obtain the PacBio data is SAMEA113414145. We apologize for the mistake, which is corrected in the new version. The biosample for the Hi-C data is SAMEA118091338.

The link in the manuscript, https://identifiers.org/ena.embl/PRJEB61927, resolves to https://www.ebi.ac.uk/ena/browser/view/PRJEB61927 in ENA. Then, in the right-hand panel, the user can click on Related ENA Records and find links to all results of the project:
Result: Count
Experiment (PacBio, Hi-C, and RNA-seq reads): 3
Assembly: 1
Genome assembly contig set: 1
Sequence (By Chromosome): 18
Run (PacBio, Hi-C, and RNA-seq reads): 3
The raw data (PacBio, Hi-C, and RNA-seq reads) can also be downloaded from NCBI under BioProject PRJEB61927.

2) “In also returns a predicted centromere location for each chromosome.” “It also […]”? Further, did you find any centromeres or centromere-characteristic repeats?

We did not analyze the sequences of centromere repeats. We just annotated them.

3) “The genome annotation was assessed using BUSCO obtaining: C:93,1% [S:73,2%, D:19,9%], F:2,2%, M:4,7%, also 27004 transcripts and 22834 genes. RNAQuast has been performed to check the average alignment length, being 1248,6bp.” This is the first time that transcripts, genes, and RNA are mentioned. Is this related to the RNA-seq in PRJEB61927? Was gene annotation performed? If so, how was it done, and where can the annotations be found?

We performed a preliminary analysis of the RNA-seq data from another sample (biosample SAMEA117791495) to obtain a preliminary number of coding genes. The workflow consists of several steps. In brief, Illumina paired-end adapters (Nextera PE) were removed from the raw reads using Trimmomatic v0.39. Next, the reference genome assembly was soft-masked with dustmasker, and an index was generated with hisat2-build. The trimmed reads were then aligned to the indexed genome using hisat2, and the resulting alignments were processed with samtools. Finally, gene annotation was performed with BRAKER v3, using a protein database retrieved from OrthoDB as external evidence. The RNA-seq data are available in the project (https://identifiers.org/ena.embl/PRJEB61927); a direct link is https://www.ebi.ac.uk/ena/browser/view/ERX14107725. We intend to produce a full annotation once new funding is obtained.
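
The order of the steps in this preliminary workflow can be laid out as a dry-run command plan. This is a sketch under stated assumptions, not the authors' actual commands: the file names, the `trimmomatic` wrapper name, the adapter-clipping settings (`2:30:10`), and the use of `samtools sort` as the processing step are all illustrative choices. The code only builds and prints the commands; it does not execute them.

```python
from pathlib import Path


def annotation_commands(genome, reads_1, reads_2, proteins, workdir="work"):
    """Return the command lines for each step, in order, without running them."""
    w = Path(workdir)
    trimmed_1 = str(w / "R1.paired.fq.gz")
    trimmed_2 = str(w / "R2.paired.fq.gz")
    masked = str(w / "genome.softmasked.fa")
    index = str(w / "genome_index")
    sam = str(w / "rnaseq.sam")
    bam = str(w / "rnaseq.sorted.bam")
    return [
        # 1. Remove Nextera paired-end adapters (clip settings are assumed).
        ["trimmomatic", "PE", reads_1, reads_2,
         trimmed_1, str(w / "R1.unpaired.fq.gz"),
         trimmed_2, str(w / "R2.unpaired.fq.gz"),
         "ILLUMINACLIP:NexteraPE-PE.fa:2:30:10"],
        # 2. Soft-mask low-complexity regions (FASTA output, masked in lowercase).
        ["dustmasker", "-in", genome, "-out", masked, "-outfmt", "fasta"],
        # 3. Index the masked assembly.
        ["hisat2-build", masked, index],
        # 4. Align the trimmed read pairs.
        ["hisat2", "-x", index, "-1", trimmed_1, "-2", trimmed_2, "-S", sam],
        # 5. Sort the alignments into a BAM file.
        ["samtools", "sort", "-o", bam, sam],
        # 6. Predict genes with RNA-seq and protein evidence.
        ["braker.pl", f"--genome={masked}", f"--bam={bam}", f"--prot_seq={proteins}"],
    ]


# Hypothetical input file names.
commands = annotation_commands(
    "assembly.fa", "rna_R1.fq.gz", "rna_R2.fq.gz", "orthodb_proteins.fa"
)
for cmd in commands:
    print(" ".join(cmd))
```

Keeping the plan as data makes it easy to review each step before passing it to a runner such as `subprocess.run`.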
Competing Interests: No competing interests were disclosed.

Version 1

PUBLISHED 14 Mar 2025

Reviewer Report 16 Apr 2025

Pascal Angst, University of Basel, Basel, Switzerland

Approved with Reservations

https://doi.org/10.5256/f1000research.177497.r376178

This genome note presents the genome of Tethysbaena scabra. The authors sampled two pools of specimens for sequencing using PacBio HiFi and Hi-C technologies. They used the latest software for assembly of sequencing reads and for assessing the assembly’s quality, completeness, and contamination level. They discussed potential sources of contamination and they separated target versus non-target, contaminant contigs based on specialized software. Generally, this article is sound, but I identified a few inconsistencies and lack of detail which I would like the authors to address. I also expected more details on the sequence (variation) of the genome from a genome note, e.g., a summary of the repeat content and other annotations.

Details on the mandatory reviewer questions:

The main rationale presented for sequencing the genome was the Catalan Initiative for the Earth BioGenome Project (CBP). It would thus be nice to have a short explanation of what that is and what its significance is.
The protocols and the work seem technically sound but would profit from more details. For example, it is mentioned that pools of specimens were used for sequencing but not why this was done. Clearly, sequencing a single individual would be preferable. Related to this, the methods state that two pools of 20 specimens were sampled. However, the results state that “identical” pools were used. The word identical is confusing in that context, because in the methods the pools are described as separate pools. Also, on NCBI, there is only one BioSample (a batch of 20 individuals) registered and is linked to the Illumina and the PacBio sequencing, which is not what is described in the article. The BioSample should either be a batch of 40 individuals or there should be two BioSamples of 20 individuals each.

Details on DNA extraction and library preparations are missing. To reproduce the work or to apply it to other systems, it is necessary to know what kits, reagents, and protocols were used.
To replicate the assembly process and subsequent assembly modification the parameters used for the software are missing. Especially, what parameters were used for hifiasm? Given it was not designed for assembly of multiple specimens’ genomes, were the parameters adjusted accordingly? It seems hifiasm had issues in the collapsing step since the authors applied purge_dups, which led to a great reduction in the number of contigs. It is normally not desired to purge haplotigs from hifiasm assemblies and if applied does not result in such a drastic reduction in the number of contigs. I missed discussion of all of this.

“individuals that were not externally cleaned so it could also contain DNA from microbial and other eukaryote contaminants”. Why were the individuals not cleaned before assembly despite that this is known to cause assembly issues?

It is not clear how many contigs remained after each step in the methods. What was the number of contigs after Hi-C scaffolding? In the methods it says 821. It also says 59 contigs were obtained after applying BlobToolKit. Are the other 762 (821 - 59) contigs from contaminants? How does that align with the 322 contigs mentioned in Table 2 and the 17 scaffolds mentioned in the Results?

"The coverage obtained has not been sufficient to deduce sex-linked chromosomes". Would this be a sensible analysis given the pools of specimens? Why is 53.8x coverage not enough? Is this the haploid sequence coverage? From Figure 4, the coverage seems twice as high.
The genome is available from NCBI. However, there are no annotations provided. At least a description of the repetitive content would be valuable. Repetitive content seems to have been assessed since assemblies were "preprocessed to mask repetitive and low-complexity regions".

Additional aspects:

The completeness of the assembly was assessed using BUSCOs and k-mers. Given the chromosome-level assembly, it would be valuable to describe the sequence content and arrangement, and the structural variation. For example, what is the telomeric end repeat motif; what characterizes the centromeres (GC content, sequence content, repeat content); what is the distribution of repeat versus genic content?
It would be important to know the average read lengths or read length N50s.
The keywords should include the full species name.
Are there assembly gaps? What is their size?
There is no phylogenetic analysis. I suggest including one or refer to a previous solid (whole genome) phylogeny.

Are the rationale for sequencing the genome and the species significance clearly described?
Partly

Are the protocols appropriate and is the work technically sound?
Partly

Are sufficient details of the sequencing and extraction, software used, and materials provided to allow replication by others?
Partly

Are the datasets clearly presented in a usable and accessible format, and the assembly and annotation available in an appropriate subject-specific repository?
Partly

Competing Interests: No competing interests were disclosed.

Author Response 11 Sep 2025

Joan Pons, Animal and Microbial Biodiversity, Institut Mediterrani d'Estudis Avancats, Esporles, 07190, Spain

The main rationale presented for sequencing the genome was the Catalan Initiative for the Earth BioGenome Project (CBP). It would thus be nice to have a short explanation of what that is and what its significance is. ** We added additional information as suggested. “The Catalan Biogenome is EBP-affiliated project network with the objective of sequencing the genome of more than 40000 eukaryotic species living in the Catalan Linguistic Area (such as Balearic Islands)”.

The protocols and the work seem technically sound but would profit from more details. For example, it is mentioned that pools of specimens were used for sequencing but not why this was done. ** We agree that our wording was confusing, so we rewrote the text to clarify the issue: “Extraction of high molecular weight DNA, construction of Pacific Biosciences HiFi circular consensus DNA sequencing libraries, and sequencing on Pacific Biosciences SEQUEL II (HiFi) instrument was performed by Delaware Biotechnology Institute, University of Delaware (DE, USA) using a pool of 20 specimens (Accession number: SAMEA113414145 qmTetScab1). Hi-C data was generated from another pool of 20 individuals from the same collection site (Accession number: SAMEA118091338) using the library preparation Omni-C DNA and sequenced 2 x 150 bp on the Illumina NovaSeq 6000 S4 instrument at the Centre Nacional de Seqüenciació Genòmica (CNAG), Barcelona, Spain.”

Details on DNA extraction and library preparations are missing. To reproduce the work or to apply it to other systems, it is necessary to know what kits, reagents, and protocols were used. ** For the HiFi sequencing, a DNA library was prepared using the PacBio SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences of California, CA, USA) following the official protocol. Approximately 300 ng of high-quality genomic DNA was sheared to ~15–20 kb, repaired, and ligated with SMRTbell adapters to create circular molecules. The library was size-selected to remove fragments smaller than 1,000 bp, purified with AMPure PB beads, and quality-checked using Qubit and TapeStation. Finally, it was sequenced on the PacBio Sequel II platform in CCS mode to generate highly accurate HiFi reads suitable for genome assembly and analysis. These steps were performed at the University of Delaware (USA). ** Hi-C libraries were prepared using the Omni-C Dovetail protocol (Cantata Bio, CA, USA). Briefly, chromatin was extracted from frozen tissue and crosslinked with formaldehyde to preserve the native three-dimensional genome architecture by stabilizing DNA-protein and DNA-DNA interactions within the nucleus. The crosslinked chromatin was then fragmented using DNase I, and spatially proximal DNA ends were ligated to capture physical interactions between genomic regions, providing long-range linkage information useful for validating and scaffolding genome assemblies. Hi-C library preparation and sequencing were performed at the Centre Nacional d’Anàlisi Genòmica (CNAG), Barcelona, Spain.

To replicate the assembly process and subsequent assembly modification the parameters used for the software are missing. Especially, what parameters were used for hifiasm? Given it was not designed for assembly of multiple specimens’ genomes, were the parameters adjusted accordingly? It seems hifiasm had issues in the collapsing step since the authors applied purge_dups, which led to a great reduction in the number of contigs. It is normally not desired to purge haplotigs from hifiasm assemblies and if applied does not result in such a drastic reduction in the number of contigs. I missed discussion of all of this. ** The number of haplotypes was set using the --n-hap parameter in hifiasm. Given that our species is presumably diploid, as estimated by Smudgeplot, and the sequencing pool included 20 individuals, the parameter was set to 40 (2 haplotypes × 20 individuals). However, due to the presence of multiple individuals, hifiasm may struggle to accurately resolve haplotypes, which may necessitate the use of purge_dups to remove redundant contigs. ** We suspect that most of the duplications are due to the DNA being sourced from a pool of 20 individuals, as a single individual did not provide enough material to construct a HiFi library. We clarified this point in the main text: “The genome size was estimated using GenomeScope2 (Vurture et al., 2017), and diploidy was confirmed with Smudgeplot (Ranallo-Benavidez et al., 2020). Assembly was conducted using hifiasm (Cheng et al., 2021) with --n-hap 40 (considering diploidy and 20 individuals). The large number of haplotypic duplications, presumably caused by the high number of specimens used for DNA extraction, was removed with purge_dups (Guan et al., 2020), reducing the assembly from 2208 to 1272 contigs”. We would like to point out that the final genome size after purging duplicates and removing contamination matched the size initially predicted by GenomeScope.
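The arithmetic behind this response can be checked with a short sketch. It is only an illustration of the numbers quoted above (a diploid species, a pool of 20 individuals, and 2208 contigs before versus 1272 after purge_dups); the function names are hypothetical and are not part of hifiasm or purge_dups.

```python
def expected_haplotypes(ploidy: int, individuals: int) -> int:
    """Upper bound on distinct haplotypes in a pooled sample (illustrative helper)."""
    return ploidy * individuals


def contig_reduction(before: int, after: int) -> tuple[int, float]:
    """Return the number of contigs removed and the fraction of the original set."""
    removed = before - after
    return removed, removed / before


# Values taken from the author response: diploid species, pool of 20 specimens.
n_hap = expected_haplotypes(ploidy=2, individuals=20)

# Contig counts before and after purge_dups, as quoted from the main text.
removed, fraction = contig_reduction(before=2208, after=1272)

print(f"haplotypes assumed: {n_hap}")
print(f"contigs removed by purging: {removed} ({fraction:.1%})")
```

Under these figures, purging removed 936 contigs, roughly 42% of the initial set, which is the reduction the reviewer describes as drastic.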

“individuals that were not externally cleaned so it could also contain DNA from microbial and other eukaryote contaminants”. Why were the individuals not cleaned before assembly despite that this is known to cause assembly issues? ** Specimens were isolated from well water and individually collected to minimize contamination from other macroscopic species. However, this approach did not prevent contamination by microscopic organisms.

It is not clear how many contigs remained after each step in the methods. What was the number of contigs after Hi-C scaffolding? In the methods it says 821. It also says 59 contigs were obtained after applying BlobToolKit. Are the other 762 (821 - 59) contigs from contaminants? How does that align with the 322 contigs mentioned in Table 2 and the 17 scaffolds mentioned in the Results? ** We rewrote the text for clarity and included a new sentence in the Methods section.

"The coverage obtained has not been sufficient to deduce sex-linked chromosomes". Would this be a sensible analysis given the pools of specimens? Why is 53.8x coverage not enough? Is this the haploid sequence coverage? From Figure 4, the coverage seems twice as high. ** Several factors hindered the identification of sex chromosomes in our diploid species. Most prominently, the characteristic haploid coverage pattern typically associated with sex chromosomes was absent. Furthermore, the genome assembly and scaffolding were performed using two separate DNA pools without prior knowledge of the individuals’ sex, complicating the detection of sex-specific sequences. In addition, the lack of biological information on the species and genus—particularly whether sex determination is chromosomal or genetic—further limits the applicability of standard methods for identifying sex chromosomes.

The genome is available from NCBI. However, there are no annotations provided. At least a description of the repetitive content would be valuable. Repetitive content seems to have been assessed since assemblies were "preprocessed to mask repetitive and low-complexity regions". ** We added the annotation for the repetitive sequences as requested.

The completeness of the assembly was assessed using BUSCOs and k-mers. Given the chromosome-level assembly, it would be valuable to describe the sequence content and arrangement, and the structural variation. For example, what is the telomeric end repeat motif; what characterizes the centromeres (GC content, sequence content, repeat content); what is the distribution of repeat versus genic content? ** We present a new table summarizing the chromosomal location, composition, and sequences of the repetitive DNA elements.

It would be important to know the average read lengths or read length N50s. ** The read length N50 of PacBio raw reads has been added to the results section.

The keywords should include the full species name. ** The species name has been added to the keywords section.

Are there assembly gaps? What is their size? ** The final assembly contains 299 gaps, each 100 bp in length. This is due to the default behavior of tools such as hifiasm, YaHS, and PretextMap, which insert standardized 100 bp gaps when the true gap size cannot be determined.
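A gap count like the one reported above can be reproduced by scanning scaffold sequences for runs of N. The following is a minimal sketch on toy sequences, not the authors' procedure; the helper names and the toy scaffolds are assumptions made for illustration.

```python
import re
from collections import Counter


def gap_lengths(seq: str) -> list[int]:
    """Lengths of every run of N (or n) in one sequence."""
    return [m.end() - m.start() for m in re.finditer(r"[Nn]+", seq)]


def summarize_gaps(records: dict[str, str]) -> Counter:
    """Map gap size -> number of gaps of that size across all sequences."""
    sizes = Counter()
    for seq in records.values():
        sizes.update(gap_lengths(seq))
    return sizes


# Toy scaffolds (hypothetical): one with two 100 bp gaps, one without gaps.
toy = {
    "scaffold_1": "ACGT" + "N" * 100 + "GGCC" + "N" * 100 + "TT",
    "scaffold_2": "ACGTACGT",
}

summary = summarize_gaps(toy)
total_gaps = sum(summary.values())
total_gap_bp = sum(size * count for size, count in summary.items())

print(dict(summary), total_gaps, total_gap_bp)
```

Applied to the figures in the response, 299 gaps of 100 bp each would amount to 29,900 bp of placeholder sequence in the final assembly.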

There is no phylogenetic analysis. I suggest including one or refer to a previous solid (whole genome) phylogeny. ** We appreciate the suggestion to include a phylogenetic analysis. However, given the current lack of available genomic data from other species in the peracarid crustacean order, we believe that conducting a robust phylogenetic analysis at this stage would not be sufficiently reliable. We agree that such an analysis would be highly valuable, particularly once more genomic data from related taxa become available, and it is a future goal of our research group.
Competing Interests: No competing interests

COMMENTS ON THIS REPORT

Author Response 11 Sep 2025

Joan Pons, Animal and Microbial Biodiversity, Institut Mediterrani d'Estudis Avancats, Esporles, 07190, Spain

The main rationale presented for sequencing the genome was the Catalan Initiative for the Earth BioGenome Project (CBP). It would thus be nice to have a short explanation of what that is and what its significance is. ** We added additional information as suggested: “The Catalan Biogenome is an EBP-affiliated project network with the objective of sequencing the genomes of more than 40,000 eukaryotic species living in the Catalan linguistic area (such as the Balearic Islands)”.

The protocols and the work seem technically sound but would profit from more details. For example, it is mentioned that pools of specimens were used for sequencing but not why this was done. ** We agree that our wording was confusing, so we rewrote the text to clarify the issue: “Extraction of high molecular weight DNA, construction of Pacific Biosciences HiFi circular consensus DNA sequencing libraries, and sequencing on a Pacific Biosciences Sequel II (HiFi) instrument were performed by the Delaware Biotechnology Institute, University of Delaware (DE, USA) using a pool of 20 specimens (Accession number: SAMEA113414145 qmTetScab1). Hi-C data were generated from another pool of 20 individuals from the same collection site (Accession number: SAMEA118091338) using Omni-C library preparation and sequenced 2 × 150 bp on the Illumina NovaSeq 6000 S4 instrument at the Centre Nacional d’Anàlisi Genòmica (CNAG), Barcelona, Spain.”

Details on DNA extraction and library preparations are missing. To reproduce the work or to apply it to other systems, it is necessary to know what kits, reagents, and protocols were used. ** For the HiFi sequencing, a DNA library was prepared using the PacBio SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences of California, CA, USA) following the official protocol. Approximately 300 ng of high-quality genomic DNA was sheared to ~15–20 kb, repaired, and ligated with SMRTbell adapters to create circular molecules. The library was size-selected to remove fragments smaller than 1,000 bp, purified with AMPure PB beads, and quality-checked using Qubit and TapeStation. Finally, it was sequenced on the PacBio Sequel II platform in CCS mode to generate highly accurate HiFi reads suitable for genome assembly and analysis. These steps were performed at the University of Delaware (USA). ** Hi-C libraries were prepared using the Dovetail Omni-C protocol (Cantata Bio, CA, USA). Briefly, chromatin was extracted from frozen tissue and crosslinked with formaldehyde to preserve the native three-dimensional genome architecture by stabilizing DNA-protein and DNA-DNA interactions within the nucleus. The crosslinked chromatin was then fragmented using DNase I, and spatially proximal DNA ends were ligated to capture physical interactions between genomic regions, providing long-range linkage information useful for validating and scaffolding genome assemblies. Hi-C library preparation and sequencing were performed at the Centre Nacional d’Anàlisi Genòmica (CNAG), Barcelona, Spain.

To replicate the assembly process and subsequent assembly modification the parameters used for the software are missing. Especially, what parameters were used for hifiasm? Given it was not designed for assembly of multiple specimens’ genomes, were the parameters adjusted accordingly? It seems hifiasm had issues in the collapsing step since the authors applied purge_dups, which led to a great reduction in the number of contigs. It is normally not desired to purge haplotigs from hifiasm assemblies and if applied does not result in such a drastic reduction in the number of contigs. I missed discussion of all of this. ** The number of haplotypes was set using the --n-hap parameter in hifiasm. Given that our species is presumably diploid, as estimated by Smudgeplot, and the sequencing pool included 20 individuals, the parameter was set to 40 (2 haplotypes × 20 individuals). However, due to the presence of multiple individuals, hifiasm may struggle to accurately resolve haplotypes, which may necessitate the use of purge_dups to remove redundant contigs. ** We suspect that most of the duplications are due to the DNA being sourced from a pool of 20 individuals, as a single individual did not provide enough material to construct a HiFi library. We clarified this point in the main text: “The genome size was estimated using GenomeScope2 (Vurture et al., 2017), and diploidy was confirmed with Smudgeplot (Ranallo-Benavidez et al., 2020). Assembly was conducted using hifiasm (Cheng et al., 2021) with --n-hap 40 (considering diploidy and 20 individuals). The large number of haplotypic duplications, presumably caused by the high number of specimens used for DNA extraction, was removed with purge_dups (Guan et al., 2020), reducing the assembly from 2208 to 1272 contigs”. We would like to point out that the final genome size after purging duplicates and removing contamination matched the size initially predicted by GenomeScope.

“individuals that were not externally cleaned so it could also contain DNA from microbial and other eukaryote contaminants”. Why were the individuals not cleaned before assembly despite that this is known to cause assembly issues? ** Specimens were isolated from well water and individually collected to minimize contamination from other macroscopic species. However, this approach did not prevent contamination by microscopic organisms.

It is not clear how many contigs remained after each step in the methods. What was the number of contigs after Hi-C scaffolding? In the methods it says 821. It also says 59 contigs were obtained after applying BlobToolKit. Are the other 762 (821 - 59) contigs from contaminants? How does that align with the 322 contigs mentioned in Table 2 and the 17 scaffolds mentioned in the Results? ** We rewrote the text for clarity and included a new sentence in the Methods section.

"The coverage obtained has not been sufficient to deduce sex-linked chromosomes". Would this be a sensible analysis given the pools of specimens? Why is 53.8x coverage not enough? Is this the haploid sequence coverage? From Figure 4, the coverage seems twice as high. ** Several factors hindered the identification of sex chromosomes in our diploid species. Most prominently, the characteristic haploid coverage pattern typically associated with sex chromosomes was absent. Furthermore, the genome assembly and scaffolding were performed using two separate DNA pools without prior knowledge of the individuals’ sex, complicating the detection of sex-specific sequences. In addition, the lack of biological information on the species and genus—particularly whether sex determination is chromosomal or genetic—further limits the applicability of standard methods for identifying sex chromosomes.

The genome is available from NCBI. However, there are no annotations provided. At least a description of the repetitive content would be valuable. Repetitive content seems to have been assessed since assemblies were "preprocessed to mask repetitive and low-complexity regions". ** We added the annotation for the repetitive sequences as requested.

The completeness of the assembly was assessed using BUSCOs and k-mers. Given the chromosome-level assembly, it would be valuable to describe the sequence content and arrangement, and the structural variation. For example, what is the telomeric end repeat motif; what characterizes the centromeres (GC content, sequence content, repeat content); what is the distribution of repeat versus genic content? ** We present a new table summarizing the chromosomal location, composition, and sequences of the repetitive DNA elements.

It would be important to know the average read lengths or read length N50s. ** The read length N50 of PacBio raw reads has been added to the results section.

The keywords should include the full species name. ** The species name has been added to the keywords section.

Are there assembly gaps? What is their size? ** The final assembly contains 299 gaps, each 100 bp in length. This is due to the default behavior of tools such as hifiasm, YaHS, and PretextMap, which insert standardized 100 bp gaps when the true gap size cannot be determined.

There is no phylogenetic analysis. I suggest including one or refer to a previous solid (whole genome) phylogeny. ** We appreciate the suggestion to include a phylogenetic analysis. However, given the current lack of available genomic data from other species in the peracarid crustacean order, we believe that conducting a robust phylogenetic analysis at this stage would not be sufficiently reliable. We agree that such an analysis would be highly valuable, particularly once more genomic data from related taxa become available, and it is a future goal of our research group.
Competing Interests: No competing interests

Version 3

VERSION 3 PUBLISHED 14 Mar 2025

Open Peer Review

Reviewer Status

Reviewer Reports

	Invited Reviewers
	1	2
Version 3 (revision) 19 Sep 25		read
Version 2 (revision) 04 Jul 25	read	read
Version 1 14 Mar 25	read

Pascal Angst, University of Basel, Basel, Switzerland
Martin Schwentner, Naturhistorisches Museum Vienna (Austria), Vienna, Austria

Reviewer Report

29 Sep 2025 | for Version 3

Martin Schwentner, Naturhistorisches Museum Vienna (Austria), Vienna, Austria

Approved

I think the manuscript is ready for acceptance; the authors have responded to all issues raised and revised all relevant sections.

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

genomics, systematics, evolutionary research, crustacea

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer Report

04 Sep 2025 | for Version 2

Martin Schwentner, Naturhistorisches Museum Vienna (Austria), Vienna, Austria

Approved With Reservations

Are the rationale for sequencing the genome and the species significance clearly described?

Yes
Are the protocols appropriate and is the work technically sound?

Yes
Are sufficient details of the sequencing and extraction, software used, and materials provided to allow replication by others?

Yes
Are the datasets clearly presented in a usable and accessible format, and the assembly and annotation available in an appropriate subject-specific repository?

Partly

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

genomics, systematics, evolutionary research, crustacea

Responses (1)

Author Response

12 Sep 2025

Joan Pons, Animal and Microbial Biodiversity, Institut Mediterrani d'Estudis Avancats, Esporles, 07190, Spain

Here are our responses to each question:

1. The authors described in the Amendments that they now report 299 standardized 100 bp gaps, but I could not find this information in the actual manuscript.

The information about the 299 standardized 100 bp gaps was included in Table 2. The new version also includes that information in the main text: “The contact map was curated using Pretext (Harry, 2022), which suggested connections between scaffolds and reduced the final assembly from 59 to 17 scaffolds, while retaining 229 scaffolds of unknown size (represented as 100 consecutive Ns in the FASTA file).”

2. I tried to download the genome. Maybe I have missed it, but I could not find the whole genome (I was only able to download the first scaffold) and could not find a gene file or similar. Please make sure that all data are available.

The link in the manuscript, https://identifiers.org/ena.embl/PRJEB61927, points to https://www.ebi.ac.uk/ena/browser/view/PRJEB61927 in ENA, where the user can go to the right panel, click on Related ENA Records, and find links to the results of the project:
Result Count
Experiment (PacBio, Hi-C, and RNA-seq reads) 3
Assembly 1
Genome assembly contig set 1
Sequence (By Chromosome) 18
Run (PacBio, Hi-C, and RNA-seq reads) 3
The raw PacBio, Hi-C, and RNA-seq reads are also available for download from NCBI under BioProject PRJEB61927.

3. The contamination check with Blobtools had a strong impact, as it removed ~770 of the 820 scaffolds. Due to its impact, it would be very important and helpful if the actual values used were described. Currently only the metrics are named (GC, coverage, BUSCO), but not the specific values or ranges employed.

The GC, length, and coverage values remaining after the cut-offs in BlobToolKit are indicated in Figure 4. Here we summarize the cut-off values applied in the two rounds of BlobToolKit filtering:
First round in BlobToolKit:
GC_MIN = 0.320
GC_MAX = 0.351
SORTED_ALIGNMENT_COVERAGE_MIN = 36.320
SORTED_ALIGNMENT_COVERAGE_MAX = 680.661
LENGTH_MIN = 10000

Second round in Blobtoolkit:
GC_MIN = 0.330
GC_MAX = 0.340
SORTED_ALIGNMENT_COVERAGE_MIN = 50.000
SORTED_ALIGNMENT_COVERAGE_MAX = 170.000
LENGTH_MIN = 10000

We replaced the results shown in Figure 4, which correspond to the first round of filtering in BlobTools, with those obtained after the second round.
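
As an illustration only, the two rounds of cut-offs listed above can be expressed as a simple per-scaffold filter. This is a minimal sketch, not the BlobToolKit implementation: the record fields (`gc`, `coverage`, `length`), the inclusive bounds, and the example scaffolds are assumptions made for the example; only the threshold values come from the response.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cutoffs:
    gc_min: float
    gc_max: float
    cov_min: float
    cov_max: float
    length_min: int


# Threshold values copied from the two filtering rounds in the response.
ROUND_1 = Cutoffs(0.320, 0.351, 36.320, 680.661, 10_000)
ROUND_2 = Cutoffs(0.330, 0.340, 50.000, 170.000, 10_000)


def passes(scaffold, cutoffs):
    """Return True if a scaffold record lies inside all cut-offs.

    Bounds are treated as inclusive, which is an assumption of this sketch.
    """
    return (
        cutoffs.gc_min <= scaffold["gc"] <= cutoffs.gc_max
        and cutoffs.cov_min <= scaffold["coverage"] <= cutoffs.cov_max
        and scaffold["length"] >= cutoffs.length_min
    )


def filter_scaffolds(scaffolds, rounds=(ROUND_1, ROUND_2)):
    """Apply each round of cut-offs in turn and return the survivors."""
    kept = list(scaffolds)
    for cutoffs in rounds:
        kept = [s for s in kept if passes(s, cutoffs)]
    return kept


# Hypothetical scaffold records, for illustration only.
example = [
    {"name": "a", "gc": 0.335, "coverage": 100.0, "length": 50_000},
    {"name": "b", "gc": 0.345, "coverage": 100.0, "length": 50_000},
    {"name": "c", "gc": 0.335, "coverage": 300.0, "length": 50_000},
    {"name": "d", "gc": 0.335, "coverage": 100.0, "length": 5_000},
]
kept_names = [s["name"] for s in filter_scaffolds(example)]
```

In this toy input, scaffolds `b` and `c` would survive the first round but fall outside the narrower GC and coverage windows of the second, and `d` is dropped by the length cut-off.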

4. I find it a bit difficult to follow the reported numbers of scaffolds and contigs, and if I understand the text correctly, the final numbers of contigs and scaffolds do not quite match those from the beginning. 993 contigs were assembled into 821 scaffolds (thus most scaffolds include only one contig). Only 59 scaffolds were retained after Blobtools filtering, and the final set is 322 contigs in 23 scaffolds. That should not be possible; even the 59 scaffolds after Blobtools should not hold more than ~200 contigs. Maybe I did not fully understand the numbers; the authors should make this procedure clearer.

Our previous explanation was confusing, so we provide new wording in the main text: “The contact map was curated using Pretext (Harry, 2022), which suggested connections between scaffolds and reduced the final assembly from 59 to 17 scaffolds, while retaining 229 scaffolds of unknown size (represented as 100 consecutive Ns in the FASTA file).”
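
The relationship between scaffolds, gaps, and contigs discussed above can be checked directly on a FASTA file. The sketch below assumes that every gap is an internal run of N characters and that splitting a scaffold at those runs yields its contigs, so that contigs = scaffolds + gaps. Under that assumption, and if the figures refer to the same assembly version, the 322 contigs in 23 scaffolds quoted by the reviewer agree with the 299 gaps reported by the authors (23 + 299 = 322). The function names and the toy sequences are hypothetical.

```python
import re

GAP = re.compile(r"[Nn]+")


def gap_lengths(sequence):
    """Lengths of all runs of N in one scaffold sequence."""
    return [len(match.group()) for match in GAP.finditer(sequence)]


def contig_count(sequence):
    """Number of non-empty pieces left after splitting at runs of N."""
    return sum(1 for piece in GAP.split(sequence) if piece)


def summarize(scaffolds):
    """Summarize scaffold, gap, and contig counts for a name -> sequence dict."""
    gaps = [g for seq in scaffolds.values() for g in gap_lengths(seq)]
    return {
        "scaffolds": len(scaffolds),
        "gaps": len(gaps),
        "contigs": sum(contig_count(seq) for seq in scaffolds.values()),
        "gap_sizes": sorted(set(gaps)),
    }


# Toy input: one scaffold with two 100 bp gaps, one gap-free scaffold.
toy = {
    "s1": "ACGT" + "N" * 100 + "GGCC" + "N" * 100 + "TTAA",
    "s2": "ACGTACGT",
}
summary = summarize(toy)
```

For a real assembly the sequences would be read from the FASTA file first; a single entry in `gap_sizes` equal to 100 would indicate that all gaps are the standardized placeholders described in the response.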

Competing Interests

No competing interests were disclosed.

Reviewer Report

22 Aug 2025 | for Version 2

Pascal Angst, University of Basel, Basel, Switzerland

Approved

The accession added, “SAMEA11313135”, is titled “COVID-HUB-PL deep NGS sequencing of SARS-CoV-2 genomes” in the ENA. Should this be “SAMEA113414145”? – For “SAMEA118091338”, the associated Hi-C data seems to be unavailable on the ENA website.
“In also returns a predicted centromere location for each chromosome.” “It also […]”? Further, did you find any centromeres or centromere-characteristic repeats?
“The genome annotation was assessed using BUSCO obtaining: C:93,1% [S:73,2%, D:19,9%], F:2,2%, M:4,7%, also 27004 transcripts and 22834 genes. RNAQuast has been performed to check the average alignment length, being 1248,6bp.” This is the first time that transcripts, genes, and RNA are mentioned. Is this related to the RNA-seq in PRJEB61927? Was gene annotation performed? If so, how was it done, and where can the annotations be found?

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Responses (1)

Author Response

11 Sep 2025

Joan Pons, Animal and Microbial Biodiversity, Institut Mediterrani d'Estudis Avancats, Esporles, 07190, Spain

We reply to each question:

1) The accession added, “SAMEA11313135”, is titled “COVID-HUB-PL deep NGS sequencing of SARS-CoV-2 genomes” in the ENA. Should this be “SAMEA113414145”? – For “SAMEA118091338”, the associated Hi-C data seems to be unavailable on the ENA website.

The BioSample used to obtain the PacBio data is SAMEA113414145. We apologize for the mistake and have corrected it in the new version. The BioSample for the Hi-C data is SAMEA118091338.

The link in the manuscript, https://identifiers.org/ena.embl/PRJEB61927, points to https://www.ebi.ac.uk/ena/browser/view/PRJEB61927 in ENA. Then, in the right panel, the user can click on Related ENA Records and find links to all results of the project:
Result Count
Experiment (PacBio, Hi-C, and RNA-seq reads) 3
Assembly 1
Genome assembly contig set 1
Sequence (By Chromosome) 18
Run (PacBio, Hi-C, and RNA-seq reads) 3
The raw PacBio, Hi-C, and RNA-seq reads are also available for download from NCBI under BioProject PRJEB61927.

2) “In also returns a predicted centromere location for each chromosome.” “It also […]”? Further, did you find any centromeres or centromere-characteristic repeats?

We did not analyze the sequences of centromere repeats. We just annotated them.

3) “The genome annotation was assessed using BUSCO obtaining: C:93,1% [S:73,2%, D:19,9%], F:2,2%, M:4,7%, also 27004 transcripts and 22834 genes. RNAQuast has been performed to check the average alignment length, being 1248,6bp.” This is the first time that transcripts, genes, and RNA are mentioned. Is this related to the RNA-seq in PRJEB61927? Was gene annotation performed? If so, how was it done, and where can the annotations be found?

We performed a preliminary analysis of the RNA-seq data from another sample (BioSample SAMEA117791495) to obtain a preliminary number of coding genes. The workflow consists of several steps. In brief, Illumina paired-end adapters (Nextera PE) were removed from the raw reads using Trimmomatic v0.39. Next, the reference genome assembly was soft-masked with dustmasker, and an index was generated with hisat2-build. The trimmed reads were then aligned to the indexed genome using hisat2, and the resulting alignments were processed with samtools. Finally, gene annotation was performed with BRAKER v3, using a protein database retrieved from OrthoDB as external evidence. The RNA-seq data are available in the project (https://identifiers.org/ena.embl/PRJEB61927); a direct link is https://www.ebi.ac.uk/ena/browser/view/ERX14107725. We are still working towards a full annotation, to be completed in the near future once new funding is obtained.
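
The order of the steps described above could be sketched as the following sequence of command lines. This is only an outline under stated assumptions, not the authors' actual commands: all file names are hypothetical, the Trimmomatic clipping settings (`2:30:10`) and thread count are illustrative defaults rather than values from the response, and the `trimmomatic` wrapper name assumes a conda-style installation. The function only builds the commands; it does not run them.

```python
import shlex


def annotation_commands(genome, reads_1, reads_2, proteins, threads=8):
    """Build the command lines for the preliminary annotation workflow.

    Inputs: genome FASTA, paired RNA-seq FASTQ files, and an OrthoDB
    protein FASTA. Output file names below are hypothetical placeholders.
    """
    masked = "genome.softmasked.fa"
    index = "genome_index"
    trimmed_1, trimmed_2 = "trim_1P.fq.gz", "trim_2P.fq.gz"
    return [
        # 1. Remove Nextera paired-end adapters (illustrative clip settings).
        ["trimmomatic", "PE", reads_1, reads_2,
         trimmed_1, "trim_1U.fq.gz", trimmed_2, "trim_2U.fq.gz",
         "ILLUMINACLIP:NexteraPE-PE.fa:2:30:10"],
        # 2. Soft-mask low-complexity regions (lowercase in FASTA output).
        ["dustmasker", "-in", genome, "-out", masked, "-outfmt", "fasta"],
        # 3. Index the soft-masked genome.
        ["hisat2-build", masked, index],
        # 4. Align the trimmed read pairs to the index.
        ["hisat2", "-p", str(threads), "-x", index,
         "-1", trimmed_1, "-2", trimmed_2, "-S", "rnaseq.sam"],
        # 5. Sort the alignments into a BAM file.
        ["samtools", "sort", "-@", str(threads),
         "-o", "rnaseq.sorted.bam", "rnaseq.sam"],
        # 6. Predict genes with RNA-seq and protein evidence.
        ["braker.pl", f"--genome={masked}",
         "--bam=rnaseq.sorted.bam", f"--prot_seq={proteins}"],
    ]


commands = annotation_commands(
    "assembly.fa", "rnaseq_1.fq.gz", "rnaseq_2.fq.gz", "orthodb_proteins.fa"
)
script = "\n".join(shlex.join(cmd) for cmd in commands)
```

Printing `script` gives a shell-style listing that can be reviewed before anything is executed; each tool's own documentation should be consulted for the parameters appropriate to a given data set.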

Competing Interests

No competing interests were disclosed.

Reviewer Report

16 Apr 2025 | for Version 1

Pascal Angst, University of Basel, Basel, Switzerland

Approved With Reservations

The main rationale presented for sequencing the genome was the Catalan Initiative for the Earth BioGenome Project (CBP). It would thus be nice to have a short explanation of what that is and what its significance is.
The protocols and the work seem technically sound but would profit from more details. For example, it is mentioned that pools of specimens were used for sequencing but not why this was done. Clearly, sequencing a single individual would be preferable. Related to this, the methods state that two pools of 20 specimens were sampled. However, the results state that “identical” pools were used. The word identical is confusing in that context, because in the methods the pools are described as separate pools. Also, on NCBI, there is only one BioSample (a batch of 20 individuals) registered and is linked to the Illumina and the PacBio sequencing, which is not what is described in the article. The BioSample should either be a batch of 40 individuals or there should be two BioSamples of 20 individuals each.

Details on DNA extraction and library preparations are missing. To reproduce the work or to apply it to other systems, it is necessary to know what kits, reagents, and protocols were used.
To replicate the assembly process and subsequent assembly modification, the parameters used for the software are missing. Especially, what parameters were used for hifiasm? Given it was not designed for assembly of multiple specimens’ genomes, were the parameters adjusted accordingly? It seems hifiasm had issues in the collapsing step since the authors applied purge_dups, which led to a great reduction in the number of contigs. It is normally not desired to purge haplotigs from hifiasm assemblies and, if applied, it does not result in such a drastic reduction in the number of contigs. I missed discussion of all of this.

“individuals that were not externally cleaned so it could also contain DNA from microbial and other eukaryote contaminants”. Why were the individuals not cleaned before assembly despite that this is known to cause assembly issues?

It is not clear how many contigs remained after each step in the methods. What was the number of contigs after Hi-C scaffolding? In the methods it says 821. It also says 59 contigs were obtained after applying BlobToolKit. Are the other 762 (821 - 59) contigs from contaminants? How does that align with the 322 contigs mentioned in Table 2 and the 17 scaffolds mentioned in the Results?

"The coverage obtained has not been sufficient to deduce sex-linked chromosomes". Would this be a sensible analysis given the pools of specimens? Why is 53.8x coverage not enough? Is this the haploid sequence coverage? From Figure 4, the coverage seems twice as high.
The genome is available from NCBI. However, there are no annotations provided. At least a description of the repetitive content would be valuable. Repetitive content seems to have been assessed since assemblies were "preprocessed to mask repetitive and low-complexity regions".

Additional aspects:

The completeness of the assembly was assessed using BUSCOs and k-mers. Given the chromosome-level assembly, it would be valuable to describe the sequence content and arrangement, and the structural variation. For example, what is the telomeric end repeat motif; what characterizes the centromeres (GC content, sequence content, repeat content); what is the distribution of repeat versus genic content?
It would be important to know the average read lengths or read length N50s.
The keywords should include the full species name.
Are there assembly gaps? What is their size?
There is no phylogenetic analysis. I suggest including one or refer to a previous solid (whole genome) phylogeny.

Are the rationale for sequencing the genome and the species significance clearly described?

Partly
Are the protocols appropriate and is the work technically sound?

Partly
Are sufficient details of the sequencing and extraction, software used, and materials provided to allow replication by others?

Partly
Are the datasets clearly presented in a usable and accessible format, and the assembly and annotation available in an appropriate subject-specific repository?

Partly

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Responses (1)

Author Response

11 Sep 2025

Joan Pons, Animal and Microbial Biodiversity, Institut Mediterrani d'Estudis Avancats, Esporles, 07190, Spain

The main rationale presented for sequencing the genome was the Catalan Initiative for the Earth BioGenome Project (CBP). It would thus be nice to have a short explanation of what that is and what its significance is. ** We added additional information as suggested: “The Catalan Biogenome is an EBP-affiliated project network with the objective of sequencing the genomes of more than 40000 eukaryotic species living in the Catalan Linguistic Area (such as the Balearic Islands)”.
The protocols and the work seem technically sound but would profit from more details. For example, it is mentioned that pools of specimens were used for sequencing but not why this was done. ** We agree that our wording was confusing, so we rewrote the text to clarify the issue: “Extraction of high molecular weight DNA, construction of Pacific Biosciences HiFi circular consensus DNA sequencing libraries, and sequencing on a Pacific Biosciences SEQUEL II (HiFi) instrument were performed by the Delaware Biotechnology Institute, University of Delaware (DE, USA) using a pool of 20 specimens (Accession number: SAMEA113414145 qmTetScab1). Hi-C data were generated from another pool of 20 individuals from the same collection site (Accession number: SAMEA118091338) using the Omni-C library preparation and sequenced 2 × 150 bp on the Illumina NovaSeq 6000 S4 instrument at the Centre Nacional de Seqüenciació Genòmica (CNAG), Barcelona, Spain.”
Details on DNA extraction and library preparations are missing. To reproduce the work or to apply it to other systems, it is necessary to know what kits, reagents, and protocols were used. ** For the HiFi sequencing, a DNA library was prepared using the PacBio SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences of California, CA, USA) following the official protocol. Approximately 300 ng of high-quality genomic DNA was sheared to ~15–20 kb, repaired, and ligated with SMRTbell adapters to create circular molecules. The library was size-selected to remove fragments smaller than 1,000 bp, purified with AMPure PB beads, and quality-checked using Qubit and TapeStation. Finally, it was sequenced on the PacBio Sequel II platform in CCS mode to generate highly accurate HiFi reads suitable for genome assembly and analysis. These steps were performed at the University of Delaware (USA). ** Hi-C libraries were prepared using the Dovetail Omni-C protocol (Cantata Bio, CA, USA). Briefly, chromatin was extracted from frozen tissue and crosslinked with formaldehyde to preserve the native three-dimensional genome architecture by stabilizing DNA-protein and DNA-DNA interactions within the nucleus. The crosslinked chromatin was then fragmented using DNase I, and spatially proximal DNA ends were ligated to capture physical interactions between genomic regions, providing long-range linkage information useful for validating and scaffolding genome assemblies. Hi-C library preparation and sequencing were performed at the Centre Nacional d’Anàlisi Genòmica (CNAG), Barcelona, Spain.
To replicate the assembly process and subsequent assembly modification, the parameters used for the software are missing. Especially, what parameters were used for hifiasm? Given it was not designed for assembly of multiple specimens’ genomes, were the parameters adjusted accordingly? It seems hifiasm had issues in the collapsing step since the authors applied purge_dups, which led to a great reduction in the number of contigs. It is normally not desired to purge haplotigs from hifiasm assemblies and, if applied, it does not result in such a drastic reduction in the number of contigs. I missed discussion of all of this. ** The number of haplotypes was set with the --n-hap parameter in hifiasm. Given that our species is presumably diploid, as estimated by Smudgeplot, and the sequencing pool included 20 individuals, the parameter was set to 40 (2 haplotypes × 20 individuals). However, because multiple individuals are present, hifiasm may struggle to resolve haplotypes accurately, which may necessitate the use of purge_dups to remove redundant contigs. ** We suspect that most of the duplications are due to the DNA being sourced from a pool of 20 individuals, as a single individual did not provide enough material to construct a HiFi library. We clarify this question in the main text: “The genome size was estimated using GenomeScope2 (Vurture et al., 2017), and diploidy was confirmed with Smudgeplot (Ranallo-Benavidez et al., 2020). Assembly was conducted using hifiasm (Cheng et al., 2021) with n_hap=40 (considering diploidy and 20 individuals). The large number of haplotypic duplications, presumably caused by the high number of specimens used for DNA extraction, was removed with purge_dups (Guan et al., 2020), reducing the assembly from 2208 to 1272 contigs”. We would like to point out that the final genome size after purging duplicates and removing contamination matched the size initially predicted by GenomeScope.
“individuals that were not externally cleaned so it could also contain DNA from microbial and other eukaryote contaminants”. Why were the individuals not cleaned before assembly despite that this is known to cause assembly issues? ** Specimens were isolated from well water and individually collected to minimize contamination from other macroscopic species. However, this approach did not prevent contamination by microscopic organisms.
It is not clear how many contigs remained after each step in the methods. What was the number of contigs after Hi-C scaffolding? In the methods it says 821. It also says 59 contigs were obtained after applying BlobToolKit. Are the other 762 (821 - 59) contigs from contaminants? How does that align with the 322 contigs mentioned in Table 2 and the 17 scaffolds mentioned in the Results? ** We rewrote the text for clarity and added a new sentence to the Methods section.
"The coverage obtained has not been sufficient to deduce sex-linked chromosomes". Would this be a sensible analysis given the pools of specimens? Why is 53.8x coverage not enough? Is this the haploid sequence coverage? From Figure 4, the coverage seems twice as high. ** Several factors hindered the identification of sex chromosomes in our diploid species. Most prominently, the characteristic haploid coverage pattern typically associated with sex chromosomes was absent. Furthermore, the genome assembly and scaffolding were performed using two separate DNA pools without prior knowledge of the individuals’ sex, complicating the detection of sex-specific sequences. In addition, the lack of biological information on the species and genus—particularly whether sex determination is chromosomal or genetic—further limits the applicability of standard methods for identifying sex chromosomes.
The genome is available from NCBI. However, there are no annotations provided. At least a description of the repetitive content would be valuable. Repetitive content seems to have been assessed since assemblies were "preprocessed to mask repetitive and low-complexity regions". ** We added the annotation for the repetitive sequences as requested.
The completeness of the assembly was assessed using BUSCOs and k-mers. Given the chromosome-level assembly, it would be valuable to describe the sequence content and arrangement, and the structural variation. For example, what is the telomeric end repeat motif; what characterizes the centromeres (GC content, sequence content, repeat content); what is the distribution of repeat versus genic content? ** We present a new table summarizing the chromosomal location, composition, and sequences of the repetitive DNA elements.
It would be important to know the average read lengths or read length N50s. ** The read length N50 of PacBio raw reads has been added to the results section.
The keywords should include the full species name. ** The species name has been added to the keywords section.
Are there assembly gaps? What is their size? ** The final assembly contains 299 gaps, each 100 bp in length. This is due to the default behavior of tools such as hifiasm, YaHS, and PretextMap, which insert standardized 100 bp gaps when the true gap size cannot be determined.
There is no phylogenetic analysis. I suggest including one or refer to a previous solid (whole genome) phylogeny. ** We appreciate the suggestion to include a phylogenetic analysis. However, given the current lack of genomic data from other species in the peracarid crustacean order, we believe that a phylogenetic analysis conducted at this stage would not be sufficiently reliable. We agree that such an analysis would be highly valuable, particularly once more genomic data from related taxa become available, and it is a future goal of our research group.
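
For readers unfamiliar with the read length N50 requested above, the following is a minimal sketch of the usual definition: the length L such that reads of length L or longer contain at least half of all sequenced bases. The function name and the example lengths are illustrative and are not taken from the data set.

```python
def n50(lengths):
    """Return the N50 of a collection of read (or contig) lengths."""
    if not lengths:
        raise ValueError("n50 requires at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the cumulative sum reaches half
        # of the total bases is the N50.
        if running >= half_total:
            return length


# Illustrative lengths only: total 20, so the threshold is 10 bases.
example_n50 = n50([2, 3, 4, 5, 6])
```

With these toy lengths the cumulative sum of the sorted values (6, then 11) first reaches the threshold at the read of length 5, so `example_n50` is 5. The same function applies unchanged to contig or scaffold lengths.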

Competing Interests

No competing interests

Alongside their report, reviewers assign a status to the article:

Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested

Approved with reservations - A number of small changes, and sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.

Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions

[1] Astashyn A, Tvedte ES, Sweeney D, et al.: Rapid and sensitive detection of genome contamination at scale with FCS-GX. Genome Biol. 2024; 25(1): 60.

[2] Challis R, Richards E, Rajan J, et al.: BlobToolKit–interactive quality assessment of genome assemblies. G3: Genes, Genomes, Genetics. 2020; 10(4): 1361–1374.

[3] Cheng H, Concepcion GT, Feng X, et al.: Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods. 2021; 18(2): 170–175.

[4] Donath A, Jühling F, Al-Arab M, et al.: Improved annotation of protein-coding genes boundaries in metazoan mitochondrial genomes. Nucleic Acids Res. 2019; 47(20): 10543–10552.

[5] Elphinstone C, Elphinstone R, Todesco M, et al.: RepeatOBserver: Tandem Repeat Visualisation and Putative Centromere Detection. Mol. Ecol. Resour. 2025; e14084.

[6] Guan D, McCarthy SA, Wood J, et al.: Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics. 2020; 36(9): 2896–2898.

[7] Harry E: PretextView (Paired REad TEXTure Viewer): A desktop application for viewing pretext contact maps. 2022.

[8] Manni M, Berkeley MR, Seppey M, et al.: BUSCO: assessing genomic data quality and beyond. Curr. Protoc. 2021; 1: e323.

[9] Lawniczak MK, Durbin R, Flicek P, et al.: Standards recommendations for the earth BioGenome project. Proc. Natl. Acad. Sci. 2022; 119(4): e2115639118.

[10] Pronk LJ, Medema MH: Whokaryote: distinguishing eukaryotic and prokaryotic contigs in metagenomes based on gene structure. Microb. Genom. 2022; 8: 000823.

[11] Ranallo-Benavidez TR, Jaron KS, Schatz MC: GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat. Commun. 2020; 11(1): 1432.

[12] Rao SS, Huntley MH, Durand NC, et al.: A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 2014; 159(7): 1665–1680.

[13] Rhie A, Walenz BP, Koren S, et al.: Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 2020; 21: 227–245.

[14] Schomaker-Bastos A, Prosdocimi F: mitoMaker: a pipeline for automatic assembly and annotation of animal mitochondria using raw NGS data. Preprints. 2018.

[15] Simão FA, Waterhouse RM, Ioannidis P, et al.: BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015; 31(19): 3210–3212.

[16] Smit AFA, Hubley R, Green P: RepeatMasker Open-4.0 [Software]. 2013–2015.

[17] Uliano-Silva M, Ferreira JGR, Krasheninnikova K, et al.: MitoHiFi: a python pipeline for mitochondrial genome assembly from PacBio high fidelity reads. BMC Bioinformatics. 2023; 24(1): 288.

[18] Vurture GW, Sedlazeck FJ, Nattestad M, et al.: GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics. 2017; 33(14): 2202–2204.

[19] Zhou C, McCarthy SA, Durbin R: YaHS: yet another Hi-C scaffolding tool. Bioinformatics. 2023; 39(1): btac808.

The genome sequence of Tethysbaena scabra (Pretus, 1991), the first known in the peracarid crustacean order Thermosbaenacea.

Abstract

Keywords

Revised Amendments from Version 2

Introduction

Figure 1. Photograph of a Tethysbaena scabra (qmTetScab1) specimen.

Methods

Figure 2. Photograph of Tethysbaena scabra specimens under magnification.

Table 1. Software tools: versions and sources.

Results

Figure 3. Snailplot of the genome assembly of Tethysbaena scabra, qmTetScab1.

Table 2. Genome data for Tethysbaena scabra, qmTetScab1.1.

Figure 4. Genome assembly of Tethysbaena scabra, qmTetScab1.1: BlobToolKit GC-coverage plot.

Figure 5. Genome assembly of Tethysbaena scabra: BlobToolKit cumulative sequence plot, qmTetScab1.1.

Figure 6. Genome assembly of Tethysbaena scabra, qmTetScab1: Hi-C contact map of assembly, visualised using PretextMap.

Table 3. Chromosomal pseudomolecules in the genome assembly of Tethysbaena scabra.

Table 4. Summary of the repetitive elements found by RepeatMasker in the genome of Tethysbaena scabra, qmTetScab1.1.

Ethics and consent

Author contributions

Data and software availability

Acknowledgements

References
