Amendments from Version 2

F1000Research

2046-1402

F1000 Research Limited

London, UK

10.12688/f1000research.161461.3

Genome Note

Articles

The genome sequence of Tethysbaena scabra (Pretus, 1991), the first known in the peracarid crustacean order Thermosbaenacea.

[version 3; peer review: 2 approved]

Joan Pons (a, 1): Conceptualization, Data Curation, Formal Analysis, Funding Acquisition, Writing – Original Draft Preparation, Writing – Review & Editing. https://orcid.org/0000-0002-4683-8840

Karen D. Schöninger-Almaraz (2): Data Curation, Formal Analysis, Writing – Original Draft Preparation, Writing – Review & Editing. https://orcid.org/0009-0006-8477-2279

Laura Triginer-Llabrés (2): Data Curation, Formal Analysis, Writing – Original Draft Preparation, Writing – Review & Editing. https://orcid.org/0009-0002-4680-1172

Carlos Juan (1, 3): Conceptualization, Writing – Review & Editing. https://orcid.org/0000-0002-6067-2963

Damià Jaume (1): Conceptualization, Resources, Writing – Review & Editing.

José A. Jurado-Rivera (3): Conceptualization, Funding Acquisition, Writing – Review & Editing. https://orcid.org/0000-0003-0999-2803

1 Animal and Microbial Biodiversity, Institut Mediterrani d'Estudis Avancats, Esporles, Illes Balears, 07190, Spain

2 Centre Balear de Biodiversitat, Departament de Biologia, Universitat de les Illes Balears, Palma, Balearic Islands, 07122, Spain

3 Biologia, Universitat de les Illes Balears, Palma, Balearic Islands, 07122, Spain

a jpons@imedea.uib-csic.es

No competing interests were disclosed.

19 9 2025

2025

293

16 9 2025

2025

This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We present a genome assembly of Tethysbaena scabra (Arthropoda; Crustacea; Malacostraca; Eumalacostraca; Peracarida; Thermosbaenacea; Monodellidae), a species endemic to Mallorca, Spain. The assembly spans 1.18 gigabases and is scaffolded into 17 chromosomes, plus a mitochondrial genome 16.5 kilobases in length.

Thermosbaenacea anchialine environment stygobiont species Tethysbaena scabra

Govern de les Illes Balears

Conselleria d’Educació i Universitats and by the European Union - NextGenerationEU (BIO2022/013A)

Institut d'Estudis Catalans

Catalan Biogenome Project (PRO2021-S02-Jurado)

Funding: This work has been partially sponsored and promoted by the Institut d'Estudis Catalans (Catalan Biogenome Project grant PRO2021-S02-Jurado). The Catalan Biogenome Project is an EBP-affiliated project network that aims to sequence the genomes of more than 40,000 eukaryotic species living in the Catalan Linguistic Area (which includes the Balearic Islands). Additional funding was provided by the Govern de les Illes Balears - Conselleria d’Educació i Universitats and by the European Union - Next Generation EU (BIO2022/013A). The work of KDSA and LTL has been partially funded and promoted by the Comunitat Autònoma de les Illes Balears through the Conselleria d'Educació i Universitats and by the European Union - Next Generation EU/PRTR-C17.I1 (SINCO2022/6717). Nevertheless, the views and opinions expressed are solely those of the authors and do not necessarily reflect those of the Conselleria d’Educació i Universitats, the European Union or the European Commission. Therefore, none of these organizations shall be held liable. This study has been funded by GOIB/Conselleria d'Educació i Universitats through the project "SINCO2022/18146" and co-funded by the European Union.

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Revised Amendments from Version 2

We corrected the accession number of the biosample, reworded a sentence to clarify the number of scaffolds, and replaced Figure 4 so that it shows the results after the second round of filtering in Blobtools; the previous version showed the results of the first round.

Introduction

Tethysbaena scabra (Pretus, 1991) (NCBI:txid203899) is a thermosbaenacean (Crustacea; Multicrustacea; Malacostraca; Eumalacostraca; Peracarida; Thermosbaenacea; Monodellidae). Thermosbaenaceans are a relict group of peracarid crustaceans in which gravid females carry a dorsal brood pouch formed by a posterior extension of the carapace (Figure 1). This species measures 2–3 mm in length, is completely eyeless and depigmented, and inhabits subterranean waters of elevated salinity in caves and wells near the coast. It is endemic to the Mediterranean islands of Mallorca and Menorca (Balearic Archipelago). It feeds as a particle collector, thriving primarily in the pycnoclines that develop within the water column of anchialine caves, where organic debris, bacteria, and fungi accumulate. No information is available on genome size or chromosome number in thermosbaenaceans. The closest taxa with known genome sizes (https://www.genomesize.com, 1C values in pg) belong to the peracarid groups Isopoda (1.70–8.60), Amphipoda (0.52–64.62), and Mysida (10.81–12.00).
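To place the 1.18 Gb assembly on the same scale as the 1C values quoted above, the span can be converted to picograms. The following is a minimal sketch, assuming the commonly used conversion factor of about 978 Mbp per pg and treating the assembly span as a proxy for genome size (an assumption, since an assembly can under- or over-represent the true genome); the variable names are illustrative only.

```python
# Rough conversion of an assembly span to a C-value in picograms.
# Assumption: 1 pg of DNA corresponds to about 978 Mbp.
MBP_PER_PG = 978.0


def mbp_to_pg(size_mbp: float) -> float:
    """Convert a genome size in megabase pairs to picograms of DNA."""
    return size_mbp / MBP_PER_PG


# 1.18 Gb assembly span reported for T. scabra, used here as a proxy
# for the haploid genome size (an assumption).
assembly_mbp = 1180.0
c_value_pg = mbp_to_pg(assembly_mbp)

# 1C ranges (pg) quoted in the text for related peracarid groups.
ranges_pg = {
    "Isopoda": (1.70, 8.60),
    "Amphipoda": (0.52, 64.62),
    "Mysida": (10.81, 12.00),
}

# Flag whether the converted value falls inside each reported range.
within = {
    group: low <= c_value_pg <= high
    for group, (low, high) in ranges_pg.items()
}

print(f"Approximate C-value: {c_value_pg:.2f} pg")
for group, inside in within.items():
    print(f"{group}: {'inside' if inside else 'outside'} the reported range")
```

Under these assumptions the assembly corresponds to roughly 1.2 pg, which lies inside the range reported for Amphipoda and below the ranges reported for Isopoda and Mysida.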

Figure 1. Photograph of a Tethysbaena scabra (qmTetScab1) specimen.

The genome sequence of T. scabra will help to study adaptation to subterranean environments, particularly anchialine ones, which are characterized by oligotrophy, darkness and salinity. The genome of T. scabra was sequenced under the umbrella of the Catalan Initiative for the Earth BioGenome Project (CBP). Here we present a chromosome-level genome assembly for T. scabra from Mallorca, Spain, which represents the first reference genome for the order Thermosbaenacea.

Methods

Specimens were collected in late spring 2022 with a modified plankton net from the bottom of a well in an old windmill at Es Pil·larí, Palma, Mallorca, Spain (39.533831, 2.747581), and were sorted under a stereomicroscope (Figure 2). Several batches of 20 specimens each were placed in cryovials, snap-frozen in liquid nitrogen, and subsequently shipped on dry ice to the sequencing facilities. Specimens were collected and identified by Damià Jaume. Extraction of high-molecular-weight DNA, construction of Pacific Biosciences HiFi circular consensus sequencing libraries, and sequencing on a Pacific Biosciences Sequel II (HiFi) instrument were performed by the Delaware Biotechnology Institute, University of Delaware (DE, USA), using a pool of 20 specimens (accession number: SAMEA113414145, qmTetScab1). Hi-C data were generated from another pool of 20 individuals from the same collection site (accession number: SAMEA118091338) using the Omni-C library preparation and sequenced (2 × 150 bp) on an Illumina NovaSeq 6000 S4 instrument at the Centre Nacional d’Anàlisi Genòmica (CNAG), Barcelona, Spain.

Figure 2. Photograph of Tethysbaena scabra specimens under magnification.

The genome size was estimated using GenomeScope2 (Vurture et al., 2017), and diploidy was confirmed with Smudgeplot (Ranallo-Benavidez et al., 2020). Assembly was conducted using hifiasm (Cheng et al., 2021) with n_hap=40 (considering diploidy and 20 individuals). A large number of haplotypic duplications, presumably caused by the high number of specimens pooled for DNA extraction, were removed with purge_dups (Guan et al., 2020), reducing the assembly from 2208 to 1272 contigs. Genomic DNA was extracted from individuals smaller than 5 mm that were not externally cleaned, so the extract could also contain DNA from microbial and other eukaryotic contaminants. Hence, contigs from contaminant species were removed from the assembly using two bioinformatic tools, Foreign Contamination Screen (FCS; Astashyn et al., 2024) and Whokaryote (Pronk and Medema, 2022), leaving 993 contigs. FCS aligns assemblies, preprocessed to mask repetitive and low-complexity regions, to a curated reference database. The pipeline segments scaffolds into 100-kb subsequences and employs hashed k-mers as alignment seeds. Sequences assigned to taxonomic groups distinct from the query organism (NCBI:txid203899) were then excluded. Whokaryote differentiates eukaryotic from prokaryotic contigs based on fundamental differences in gene structure between the two taxonomic domains. It uses a random forest classifier in combination with Tiara predictions, which incorporate k-mer frequency distributions as a classification feature. The assembly was scaffolded with Hi-C data (Rao et al., 2014) using YaHS (Zhou et al., 2023), yielding 821 scaffolds. The assembly was then checked for contamination with two rounds of Blobtools to ensure complete decontamination, leaving 59 scaffolds.

FCS and Whokaryote removed very few sequences compared with BlobToolKit because they rely, respectively, on a close reference taxon, which is not available for Thermosbaenacea, and on gene structure and domains, whereas BlobToolKit filtering is based on several features (GC content, coverage, BUSCO reference, etc.). The contact map was curated using Pretext (Harry, 2022), which suggested connections between scaffolds and reduced the final assembly from 59 to 17 scaffolds, while retaining 229 gaps of unknown size (represented as 100 consecutive Ns in the FASTA file). Putative sex chromosomes have not been identified, likely because the genomic material was sourced from a pool of 20 individuals of unknown sex and the Hi-C data were derived from a separate pool of specimens. Additionally, the coverage obtained was not sufficient to deduce sex-linked chromosomes. The genome was analysed within the BlobToolKit environment and BUSCO scores were generated (Challis et al., 2020). Table 1 lists the software tool versions used, where appropriate. To assess the assembly metrics, k-mer completeness and consensus quality values (QV) were calculated using Meryl and Merqury (Rhie et al., 2020).
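Because gaps of unknown size are encoded as runs of 100 consecutive Ns, they can be counted directly from the sequence. The following is a minimal sketch of such a count on a toy sequence; the function name and the toy data are illustrative and are not part of the pipeline described above.

```python
import re


def count_n_gaps(sequence: str, gap_length: int = 100) -> int:
    """Count runs of N (upper or lower case) that are exactly gap_length long."""
    runs = re.findall(r"[Nn]+", sequence)
    return sum(1 for run in runs if len(run) == gap_length)


# Toy scaffold: two standardized 100-N gaps and one short run of 5 Ns
# that should not be counted as a standardized gap.
toy_scaffold = "ACGT" + "N" * 100 + "GGCC" + "N" * 100 + "TT" + "N" * 5 + "A"

n_gaps = count_n_gaps(toy_scaffold)
print(f"Standardized gaps found: {n_gaps}")
```

Applied to each record of an assembly FASTA file, the per-scaffold counts would sum to the total number of standardized gaps in the assembly.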

Table 1. Software tools: versions and sources.

Software tool	Version	Source
Blastn	2.12.0+	https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html
BlobToolKit	4.3.5	https://github.com/blobtoolkit/blobtoolkit
BUSCO	5.5.0	https://gitlab.com/ezlab/busco/-/archive/5.5.0/busco-5.5.0.zip
FCS	0.5.3	https://github.com/ncbi/fcs
GenomeScope2	2.0	https://github.com/tbenavi1/genomescope2.0
Hifiasm	0.20.0-r639	https://github.com/chhylp123/hifiasm
Merqury	1.3	https://github.com/marbl/merqury
Meryl	1.4.1	https://github.com/marbl/meryl
PretextMap	0.1.9	https://github.com/sanger-tol/PretextMap
RepeatMasker	4.1.7	https://github.com/Dfam-consortium/RepeatMasker
RepeatOBserver	1.0	https://github.com/celphin/RepeatOBserverV1
Purge_dups	1.2.5	https://github.com/dfguan/purge_dups
Smudgeplot	0.3.0	https://github.dev/KamilSJaron/smudgeplot
Whokaryote	1.1.2	https://github.com/LottePronk/whokaryote
YaHS	1.2	https://github.com/c-zhou/yahs

Assembly of the mitochondrial genome with MitoHiFi (Uliano-Silva et al., 2023) failed, likely because genome databases lack a mitogenome sequence from a sufficiently close taxon. For this reason, contigs were compared using a relaxed BLASTn search against a database built from mitogenome sequences of several peracarid species. The 30-kb sequence with a positive match was circularized in MitoMaker (Schomaker-Bastos and Prosdocimi, 2018) and annotated in Mitos2 (Donath et al., 2019).

Repeat annotation was performed using RepeatMasker (Smit et al., 2013–2015) and RepeatOBserver (Elphinstone et al., 2025). RepeatMasker identifies low-complexity DNA regions as well as interspersed repeats. In contrast, RepeatOBserver describes tandem repeats and clusters of transposons in a chromosome-level assembly, based on repeat patterns. It also returns a predicted centromere location for each chromosome.

Results

The genome sequence was obtained from a DNA pool of 20 specimens of T. scabra for HiFi data, plus another pool of 20 specimens for Hi-C data, all collected in a well in Es Pil·larí, Palma, Mallorca, Spain. Two Pacific Biosciences sequencing cells yielded a total of 63.5 gigabases of high-fidelity (HiFi) long reads with an N50 of 13,270 bp, achieving a coverage of 53.8X. Primary contig assemblies were then scaffolded using 73.9 Gb of paired-end Illumina reads derived from chromosome conformation capture (Hi-C) data. Manual curation corrected 39 misassemblies, including missing joins and misjoins, resulting in a 0.28% reduction in the total assembly length, a 61.02% decrease in scaffold count, and an 89.99% increase in scaffold N50. The final genome assembly spans 1.18 Gb across 23 scaffolds, with a scaffold N50 of 74.6 Mb (Figure 3, Table 2). GC-coverage (Figure 4) and cumulative sequence plots (Figure 5) from BlobToolKit showed minimal parameter variation with few outliers, and only a very small fraction of sequences failed to match arthropod sequences deposited in databases. Most of the assembly sequence (99.2%) has been mapped to the final chromosomes. The final assembly sequence confirmed by Hi-C data was assigned to 17 chromosome-level scaffolds, which are designated as they appear in the PretextMap (Figure 6; Table 3). The assembly has a BUSCO v5.5.0 (Manni et al., 2021; Simão et al., 2015) completeness of 93.7% (single 93.0%, duplicated 0.7%), as reported in Table 2, using the arthropoda_odb10 reference set. The mitochondrial genome contig can be found within the multi-FASTA file of the genome submission.
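Two of the figures quoted above can be checked with simple arithmetic. The sketch below assumes that coverage is calculated as HiFi yield divided by assembly span, and that the decrease in scaffold count compares the 59 scaffolds left after Blobtools filtering (see Methods) with the 23 scaffolds of the final assembly; both pairings are our reading of the text rather than a documented calculation.

```python
def fold_coverage(yield_gb: float, span_gb: float) -> float:
    """Sequencing depth as total read yield divided by assembly span."""
    return yield_gb / span_gb


def percent_decrease(before: float, after: float) -> float:
    """Relative decrease from `before` to `after`, in percent."""
    return (before - after) / before * 100


# Values taken from the text.
hifi_yield_gb = 63.5     # total HiFi read yield
assembly_span_gb = 1.18  # final assembly span

coverage = fold_coverage(hifi_yield_gb, assembly_span_gb)

# Assumption: curation reduced the scaffold count from 59 to 23.
scaffold_drop = percent_decrease(before=59, after=23)

print(f"HiFi coverage: {coverage:.1f}X")
print(f"Decrease in scaffold count: {scaffold_drop:.2f}%")
```

Under these assumptions the sketch reproduces the reported 53.8X coverage and the 61.02% decrease in scaffold count.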

Figure 3. Snailplot of the genome assembly of Tethysbaena scabra, qmTetScab1.

This snailplot generated by BlobToolKit displays several metrics, including the longest scaffold, N50, and BUSCO gene completeness, among others. The main plot is segmented into 50 bins, ordered by size around the circumference, with each bin representing 2% of the 1.18 Gbp assembly. Scaffold length distribution is shown in dark grey, with the plot radius scaled to the length of the longest scaffold in the assembly (104 Mbp). Orange and light-orange arcs indicate the N50 and N90 scaffold lengths (74.6 Mbp and 55.4 Mbp, respectively). A pale grey spiral illustrates the cumulative scaffold count on a log scale, with white scale lines marking successive orders of magnitude. The blue and pale-blue areas along the plot's outer edge depict the GC, AT, and N content distribution across these bins. A summary of the BUSCO results appears in the figure’s top right corner.

Table 2. Genome data for Tethysbaena scabra, qmTetScab1.1.

Assembly metric benchmarks are adapted from the 6.C.Q40 standard of the Earth BioGenome Project (Lawniczak et al., 2022). BUSCO scores are based on the arthropoda_odb10 BUSCO set using v5.5.0. C = complete [S = single copy, D = duplicated], F = fragmented, M = missing, n = number of orthologues in comparison.

Project accession data
Assembly name	Tethysbaena scabra
Assembly accession	GCA_964277195
Accession of alternate haplotype	-
Span (Mb)	1200
Number of contigs	322
Contig N50 length (Mb)	6.1
Number of scaffolds	23
Scaffold N50 length (Mb)	74.5
Longest scaffold (Mb)	104.45
Gaps (bp)	299 standardized 100 bp gaps
Assembly metrics		Benchmark
Consensus quality (QV)	50.41	≥40
K-mer completeness	92.5	≥90
BUSCO	C:93.7% [S:93%, D:0.7%], F:3%, M:3.4%, n:1,013	C ≥90%, D <5%
Percentage of assembly mapped to chromosomes	99.2%	≥90%
Organelles	MT	Complete single alleles
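The consensus quality value (QV) in Table 2 is a Phred-scaled score, so it can be converted into an expected per-base error rate with E = 10^(-QV/10). The following is a minimal sketch of that conversion; the function name is illustrative.

```python
def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


assembly_qv = 50.41   # consensus QV reported in Table 2
benchmark_qv = 40.0   # QV threshold listed as the benchmark

assembly_error = qv_to_error_rate(assembly_qv)
benchmark_error = qv_to_error_rate(benchmark_qv)

print(f"Assembly error rate: {assembly_error:.2e} per base")
print(f"Benchmark error rate: {benchmark_error:.2e} per base")
print(f"About one expected error every {round(1 / assembly_error):,} bases")
```

A QV of 50.41 therefore corresponds to roughly nine expected errors per million bases, about an order of magnitude below the error rate implied by the QV 40 benchmark.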

Figure 4. Genome assembly of Tethysbaena scabra, qmTetScab1.1: BlobToolKit GC-coverage plot.

Scaffolds are shown by phylum. Circles are sized in proportion to scaffold length. Histograms show the distribution of scaffold length sum along each axis.

Figure 5. Genome assembly of Tethysbaena scabra, qmTetScab1.1: BlobToolKit cumulative sequence plot.

The gray line represents the cumulative length of all scaffolds, while the colored lines indicate the cumulative lengths of scaffolds assigned to each individual phylum.

Figure 6. Genome assembly of Tethysbaena scabra, qmTetScab1: Hi-C contact map of assembly, visualised using PretextMap.

Chromosomes are shown as they appear in PretextMap, not by size order.

Table 3. Chromosomal pseudomolecules in the genome assembly of Tethysbaena scabra.

https://www.ebi.ac.uk/ena/browser/view/GCA_964277195.1?show=chromosomes.

Accession	Name	Length (Mb)	GC%
OZ195310	tros_1	83.11	33.29
OZ195311	tros_2	104.46	33.18
OZ195312	tros_3	85.72	33.29
OZ195313	tros_4	82.79	33.44
OZ195314	tros_5	87.20	33.33
OZ195315	tros_6	74.56	33.45
OZ195316	tros_7	74.67	33.31
OZ195317	tros_8	72.98	33.51
OZ195318	tros_9	61.70	33.58
OZ195319	tros_10	49.14	33.44
OZ195320	tros_11	56.67	33.72
OZ195321	tros_12	70.05	33.68
OZ195322	tros_13	55.35	33.43
OZ195323	tros_14	59.37	33.46
OZ195324	tros_15	57.12	33.76
OZ195325	tros_16	55.68	33.69
OZ195326	tros_17	45.10	33.69
OZ195327	MT	0.016	32.04
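The scaffold N50 and N90 values can be recomputed from the chromosome lengths in Table 3. The following is a minimal sketch that uses only the 17 chromosomal pseudomolecules; leaving out the unplaced scaffolds and the mitochondrial contig is an assumption of this example, so the result is an approximation of the assembly-wide statistic.

```python
def nx(lengths: list[float], fraction: float) -> float:
    """Return the length L such that sequences of length >= L cover
    at least `fraction` of the total length (N50 when fraction = 0.5)."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    threshold = sum(lengths) * fraction
    running = 0.0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return min(lengths)  # only reached through floating-point edge cases


# Chromosome lengths in Mb, tros_1 to tros_17, as listed in Table 3.
chromosome_mb = [
    83.11, 104.46, 85.72, 82.79, 87.20, 74.56, 74.67, 72.98, 61.70,
    49.14, 56.67, 70.05, 55.35, 59.37, 57.12, 55.68, 45.10,
]

n50 = nx(chromosome_mb, 0.5)
n90 = nx(chromosome_mb, 0.9)

print(f"Total chromosome length: {sum(chromosome_mb):.2f} Mb")
print(f"N50: {n50} Mb, N90: {n90} Mb")
```

With these inputs the sketch returns 74.56 Mb and 55.35 Mb, in line with the N50 and N90 values quoted for the snailplot (74.6 Mbp and 55.4 Mbp).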

The genome annotation was assessed using BUSCO, yielding C:93.1% [S:73.2%, D:19.9%], F:2.2%, M:4.7%; the annotation comprises 27,004 transcripts and 22,834 genes. rnaQUAST was run to check the average alignment length, which was 1248.6 bp. Repetitive regions are summarized in Table 4.

Table 4. Summary of the repetitive elements found by RepeatMasker in the genome of Tethysbaena scabra, qmTetScab1.1.

		Number of elements	Length occupied	%
SINEs:		3,285	217,586 bp	0.02%
	ALUs	7	499 bp	0.00%
	MIRs	381	30,645 bp	0.00%
LINEs:		100,876	97,666,309 bp	8.30%
	LINE1	3,138	378,718 bp	0.03%
	LINE2	47,591	44,895,798 bp	3.81%
	L3/CR1	49,210	52,023,230 bp	4.42%
LTR elements:		1,726	541,618 bp	0.05%
	ERVL	80	10,534 bp	0.00%
	ERVL-MaLRs	118	12,225 bp	0.00%
	ERV_classI	1,224	337,524 bp	0.03%
	ERV_classII	46	4,692 bp	0.00%
DNA elements:		39,909	19,071,121 bp	1.62%
	hAT-Charlie	20,903	9,453,122 bp	0.80%
	TcMar-Tigger	3,285	1,466,820 bp	0.12%
Unclassified:		20	3,649 bp	0.00%
Total interspersed repeats:			117,500,283 bp	9.98%
Small RNA:		1,757	176,391 bp	0.01%
Satellites:		94	13,096 bp	0.00%
Simple repeats:		552,457	26,333,387 bp	2.24%
Low complexity:		71,177	3,444,953 bp	0.29%
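The interspersed-repeat total in Table 4 can be cross-checked by summing the lengths of the five top-level classes. The following is a minimal sketch using the values from the table; the dictionary and variable names are illustrative.

```python
# Length occupied (bp) by each top-level interspersed repeat class in Table 4.
interspersed_bp = {
    "SINEs": 217_586,
    "LINEs": 97_666_309,
    "LTR elements": 541_618,
    "DNA elements": 19_071_121,
    "Unclassified": 3_649,
}

reported_total_bp = 117_500_283  # "Total interspersed repeats" row

computed_total_bp = sum(interspersed_bp.values())

# Share of the interspersed-repeat content contributed by each class.
share_percent = {
    name: length / computed_total_bp * 100
    for name, length in interspersed_bp.items()
}

print(f"Computed total: {computed_total_bp:,} bp")
print(f"Matches reported total: {computed_total_bp == reported_total_bp}")
for name, share in share_percent.items():
    print(f"{name}: {share:.1f}% of interspersed repeats")
```

The class lengths add up exactly to the reported total, and LINEs account for about 83% of the interspersed-repeat content.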

Ethics and consent

Ethical approval and consent were not required.

Author contributions

Conceptualization (JP, CJ, DJ, JAJR), Data Curation (KDSA, LTL, JP), Formal Analysis (LTL, KDSA, JP), Funding Acquisition (JAJR, JP), Resources (DJ), Writing – Original Draft Preparation (LTL, KDSA, JP), and Writing – Review & Editing (all).

Data and software availability

The Tethysbaena scabra genome project is integrated into the Catalan Initiative for the Earth BioGenome Project (CBP), and all raw data and the assembly were deposited in the European Nucleotide Archive: Tethysbaena scabra, accession number PRJEB61927; https://identifiers.org/ena.embl/PRJEB61927. Raw data and assembly accession identifiers are reported in Table 3.

Acknowledgements

We thank the bioinformaticians Jessica Gómez-Garrido and Tyler Alioto (Centre Nacional d’Anàlisi Genòmica, CNAG) and Emilio Righi (Centre for Genomic Regulation, CRG), both located in Barcelona (Spain), for their invaluable assistance.

References

Astashyn, Tvedte, Sweeney: Rapid and sensitive detection of genome contamination at scale with FCS-GX. Genome Biol. 2024;25(1):60. PMID: 38409096; DOI: 10.5281/zenodo.10651084; PMCID: PMC10898089

Challis, Richards, Rajan: BlobToolKit–interactive quality assessment of genome assemblies. G3: Genes, Genomes, Genetics. 2020;10(4):1361–1374. PMID: 32071071; DOI: 10.1534/g3.119.400908; PMCID: PMC7144090

Cheng, Concepcion, Feng: Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods. 2021;18(2):170–175. PMID: 33526886; DOI: 10.1038/s41592-020-01056-5; PMCID: PMC7961889

Donath, Jühling, Al-Arab: Improved annotation of protein-coding genes boundaries in metazoan mitochondrial genomes. Nucleic Acids Res. 2019;47(20):10543–10552. PMID: 31584075; DOI: 10.1093/nar/gkz833; PMCID: PMC6847864

Elphinstone, Todesco: RepeatOBserver: Tandem Repeat Visualisation and Putative Centromere Detection. Mol. Ecol. Resour. 2025; e14084. DOI: 10.1111/1755-0998.14084

Guan, McCarthy, Wood: Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics. 2020;36(9):2896–2898. PMID: 31971576; DOI: 10.1093/bioinformatics/btaa025; PMCID: PMC7203741

Harry: PretextView (Paired REad TEXTure Viewer): A desktop application for viewing pretext contact maps. 2022.

Manni, Berkeley, Seppey: BUSCO: assessing genomic data quality and beyond. Curr. Protoc. 2021;1:e323. PMID: 34936221; DOI: 10.1002/cpz1.323

Lawniczak, Durbin, Flicek: Standards recommendations for the earth BioGenome project. Proc. Natl. Acad. Sci. 2022;119(4):e2115639118. PMID: 35042802; DOI: 10.1073/pnas.2115639118; PMCID: PMC8795494

Pronk, Medema: Whokaryote: distinguishing eukaryotic and prokaryotic contigs in metagenomes based on gene structure. Microb. Genom. 2022;8:000823. PMID: 35503723; DOI: 10.1099/mgen.0.000823; PMCID: PMC9465069

Ranallo-Benavidez, Jaron, Schatz: GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat. Commun. 2020;11(1):1432. PMID: 32188846; DOI: 10.1038/s41467-020-14998-3; PMCID: PMC7080791

Rao, Huntley, Durand: A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 2014;159(7):1665–1680. PMID: 25497547; DOI: 10.1016/j.cell.2014.11.021; PMCID: PMC5635824

Rhie, Walenz, Koren: Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 2020;21:227–245. PMID: 32928274; DOI: 10.1186/s13059-020-02134-9; PMCID: PMC7488777

Schomaker-Bastos, Prosdocimi: mitoMaker: a pipeline for automatic assembly and annotation of animal mitochondria using raw NGS data. Preprints. 2018. DOI: 10.20944/preprints201808.0423.v1

Simão, Waterhouse, Ioannidis: BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31(19):3210–3212. DOI: 10.1093/bioinformatics/btv351

Smit AFA, Hubley, Green: RepeatMasker Open-4.0 [Software]. 2013–2015.

Uliano-Silva, Ferreira JGR, Krasheninnikova: MitoHiFi: a python pipeline for mitochondrial genome assembly from PacBio high fidelity reads. BMC Bioinformatics. 2023;24(1):288. PMID: 37464285; DOI: 10.1101/2022.12.23.521667; PMCID: PMC10354987

Vurture, Sedlazeck, Nattestad: GenomeScope: fast reference-free genome profiling from short reads. Bioinformatics. 2017;33(14):2202–2204. PMID: 28369201; DOI: 10.1093/bioinformatics/btx153; PMCID: PMC5870704

Zhou, McCarthy, Durbin: YaHS: yet another Hi-C scaffolding tool. Bioinformatics. 2023;39(1):btac808. PMID: 36525368; DOI: 10.1093/bioinformatics/btac808; PMCID: PMC9848053

10.5256/f1000research.188221.r415498

Reviewer response for version 3

Schwentner

Martin

1 Referee 1Naturhistorisches Museum Vienna (Austria), Vienna, Austria

Competing interests: No competing interests were disclosed.

29 9 2025

2025

This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

recommendation

approve

I think the manuscript is ready for acceptance; the authors have responded to all raised issues and altered all relevant sections.

Are the datasets clearly presented in a usable and accessible format, and the assembly and annotation available in an appropriate subject-specific repository?

Partly

Are sufficient details of the sequencing and extraction, software used, and materials provided to allow replication by others?

Yes

Are the rationale for sequencing the genome and the species significance clearly described?

Yes

Are the protocols appropriate and is the work technically sound?

Yes

Reviewer Expertise:

genomics, systematics, evolutionary research, crustacea

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

10.5256/f1000research.183424.r406386

Reviewer response for version 2

Schwentner

Martin

1 Referee 1Naturhistorisches Museum Vienna (Austria), Vienna, Austria

Competing interests: No competing interests were disclosed.

4 9 2025

2025

recommendation

approve-with-reservations

The authors present the first genome of a thermosbaenacean, which will be an important resource for future research. The overall manuscript is well written and structured, and the methods and results are appropriate and well presented.

I have a few comments that will help to clarify some issues.

1. The authors described in the Amendments that they now report 299 standardized 100 bp gaps, but I could not find this information in the actual manuscript.

2. I tried to download the genome. Maybe I have missed it, but I could not find the whole genome (I was only able to download the first scaffold) and could not find a gene file or similar. Please make sure that all data are available.

3. The contamination check with Blobtools had a strong impact, as it removed ~770 of the 820 scaffolds. Because of this impact, it would be very important and helpful if the actual values used were described. Currently only the metrics are named (GC, coverage, BUSCO), but not the specific values or ranges employed.

4. I find it a bit difficult to follow the reported numbers of scaffolds and contigs and, if I understand the text correctly, the final numbers of contigs and scaffolds do not quite match those from the beginning. 993 contigs were assembled into 821 scaffolds (thus most scaffolds include only one contig). Only 59 scaffolds were retained after Blobtools filtering, and the final set is 322 contigs in 23 scaffolds. That should not be possible; even the 59 scaffolds after Blobtools should not hold more than ~200 contigs. Maybe I did not fully understand the numbers; the authors should make this procedure clearer.

Are the datasets clearly presented in a usable and accessible format, and the assembly and annotation available in an appropriate subject-specific repository?

Partly

Are sufficient details of the sequencing and extraction, software used, and materials provided to allow replication by others?

Yes

Are the rationale for sequencing the genome and the species significance clearly described?

Yes

Are the protocols appropriate and is the work technically sound?

Yes

Reviewer Expertise:

genomics, systematics, evolutionary research, crustacea

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Pons

Joan

Animal&Microbial Biodiversity, Institut Mediterrani d'Estudis Avancats, Esporles, Illes Balears, Spain

Competing interests: No competing interests were disclosed.

9 9 2025

Here are our responses to each question:

1. The authors described in the Amendments that they now report 299 standardized 100 bp gaps, but I could not find this information in the actual manuscript.

The information about the 299 standardized 100 bp gaps was included in Table 2. The new version also includes that information in the main text: “The contact map was curated using Pretext (Harry, 2022), which suggested connections between scaffolds and reduced the final assembly from 59 to 17 scaffolds, while retaining 229 gaps of unknown size (represented as 100 consecutive Ns in the FASTA file).”

The link in the manuscript, https://identifiers.org/ena.embl/PRJEB61927, points to https://www.ebi.ac.uk/ena/browser/view/PRJEB61927 in ENA, where the user can go to the right panel, click on Related ENA Records, and find links to the results of the project:

Result	Count
Experiment (PacBio, Hi-C, and RNA-seq reads)	3
Assembly	1
Genome assembly contig set	1
Sequence (by chromosome)	18
Run (PacBio, Hi-C, and RNA-seq reads)	3

Raw data (PacBio, Hi-C and RNA-seq reads) are also available for download from NCBI under BioProject PRJEB61927.

GC, length, and coverage after cut-off in Blobtools are indicated in Figure 4. Here we summarize the cut-off filter values implemented in the two rounds of Blobtools:

First round in BlobToolKit:

GC_MIN = 0.320

GC_MAX = 0.351

SORTED_ALIGNMENT_COVERAGE_MIN = 36.320

SORTED_ALIGNMENT_COVERAGE_MAX = 680.661

LENGTH_MIN = 10000

Second round in Blobtoolkit:

GC_MIN = 0.330

GC_MAX = 0.340

SORTED_ALIGNMENT_COVERAGE_MIN = 50.000

SORTED_ALIGNMENT_COVERAGE_MAX = 170.000

LENGTH_MIN = 10000
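The two rounds of cut-offs listed above can be read as successive range filters on GC content, coverage and length. The following is a minimal sketch of that logic applied to made-up scaffold records; it does not use the BlobToolKit API, the class and field names are hypothetical, and treating the bounds as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scaffold:
    name: str
    gc: float        # GC fraction between 0 and 1
    coverage: float  # mean read coverage
    length: int      # length in bp


@dataclass(frozen=True)
class Cutoffs:
    gc_min: float
    gc_max: float
    cov_min: float
    cov_max: float
    length_min: int

    def keeps(self, s: Scaffold) -> bool:
        """True if the scaffold lies inside every range (bounds inclusive)."""
        return (
            self.gc_min <= s.gc <= self.gc_max
            and self.cov_min <= s.coverage <= self.cov_max
            and s.length >= self.length_min
        )


# Cut-off values as listed for the two filtering rounds.
ROUND_1 = Cutoffs(0.320, 0.351, 36.320, 680.661, 10_000)
ROUND_2 = Cutoffs(0.330, 0.340, 50.000, 170.000, 10_000)

# Hypothetical scaffolds, invented purely for illustration.
toy_scaffolds = [
    Scaffold("chr_like", gc=0.333, coverage=55.0, length=80_000_000),
    Scaffold("borderline", gc=0.345, coverage=200.0, length=25_000),
    Scaffold("bacterial_like", gc=0.520, coverage=12.0, length=40_000),
    Scaffold("too_short", gc=0.335, coverage=60.0, length=5_000),
]

after_round_1 = [s for s in toy_scaffolds if ROUND_1.keeps(s)]
after_round_2 = [s for s in after_round_1 if ROUND_2.keeps(s)]

print("Kept after round 1:", [s.name for s in after_round_1])
print("Kept after round 2:", [s.name for s in after_round_2])
```

In this toy example the second, narrower round removes a scaffold that passed the first round, which illustrates how tightening the GC and coverage windows reduces the scaffold count further.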

We replaced the results shown in Figure 4, which correspond to the first round of filtering in BlobTools, with those obtained after the second round.

Our previous explanation was confusing, so we provide a new wording in the main text: “The contact map was curated using Pretext (Harry, 2022), which suggested connections between scaffolds and reduced the final assembly from 59 to 17 scaffolds, while retaining 229 gaps of unknown size (represented as 100 consecutive Ns in the FASTA file).”

10.5256/f1000research.183424.r396911

Reviewer response for version 2

Angst

Pascal

1 Referee https://orcid.org/0000-0002-8654-2251 1University of Basel, Basel, Switzerland

Competing interests: No competing interests were disclosed.

22 8 2025

2025

recommendation

approve

The revised manuscript effectively addresses the comments I made as a reviewer. I appreciate the improvements made and the attention given to the issues I raised. Most of the revisions are clear and well implemented. However, I would be interested in receiving some brief clarification on a few of the changes. Could the authors please elaborate on the following points:

The accession added, “SAMEA11313135”, is titled “COVID-HUB-PL deep NGS sequencing of SARS-CoV-2 genomes” in the ENA. Should this be “SAMEA113414145”? – For “SAMEA118091338”, the associated Hi-C data seems to be unavailable on the ENA website.

“In also returns a predicted centromere location for each chromosome.” “It also […]”? Further, did you find any centromeres or centromere-characteristic repeats?

“The genome annotation was assessed using BUSCO obtaining: C:93,1% [S:73,2%, D:19,9%], F:2,2%, M:4,7%, also 27004 transcripts and 22834 genes. RNAQuast has been performed to check the average alignment length, being 1248,6bp.” This is the first time that transcripts, genes, and RNA are mentioned. Is this related to the RNA-seq in PRJEB61927? Was gene annotation performed? If so, how was it done, and where can the annotations be found?

Are the datasets clearly presented in a usable and accessible format, and the assembly and annotation available in an appropriate subject-specific repository?

Partly

Are sufficient details of the sequencing and extraction, software used, and materials provided to allow replication by others?

Partly

Are the rationale for sequencing the genome and the species significance clearly described?

Partly

Are the protocols appropriate and is the work technically sound?

Partly

Reviewer Expertise:

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Pons

Joan

Animal&Microbial Biodiversity, Institut Mediterrani d'Estudis Avancats, Esporles, Illes Balears, Spain

Competing interests: No competing interests were disclosed.

9 9 2025

We reply to each question:

1) The accession added, “SAMEA11313135”, is titled “COVID-HUB-PL deep NGS sequencing of SARS-CoV-2 genomes” in the ENA. Should this be “SAMEA113414145”? – For “SAMEA118091338”, the associated Hi-C data seems to be unavailable on the ENA website.

The biosample used to obtain the PacBio data is SAMEA113414145. We apologize for the mistake and have fixed the error in the new version. The biosample for the Hi-C data is SAMEA118091338.

The link in the manuscript https://identifiers.org/ena.embl/PRJEB61927 points to https://www.ebi.ac.uk/ena/browser/view/PRJEB61927 in ENA. Then, in right panel the user can click on Related ENA Records and find links to all results of the project:

Result	Count
Experiment (PacBio, Hi-C, and RNA-seq reads)	3
Assembly	1
Genome assembly contig set	1
Sequence (by chromosome)	18
Run (PacBio, Hi-C, and RNA-seq reads)	3

Raw data (PacBio, Hi-C and RNA-seq reads) are also available for download from NCBI under BioProject PRJEB61927.

2) “In also returns a predicted centromere location for each chromosome.” “It also […]”? Further, did you find any centromeres or centromere-characteristic repeats?

We did not analyze the sequences of centromere repeats. We just annotated them.

3) “The genome annotation was assessed using BUSCO obtaining: C:93,1% [S:73,2%, D:19,9%], F:2,2%, M:4,7%, also 27004 transcripts and 22834 genes. RNAQuast has been performed to check the average alignment length, being 1248,6bp.” This is the first time that transcripts, genes, and RNA are mentioned. Is this related to the RNA-seq in PRJEB61927? Was gene annotation performed? If so, how was it done, and where can the annotations be found?

We performed a preliminary analysis of the RNA-seq data from another sample (biosample SAMEA117791495) to obtain a preliminary number of coding genes. The workflow consists of several steps. In brief, Illumina paired-end adapters (Nextera PE) were removed from the raw reads using Trimmomatic v0.39. Next, the reference genome assembly was soft-masked with dustmasker, and an index was generated with hisat2-build. The trimmed reads were then aligned to the indexed genome using hisat2, and the resulting alignments were processed with samtools. Finally, gene annotation was performed with BRAKER v3, using a protein database retrieved from OrthoDB as external evidence. The RNA-seq data are available in the project (https://identifiers.org/ena.embl/PRJEB61927); a direct link is https://www.ebi.ac.uk/ena/browser/view/ERX14107725. We are still working toward a full annotation, which we plan to complete once new funding is obtained.
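The workflow described in this reply can be sketched as an ordered list of command lines. This is only an illustration under stated assumptions, not the exact commands that were run: all file names, the thread count, the `ILLUMINACLIP` settings, and the use of a `trimmomatic` wrapper script are hypothetical, and the function merely builds the commands without executing them.

```python
import shlex


def annotation_commands(genome, reads_1, reads_2, proteins,
                        adapters="NexteraPE-PE.fa", threads=8):
    """Return the preliminary annotation steps as argument lists (nothing is run)."""
    masked = "genome.softmasked.fa"   # hypothetical output names
    index = "genome_index"
    t = str(threads)
    return [
        # 1. Remove Nextera paired-end adapters (clip settings are assumptions).
        ["trimmomatic", "PE", "-threads", t, reads_1, reads_2,
         "trim_1P.fq.gz", "trim_1U.fq.gz", "trim_2P.fq.gz", "trim_2U.fq.gz",
         f"ILLUMINACLIP:{adapters}:2:30:10"],
        # 2. Soft-mask low-complexity regions (FASTA output is lowercase-masked).
        ["dustmasker", "-in", genome, "-outfmt", "fasta", "-out", masked],
        # 3. Index the masked genome.
        ["hisat2-build", masked, index],
        # 4. Align the trimmed, paired reads.
        ["hisat2", "-p", t, "-x", index,
         "-1", "trim_1P.fq.gz", "-2", "trim_2P.fq.gz", "-S", "rnaseq.sam"],
        # 5. Sort and index the alignments.
        ["samtools", "sort", "-@", t, "-o", "rnaseq.sorted.bam", "rnaseq.sam"],
        ["samtools", "index", "rnaseq.sorted.bam"],
        # 6. Predict genes with RNA-seq and OrthoDB protein evidence.
        ["braker.pl", f"--genome={masked}", "--bam=rnaseq.sorted.bam",
         f"--prot_seq={proteins}"],
    ]


if __name__ == "__main__":
    for cmd in annotation_commands("assembly.fa", "rna_1.fq.gz",
                                   "rna_2.fq.gz", "orthodb_proteins.fa"):
        print(shlex.join(cmd))
```

Keeping each step as an argument list makes it straightforward to log the commands or pass them to `subprocess.run` once paths and parameters have been checked against the installed tool versions.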

10.5256/f1000research.177497.r376178

Reviewer response for version 1

Angst

Pascal

1 Referee https://orcid.org/0000-0002-8654-2251 1University of Basel, Basel, Switzerland

Competing interests: No competing interests were disclosed.

16 4 2025

2025

recommendation

approve-with-reservations

This genome note presents the genome of Tethysbaena scabra. The authors sampled two pools of specimens for sequencing using PacBio HiFi and Hi-C technologies. They used the latest software for assembling the sequencing reads and for assessing the assembly’s quality, completeness, and contamination level. They discussed potential sources of contamination and separated target from non-target (contaminant) contigs using specialized software. Generally, this article is sound, but I identified a few inconsistencies and a lack of detail that I would like the authors to address. I also expected more details on the sequence (variation) of the genome from a genome note, e.g., a summary of the repeat content and other annotations.

Details on the mandatory reviewer questions:

The protocols and the work seem technically sound but would profit from more details. For example, it is mentioned that pools of specimens were used for sequencing but not why this was done. Clearly, sequencing a single individual would be preferable. Related to this, the methods state that two pools of 20 specimens were sampled. However, the results state that “identical” pools were used. The word identical is confusing in that context, because in the methods the pools are described as separate pools. Also, on NCBI, there is only one BioSample (a batch of 20 individuals) registered and is linked to the Illumina and the PacBio sequencing, which is not what is described in the article. The BioSample should either be a batch of 40 individuals or there should be two BioSamples of 20 individuals each.

Details on DNA extraction and library preparations are missing. To reproduce the work or to apply it to other systems, it is necessary to know what kits, reagents, and protocols were used.

It is not clear how many contigs remained after each step in the methods. What was the number of contigs after Hi-C scaffolding? In the methods it says 821. It also says 59 contigs were obtained after applying BlobToolKit. Are the other 762 (821 - 59) contigs from contaminants? How does that align with the 322 contigs mentioned in Table 2 and the 17 scaffolds mentioned in the Results?

The genome is available from NCBI. However, there are no annotations provided. At least a description of the repetitive content would be valuable. Repetitive content seems to have been assessed since assemblies were "preprocessed to mask repetitive and low-complexity regions".

Additional aspects:

The completeness of the assembly was assessed using BUSCOs and k-mers. Given the chromosome-level assembly, it would be valuable to describe the sequence content and arrangement, and the structural variation. For example, what is the telomeric end repeat motif; what characterizes the centromeres (GC content, sequence content, repeat content); what is the distribution of repeat versus genic content?

It would be important to know the average read lengths or read length N50s.

The keywords should include the full species name.

Are there assembly gaps? What is their size?

There is no phylogenetic analysis. I suggest including one or refer to a previous solid (whole genome) phylogeny.

Are the datasets clearly presented in a usable and accessible format, and the assembly and annotation available in an appropriate subject-specific repository?

Partly

Are sufficient details of the sequencing and extraction, software used, and materials provided to allow replication by others?

Partly

Are the rationale for sequencing the genome and the species significance clearly described?

Partly

Are the protocols appropriate and is the work technically sound?

Partly

Reviewer Expertise:

Pons

Joan

Animal&Microbial Biodiversity, Institut Mediterrani d'Estudis Avancats, Esporles, Illes Balears, Spain

Competing interests: No competing interests were disclosed.

6 6 2025

The main rationale presented for sequencing the genome was the Catalan Initiative for the Earth BioGenome Project (CBP). It would thus be nice to have a short explanation of what that is and what its significance is. ** We added further information as suggested: “The Catalan Biogenome is an EBP-affiliated project network with the objective of sequencing the genome of more than 40000 eukaryotic species living in the Catalan Linguistic Area (such as Balearic Islands)”.

The protocols and the work seem technically sound but would profit from more details. For example, it is mentioned that pools of specimens were used for sequencing but not why this was done. ** We agree that our wording was confusing, so we rewrote the text to clarify the issue: “Extraction of high molecular weight DNA, construction of Pacific Biosciences HiFi circular consensus DNA sequencing libraries, and sequencing on Pacific Biosciences SEQUEL II (HiFi) instrument were performed by Delaware Biotechnology Institute, University of Delaware (DE, USA) using a pool of 20 specimens (Accession number: SAMEA113414145 qmTetScab1). Hi-C data was generated from another pool of 20 individuals from the same collection site (Accession number: SAMEA118091338) using the library preparation Omni-C DNA and sequenced 2 x 150 bp on the Illumina NovaSeq 6000 S4 instrument at the Centre Nacional de Seqüenciació Genòmica (CNAG), Barcelona, Spain.”

Details on DNA extraction and library preparations are missing. To reproduce the work or to apply it to other systems, it is necessary to know what kits, reagents, and protocols were used. ** For the HiFi sequencing, a DNA library was prepared using the PacBio SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences of California, CA, USA) following the official protocol. Approximately 300 ng of high-quality genomic DNA was sheared to ~15–20 kb, repaired, and ligated with SMRTbell adapters to create circular molecules. The library was size-selected to remove fragments smaller than 1,000 bp, purified with AMPure PB beads, and quality-checked using Qubit and TapeStation. Finally, it was sequenced on the PacBio Sequel II platform in CCS mode to generate highly accurate HiFi reads suitable for genome assembly and analysis. These steps were performed at the University of Delaware (USA). ** Hi-C libraries were prepared using the Omni-C Dovetail protocol (Cantata Bio, CA, USA). Briefly, chromatin was extracted from frozen tissue and crosslinked with formaldehyde to preserve the native three-dimensional genome architecture by stabilizing DNA-protein and DNA-DNA interactions within the nucleus. The crosslinked chromatin was then fragmented using DNase I, and spatially proximal DNA ends were ligated to capture physical interactions between genomic regions, providing long-range linkage information useful for validating and scaffolding genome assemblies. Hi-C library preparation and sequencing were performed at the Centre Nacional d’Anàlisi Genòmica (CNAG), Barcelona, Spain.

To replicate the assembly process and subsequent assembly modification the parameters used for the software are missing. Especially, what parameters were used for hifiasm? Given it was not designed for assembly of multiple specimens’ genomes, were the parameters adjusted accordingly? It seems hifiasm had issues in the collapsing step since the authors applied purge_dups, which led to a great reduction in the number of contigs. It is normally not desired to purge haplotigs from hifiasm assemblies and if applied does not result in such a drastic reduction in the number of contigs. I missed a discussion of all of this. ** The number of haplotypes was accounted for using the --n-hap parameter in hifiasm. Given that our species is presumably diploid, as estimated by Smudgeplot, and the sequencing pool included 20 individuals, the parameter was set to 40 (2 × 20). However, due to the presence of multiple individuals, hifiasm may struggle to accurately resolve haplotypes, which may necessitate the use of purge_dups to remove redundant contigs. ** We suspect that most of the duplications are due to the DNA being sourced from a pool of 20 individuals, as a single individual did not provide enough material to construct a HiFi library. We clarify this question in the main text: “The genome size was estimated using GenomeScope2 (Vurture et al., 2017), and diploidy was confirmed with Smudgeplot (Ranallo-Benavidez et al., 2020). Assembly was conducted using hifiasm (Cheng et al., 2021) with n_hap=40 (considering diploidy and 20 individuals). The large number of haplotypic duplications, presumably caused by the high number of specimens used for DNA extraction, was removed with purge_dups (Guan et al., 2020), reducing the assembly from 2208 to 1272 contigs”. We would like to point out that the final genome size after purging duplicates and removing contamination matched the size initially predicted by GenomeScope.
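The two figures in this reply follow from simple arithmetic, shown here as a small sketch. The function names are hypothetical; the inputs (diploidy, 20 pooled individuals, 2208 contigs before and 1272 after purge_dups) are taken from the text above.

```python
def expected_haplotypes(ploidy: int, individuals: int) -> int:
    """Upper bound on distinct haplotypes in a pooled sample."""
    return ploidy * individuals


def contig_reduction(before: int, after: int) -> tuple[int, float]:
    """Number and percentage of contigs removed by a purging step."""
    removed = before - after
    return removed, 100.0 * removed / before


if __name__ == "__main__":
    # Diploid species, pool of 20 specimens -> haplotype count given to hifiasm.
    print("haplotypes:", expected_haplotypes(ploidy=2, individuals=20))

    # Contig counts reported before and after purge_dups.
    removed, pct = contig_reduction(before=2208, after=1272)
    print(f"contigs removed: {removed} ({pct:.1f}%)")
```

With the reported numbers, the haplotype count is 40 and purge_dups removed 936 contigs, roughly 42% of the initial set.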

“individuals that were not externally cleaned so it could also contain DNA from microbial and other eukaryote contaminants”. Why were the individuals not cleaned before assembly despite that this is known to cause assembly issues? ** Specimens were isolated from well water and individually collected to minimize contamination from other macroscopic species. However, this approach did not prevent contamination by microscopic organisms.

"The coverage obtained has not been sufficient to deduce sex-linked chromosomes". Would this be a sensible analysis given the pools of specimens? Why is 53.8x coverage not enough? Is this the haploid sequence coverage? From Figure 4, the coverage seems twice as high. ** Several factors hindered the identification of sex chromosomes in our diploid species. Most prominently, the characteristic haploid coverage pattern typically associated with sex chromosomes was absent. Furthermore, the genome assembly and scaffolding were performed using two separate DNA pools without prior knowledge of the individuals’ sex, complicating the detection of sex-specific sequences. In addition, the lack of biological information on the species and genus—particularly whether sex determination is chromosomal or genetic—further limits the applicability of standard methods for identifying sex chromosomes.

It would be important to know the average read lengths or read length N50s. ** The read length N50 of PacBio raw reads has been added to the results section.

The keywords should include the full species name. ** The species name has been added to the keywords section.

Are there assembly gaps? What is their size? ** The final assembly contains 299 gaps, each 100 bp in length. This is due to the default behavior of tools such as hifiasm, YaHS, and PretextMap, which insert standardized 100 bp gaps when the true gap size cannot be determined.
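For readers who wish to verify gap counts on the released assembly, gaps can be located as runs of `N` in each sequence. The following is a minimal sketch, not the procedure used by the authors; the function name and the toy sequence are hypothetical.

```python
import re


def gap_lengths(sequence: str) -> list[int]:
    """Return the length of every run of N (or n) in a sequence."""
    return [m.end() - m.start() for m in re.finditer(r"[Nn]+", sequence)]


if __name__ == "__main__":
    # Toy scaffold: three contigs joined by two fixed-size 100 bp gaps.
    toy = "ACGT" * 5 + "N" * 100 + "GGCC" * 5 + "N" * 100 + "TTAA" * 5
    gaps = gap_lengths(toy)
    print(len(gaps), "gaps, total", sum(gaps), "bp")

    # For the reported assembly: 299 gaps of 100 bp each.
    print("reported gap total:", 299 * 100, "bp")
```

Applied per chromosome, the same function gives both the number of gaps and their total length; for the reported 299 gaps of 100 bp, the total is 29,900 bp, a negligible fraction of the 1.18 Gb assembly.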

There is no phylogenetic analysis. I suggest including one or refer to a previous solid (whole genome) phylogeny. ** We appreciate the suggestion to include a phylogenetic analysis. However, given the current lack of available genomic data from other species in this peracarid crustacean order, we believe that a phylogenetic analysis conducted at this stage would not be sufficiently reliable. We agree that such an analysis would be highly valuable, particularly once more genomic data from related taxa become available, and it is a future goal of our research group.