Keywords
Thermosbaenacea, anchialine environment, stygobiont species, Tethysbaena scabra
This article is included in the Genomics and Genetics gateway.
We present a genome assembly of Tethysbaena scabra (Arthropoda; Crustacea; Malacostraca; Eumalacostraca; Peracarida; Thermosbaenacea; Monodellidae), a species endemic to Mallorca, Spain. The assembly spans 1.18 gigabases and is scaffolded into 17 chromosomes, plus a mitochondrial genome 16.5 kilobases in length.
In response to the reviewer’s comments, we have made the following revisions and additions to improve the clarity, completeness, and reproducibility of our genome note:
1. Project Context: We added a brief explanation of the Catalan Biogenome Project and its connection to the Earth BioGenome Project (EBP).
2. Sample Pools and Rationale: The methods section now clearly explains why pooled specimens were used for HiFi and Hi-C sequencing, including accession numbers and sequencing facilities.
3. Library Preparation Details: A brief explanation of protocols, reagents, and kits used for both HiFi and Hi-C library preparation has been included.
4. Assembly Parameters: We now specify the use of --n-hap 40 in hifiasm and justify this setting based on diploidy and pooled individuals. We also explain the rationale for applying purge_dups and report the reduction in contig number.
5. Contamination Control: We clarified our approach to minimizing macroscopic contamination and acknowledged limitations in avoiding microscopic contaminants.
6. Contig and Scaffold Counts: The methods section was revised to provide a clearer description of the filtering steps and contig/scaffold numbers at each stage.
7. Sex Chromosome Detection: We expanded on why sex chromosome identification was not feasible, citing the absence of haploid coverage patterns and unknown sex determination mechanisms.
8. Repeat Annotation: A new table was added summarizing the chromosomal distribution and sequence composition of repetitive elements.
9. Read Metrics: The read length N50 of PacBio reads was added to the Results.
10. Gaps: We now report that the assembly contains 299 standardized 100 bp gaps.
11. Keywords and Phylogeny: The species name was added to the keywords. A phylogenetic analysis was not included due to the lack of comparative genomic data, but we acknowledge its value for future work.
Tethysbaena scabra (Pretus, 1991) (NCBI:txid203899) is a thermosbaenacean (Crustacea; Multicrustacea; Malacostraca; Eumalacostraca; Peracarida; Thermosbaenacea; Monodellidae), a member of a relict group of peracarid crustaceans characterized by a dorsal brood pouch, formed in gravid females by a posterior extension of the carapace (Figure 1). The species measures 2–3 mm in length, is completely eyeless and depigmented, and inhabits subterranean waters of elevated salinity in caves and wells located near the coast. It is endemic to the Mediterranean islands of Mallorca and Menorca (Balearic Archipelago). It feeds as a particle collector, thriving primarily in the pycnoclines that develop within the water column of anchialine caves, where organic debris, bacteria, and fungi accumulate. No information is available on genome size or chromosome number in thermosbaenaceans. The closest taxa with known genome sizes (https://www.genomesize.com, 1C values in pg) belong to the peracarid groups Isopoda (1.70–8.60), Amphipoda (0.52–64.62), and Mysida (10.81–12.00).
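The C-values above are given in picograms, whereas the assembly size is reported in gigabases. The following minimal sketch converts between the two so that the ranges can be compared with the 1.18 Gb assembly. It assumes the commonly used conversion of 1 pg ≈ 0.978 Gb, which is not stated in this article; the function and variable names are illustrative only.

```python
# Assumption (not from the article): 1 pg of DNA corresponds to ~0.978 Gb.
GB_PER_PG = 0.978


def pg_to_gb(c_value_pg: float) -> float:
    """Convert a 1C value in picograms to gigabases."""
    return c_value_pg * GB_PER_PG


def gb_to_pg(size_gb: float) -> float:
    """Convert a genome size in gigabases to picograms."""
    return size_gb / GB_PER_PG


# 1C ranges (pg) quoted in the text for the closest peracarid groups.
c_value_ranges_pg = {
    "Isopoda": (1.70, 8.60),
    "Amphipoda": (0.52, 64.62),
    "Mysida": (10.81, 12.00),
}

# Same ranges expressed in gigabases.
ranges_gb = {
    taxon: (pg_to_gb(low), pg_to_gb(high))
    for taxon, (low, high) in c_value_ranges_pg.items()
}

# Assembly size reported for T. scabra, expressed in picograms.
t_scabra_pg = gb_to_pg(1.18)

for taxon, (low, high) in ranges_gb.items():
    print(f"{taxon}: {low:.2f}-{high:.2f} Gb")
print(f"T. scabra assembly: {t_scabra_pg:.2f} pg")
```

Under this assumed conversion factor, the assembly size can be placed on the same scale as the published C-value ranges.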
The genome sequence from T. scabra will help to study adaptation to underground environments, particularly anchialine ones, that are characterized by oligotrophy, darkness and salinity. The genome of T. scabra was sequenced under the umbrella of the Catalan Initiative for the Earth BioGenome Project (CBP). Here we present a chromosome-level genome assembly for T. scabra from Mallorca, Spain, which represents the first reference genome for the order Thermosbaenacea.
Specimens were collected in late spring 2022 with a modified plankton net from the bottom of a well in an old windmill at Es Pil·larí, Palma, Mallorca, Spain (39.533831, 2.747581). Specimens were sorted under a stereomicroscope (Figure 2). Several batches of 20 specimens each were placed in cryovials, snap-frozen in liquid nitrogen, and subsequently shipped on dry ice to the sequencing facilities. Specimens were collected and identified by Damià Jaume. Extraction of high-molecular-weight DNA, construction of Pacific Biosciences HiFi circular consensus sequencing libraries, and sequencing on a Pacific Biosciences Sequel II instrument were performed by the Delaware Biotechnology Institute, University of Delaware (DE, USA), using a pool of 20 specimens (Accession number: SAMEA11313135, qmTetScab1). Hi-C data were generated from another pool of 20 individuals from the same collection site (Accession number: SAMEA118091338) using the Omni-C library preparation and sequenced at 2 × 150 bp on an Illumina NovaSeq 6000 S4 instrument at the Centre Nacional d’Anàlisi Genòmica (CNAG), Barcelona, Spain.
The genome size was estimated using GenomeScope2 (Vurture et al., 2017), and diploidy was confirmed with Smudgeplot (Ranallo-Benavidez et al., 2020). Assembly was conducted using hifiasm (Cheng et al., 2021) with --n-hap 40 (considering diploidy and 20 individuals). The large number of haplotypic duplications, presumably caused by the high number of specimens pooled for DNA extraction, was removed with purge_dups (Guan et al., 2020), reducing the assembly from 2208 to 1272 contigs. Because the individuals are smaller than 5 mm, they were not externally cleaned before DNA extraction, so the extract may also contain DNA from microbial and other eukaryotic contaminants. Contig sequences from contaminant species were therefore removed from the assembly using two bioinformatic tools, Foreign Contamination Screen (FCS, Astashyn et al., 2024) and Whokaryote (Pronk and Medema, 2022), leaving 993 contigs. FCS aligns assemblies, preprocessed to mask repetitive and low-complexity regions, to a curated reference database. The pipeline segments scaffolds into 100-kb subsequences and employs hashed k-mers as alignment seeds. Sequences assigned to taxonomic groups distinct from the query organism (NCBI:txid203899) were then excluded. Whokaryote differentiates eukaryotic from prokaryotic contig sequences based on fundamental differences in gene structure between the two taxonomic domains. It uses a Random Forest approach in combination with Tiara predictions, which incorporate k-mer frequency distributions as a classification feature. The assembly was scaffolded with Hi-C data (Rao et al., 2014) using YaHS (Zhou et al., 2023), yielding 821 scaffolds. The assembly was then checked for contamination with two rounds of BlobToolKit to ensure complete decontamination, leaving 59 scaffolds.
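To make the notion of a k-mer frequency distribution used as a classification feature more concrete, the sketch below computes the relative frequency of every possible k-mer in a sequence. It only illustrates the kind of feature vector involved and is not the implementation used by Tiara, Whokaryote, or FCS; the function name `kmer_profile` and the toy sequence are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq: str, k: int = 4) -> dict[str, float]:
    """Return the relative frequency of every A/C/G/T k-mer in seq.

    Windows containing any other symbol (e.g. N) are skipped.
    """
    seq = seq.upper()
    alphabet = set("ACGT")
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= alphabet
    )
    total = sum(counts.values())
    # Enumerate all 4**k k-mers so that every profile has the same length
    # and ordering, as needed for a fixed-size feature vector.
    return {
        "".join(kmer): (counts["".join(kmer)] / total if total else 0.0)
        for kmer in product("ACGT", repeat=k)
    }


# Toy example with dinucleotides; the N breaks two of the windows.
profile = kmer_profile("ACGTACGTNACGT", k=2)
print({kmer: round(freq, 2) for kmer, freq in profile.items() if freq > 0})
```

A classifier would receive one such fixed-length vector per contig; a larger k gives a richer but sparser profile.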
FCS and Whokaryote removed far fewer sequences than BlobToolKit because the former two rely only on a close-taxon reference, which is not available for Thermosbaenacea, and on gene structure and domains, whereas BlobToolKit draws on several features (GC content, coverage, BUSCO reference, etc.). The contact map was curated using Pretext (Harry, 2022). The resulting assembly comprises 322 contigs structured in 23 scaffolds. Some of these scaffolds show contact patterns in their central regions that suggest connections between them, which ultimately yielded a total of 17 scaffolds. Putative sex chromosomes were not identified, likely because the genomic material was sourced from a pool of 20 individuals of unknown sex and the Hi-C data were derived from a separate pool of specimens. Additionally, the coverage obtained was not sufficient to infer sex-linked chromosomes. The genome was analysed within the BlobToolKit environment and BUSCO scores were generated (Challis et al., 2020). Table 1 lists the software tool versions used, where appropriate. To assess assembly quality, k-mer completeness and consensus quality values (QV) were calculated using Meryl and Merqury (Rhie et al., 2020).
Software tool | Version | Source |
---|---|---|
Blastn | 2.12.0+ | https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html |
BlobToolKit | 4.3.5 | https://github.com/blobtoolkit/blobtoolkit |
BUSCO | 5.5.0 | https://gitlab.com/ezlab/busco/-/archive/5.5.0/busco-5.5.0.zip |
FCS | 0.5.3 | https://github.com/ncbi/fcs |
GenomeScope2 | 2.0 | https://github.com/tbenavi1/genomescope2.0 |
Hifiasm | 0.20.0-r639 | https://github.com/chhylp123/hifiasm |
Merqury | 1.3 | https://github.com/marbl/merqury |
Meryl | 1.4.1 | https://github.com/marbl/meryl |
PretextMap | 0.1.9 | https://github.com/sanger-tol/PretextMap |
RepeatMasker | 4.1.7 | https://github.com/Dfam-consortium/RepeatMasker |
RepeatOBserver | 1.0 | https://github.com/celphin/RepeatOBserverV1 |
Purge_dups | 1.2.5 | https://github.com/dfguan/purge_dups |
Smudgeplot | 0.3.0 | https://github.dev/KamilSJaron/smudgeplot |
Whokaryote | 1.1.2 | https://github.com/LottePronk/whokaryote |
YaHS | 1.2 | https://github.com/c-zhou/yahs |
Assembly of the mitochondrial genome with MitoHiFi (Uliano-Silva et al., 2023) failed, likely because genome databanks lack a mitogenome sequence from a sufficiently close taxon. For this reason, the contigs were compared, using relaxed BLASTn settings, against a database built from mitogenome sequences of several peracarid species. The 30 kb sequence with a positive match was circularized in MitoMaker (Schomaker-Bastos and Prosdocimi, 2018) and annotated in Mitos2 (Donath et al., 2019).
Repeat annotation was performed using RepeatMasker (Smit et al., 2013–2015) and RepeatOBserver (Elphinstone et al., 2025). The former identifies low-complexity DNA regions as well as interspersed repeats. In contrast, RepeatOBserver describes tandem repeats and clusters of transposons in a chromosome-level assembly, based on repeat patterns. It also returns a predicted centromere location for each chromosome.
The genome sequence was obtained from a DNA pool of 20 specimens of T. scabra for HiFi data, plus another pool of 20 specimens for Hi-C data, all collected in a well in Es Pil·larí, Palma, Mallorca, Spain. Two Pacific Biosciences sequencing cells yielded a total of 63.5 gigabases of high-fidelity (HiFi) long reads with a read N50 of 13,270 bp, achieving a coverage of 53.8X. The primary contig assembly was then scaffolded using 73.9 Gb of paired-end Illumina reads derived from Hi-C chromosome conformation data. Manual curation corrected 39 misassemblies, including missing joins and misjoins, resulting in a 0.28% reduction in the total assembly length, a 61.02% decrease in scaffold count, and an 89.99% increase in scaffold N50. The final genome assembly spans 1.18 Gb across 23 scaffolds, with a scaffold N50 of 74.6 Mb (Figure 3, Table 2). GC-coverage (Figure 4) and cumulative sequence plots (Figure 5) from BlobToolKit showed minimal parameter variation with few outliers, and only a very small fraction of sequences failed to match arthropod sequences deposited in databases. Most of the assembly sequence (99.2%) was assigned to the final chromosomes. The final assembly sequence confirmed by Hi-C data was assigned to 17 chromosome-level scaffolds, which are numbered in the order in which they appear in the PretextMap (Figure 6; Table 3). The assembly has a BUSCO v5.5.0 (Manni et al., 2021; Simão et al., 2015) completeness of 94.7% (single 93.7%, duplicated 0.7%) using the arthropoda_odb10 reference set. The mitochondrial genome contig can be found within the multi-FASTA file of the genome submission.
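Two of the figures above follow from simple arithmetic, sketched here for reproducibility. The coverage is the HiFi yield divided by the assembly size. The scaffold-count change is computed on the assumption that the 61.02% decrease refers to the 59 scaffolds remaining after decontamination and the 23 scaffolds after curation, as reported in the Methods. The function names are illustrative.

```python
def coverage(total_bases_gb: float, genome_size_gb: float) -> float:
    """Mean sequencing depth: sequenced bases divided by genome size."""
    return total_bases_gb / genome_size_gb


def percent_change(before: float, after: float) -> float:
    """Signed percentage change from before to after."""
    return (after - before) / before * 100


# HiFi yield (63.5 Gb) over the assembly size (1.18 Gb).
hifi_coverage = coverage(63.5, 1.18)

# Assumption: curation reduced the scaffold count from 59 to 23.
scaffold_change = percent_change(59, 23)

print(f"HiFi coverage: {hifi_coverage:.1f}X")
print(f"Scaffold count change after curation: {scaffold_change:.2f}%")
```

Both values agree with the reported 53.8X coverage and 61.02% decrease in scaffold count when rounded to the same precision.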
This snailplot generated by BlobToolKit displays several metrics, including the longest scaffold, N50, and BUSCO gene completeness, among others. The main plot is segmented into 50 bins, ordered by size around the circumference, with each bin representing 2% of the 1.18 Gbp assembly. Scaffold length distribution is shown in dark grey, with the plot radius scaled to the length of the longest scaffold in the assembly (104 Mbp). Orange and light-orange arcs indicate the N50 and N90 scaffold lengths (74.6 Mbp and 55.4 Mbp, respectively). A pale grey spiral illustrates the cumulative scaffold count on a log scale, with white scale lines marking successive orders of magnitude. The blue and pale-blue areas along the plot's outer edge depict the GC, AT, and N content distribution across these bins. A summary of the BUSCO results appears in the figure’s top right corner.
Assembly metric benchmarks are adapted from the 6.C.Q40 standard of the Earth BioGenome Project (Lawniczak et al., 2022). BUSCO scores are based on the arthropoda_odb10 BUSCO set using v5.5.0. C = complete, [S = single copy, D = duplicated], F = fragmented, M = missing, n = number of orthologues in comparison.
Scaffolds are shown by phylum. Circles are sized in proportion to scaffold length. Histograms show the distribution of scaffold length sum along each axis.
The gray line represents the cumulative length of all scaffolds, while the colored lines indicate the cumulative lengths of scaffolds assigned to each individual phylum.
Chromosomes are shown as they appear in PretextMap, not by size order.
https://www.ebi.ac.uk/ena/browser/view/GCA_964277195.1?show=chromosomes.
Accession | Name | Length (Mb) | GC% |
---|---|---|---|
OZ195310 | tros_1 | 83.11 | 33.29 |
OZ195311 | tros_2 | 104.46 | 33.18 |
OZ195312 | tros_3 | 85.72 | 33.29 |
OZ195313 | tros_4 | 82.79 | 33.44 |
OZ195314 | tros_5 | 87.20 | 33.33 |
OZ195315 | tros_6 | 74.56 | 33.45 |
OZ195316 | tros_7 | 74.67 | 33.31 |
OZ195317 | tros_8 | 72.98 | 33.51 |
OZ195318 | tros_9 | 61.70 | 33.58 |
OZ195319 | tros_10 | 49.14 | 33.44 |
OZ195320 | tros_11 | 56.67 | 33.72 |
OZ195321 | tros_12 | 70.05 | 33.68 |
OZ195322 | tros_13 | 55.35 | 33.43 |
OZ195323 | tros_14 | 59.37 | 33.46 |
OZ195324 | tros_15 | 57.12 | 33.76 |
OZ195325 | tros_16 | 55.68 | 33.69 |
OZ195326 | tros_17 | 45.10 | 33.69 |
OZ195327 | MT | 0.016 | 32.04 |
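As a consistency check, the scaffold N50 and N90 can be recomputed from the chromosome lengths in the table above. This is a minimal sketch that uses only the 17 chromosome-level scaffolds (the mitochondrial genome and any unplaced scaffolds are excluded), so values computed on the full assembly may differ slightly; the helper name `nx` is illustrative.

```python
def nx(lengths: list[float], x: float = 50) -> float:
    """Return the Nx value: the length L such that sequences of length >= L
    together cover at least x% of the total length."""
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * x / 100
    running = 0.0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    raise ValueError("lengths must not be empty")


# Chromosome lengths in Mb, tros_1 to tros_17, copied from the table.
chromosome_lengths_mb = [
    83.11, 104.46, 85.72, 82.79, 87.20, 74.56, 74.67, 72.98, 61.70,
    49.14, 56.67, 70.05, 55.35, 59.37, 57.12, 55.68, 45.10,
]

total_mb = sum(chromosome_lengths_mb)
n50_mb = nx(chromosome_lengths_mb, 50)
n90_mb = nx(chromosome_lengths_mb, 90)

print(f"Total chromosome length: {total_mb:.2f} Mb")
print(f"N50: {n50_mb} Mb, N90: {n90_mb} Mb")
```

On these 17 lengths the sketch gives an N50 of 74.56 Mb and an N90 of 55.35 Mb, in line with the 74.6 Mb and 55.4 Mb reported for the assembly.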
The genome annotation was assessed using BUSCO, yielding C:93.1% [S:73.2%, D:19.9%], F:2.2%, M:4.7%; the annotation comprises 27,004 transcripts and 22,834 genes. rnaQUAST was run to check the average alignment length, which was 1248.6 bp. Repetitive regions are summarized in Table 4.
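BUSCO summaries such as the one above can be parsed and checked for internal consistency (complete = single + duplicated; complete + fragmented + missing = 100%). The sketch below is a small illustrative parser written for the notation used in this article; it is not part of BUSCO itself, and the names `BUSCO_PATTERN` and `parse_busco` are hypothetical.

```python
import re

# Matches strings such as "C:93.1% [S:73.2%, D:19.9%], F:2.2%, M:4.7%".
BUSCO_PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\s*\[S:(?P<S>[\d.]+)%,\s*D:(?P<D>[\d.]+)%\],\s*"
    r"F:(?P<F>[\d.]+)%,\s*M:(?P<M>[\d.]+)%"
)


def parse_busco(summary: str) -> dict[str, float]:
    """Extract the C, S, D, F and M percentages from a BUSCO summary."""
    match = BUSCO_PATTERN.search(summary)
    if match is None:
        raise ValueError(f"not a BUSCO summary: {summary!r}")
    return {key: float(value) for key, value in match.groupdict().items()}


scores = parse_busco("C:93.1% [S:73.2%, D:19.9%], F:2.2%, M:4.7%")

# Allow for rounding to one decimal place in the reported percentages.
tolerance = 0.05 + 1e-9
complete_ok = abs(scores["S"] + scores["D"] - scores["C"]) <= tolerance
total_ok = abs(scores["C"] + scores["F"] + scores["M"] - 100) <= tolerance

print(scores)
print(f"S + D matches C: {complete_ok}; C + F + M sums to 100: {total_ok}")
```

For the annotation scores quoted above, both checks pass.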
Conceptualization (JP, CJ, DJ, JAJR), Data Curation (KDSA, LTL, JP), Formal Analysis (LTL, KDSA, JP), Funding Acquisition (JAJR, JP), Resources (DJ), Writing – Original Draft Preparation (LTL, KDSA, JP), and Writing – Review & Editing (all).
The Tethysbaena scabra genome project is integrated into the Catalan Initiative for the Earth BioGenome Project (CBP), and all raw data and the assembly were deposited in the European Nucleotide Archive: Tethysbaena scabra. Accession number PRJEB61927; https://identifiers.org/ena.embl/PRJEB61927. Raw data and assembly accession identifiers are reported in Table 3.
We are thankful to the bioinformaticians Jessica Gómez-Garrido and Tyler Alioto (Centre Nacional d’Anàlisi Genòmica, CNAG) and Emilio Righi (Centre for Genomic Regulation, CRG), both in Barcelona (Spain), for their invaluable assistance.