Keywords
Whole genome sequencing, genome assembly, genome annotation, slender walking catfish, Prophagorus nieuhofii.
This article is included in the Genomics and Genetics gateway.
The slender walking catfish, Prophagorus nieuhofii, plays an important role in small-scale fisheries across Southeast Asia, supporting food security. While IUCN currently lists it as a Least Concern species, growing demand and pressures such as overfishing, habitat loss, and degradation may elevate its conservation risk. To support sustainable fisheries management and aquaculture, we sequenced, assembled, and annotated the whole genome of this species. The specimen was part of an expedition to document and preserve the genetic resources of aquatic animals in Kalimantan’s freshwater ecosystems. Using 27 Gb of sequence data, we assembled a 1.1 Gb genome comprising 5,790 scaffolds. This genome assembly has high contiguity and completeness, with N50 of 33.7 Mb and a BUSCO score of 98.8%. Repeat annotation revealed that 48.17% of the genome consisted of repetitive elements, predominantly DNA transposons (18.56%) and retroelements (13.30%). Structural annotation identified 30,099 protein-coding genes and 37,734 transcripts, most of which were multi-exonic and rich in alternative splicing. BUSCO analysis confirmed the high completeness of the genome and annotation, with 97.7% of the conserved orthologs being detected.
This revised version incorporates amendments in response to the reviewers’ suggestions and recommendations. The title has been revised to “Reference-guided draft genome assembly of the slender walking catfish, Prophagorus nieuhofii”. Regarding data availability, the draft genome annotation has been deposited in the Zenodo repository (https://doi.org/10.5281/zenodo.17422526) as underlying data. Additionally, Figure 1 and Figure 2 have been removed from the main text and deposited in the Zenodo repository as extended data (DOI: https://doi.org/10.5281/zenodo.17429587). Consequently, the Data availability section has been amended to include a statement reflecting the deposition of the genome annotation (underlying data) and of figures such as the snail graph and BUSCO results (extended data), along with their respective DOIs. The reference section has also been updated with two new references that cite these underlying and extended data. Furthermore, Tables 3 and 4 have been substantially amended. They have been consolidated into comparative tables, which now include a comparison with the genome assemblies of four other catfish species: Clarias gariepinus (Clariidae), Ictalurus furcatus (Ictaluridae), Ictalurus punctatus (Ictaluridae), and Tachysurus fulvidraco (Bagridae). To reflect these table changes, the related text in the “Structural and functional annotation” subsection has also been revised to enhance the comparative description. Finally, the Acknowledgment section has been corrected to fix the name of the sequencing and computing facility.
See the authors' detailed response to the review by Geoffrey C. Waldbieser
See the authors' detailed response to the review by Gulab D. Khedkar
The slender walking catfish, Prophagorus nieuhofii (previously known as Clarias nieuhofii), is widespread in Southeast Asia, including Indonesia (specifically Java, Sumatra, and Kalimantan), the Malay Peninsula, Singapore, Thailand, and the Philippines.1 It is a popular food fish because of its good taste and nutritional benefits, and it contributes to food security by supporting artisanal fisheries. While the IUCN Red List of Threatened Species classifies it as Least Concern in its global assessment, habitat loss and degradation and fishing pressure have caused declines in many natural populations.2–4 In Thailand, it has been classified as a vulnerable species,5 and a genetic assessment has been carried out to manage its natural populations.4
In addition to efforts to sustain this fish in its natural environment, several studies have sought to develop it into a farmed species. Owing to its air-breathing capability, resilience, and adaptability, the species shows significant potential for domestication and aquaculture. Preliminary studies have examined growth and survival during the early stages of domestication,6 breeding and reproduction,7 and its exploration as a probiotic source.8 Although further research is necessary to optimize its cultivation, its inherent characteristics are conducive to successful aquaculture.
Generating the whole genome sequence of this species will provide a resource for both fisheries management and aquaculture development. For fisheries management, the large-scale discovery of single nucleotide polymorphisms (SNPs) that capture genome-wide (neutral) and allele-specific (adaptive) diversity patterns will support genomic stock identification, traceability, and studies of fisheries-induced evolution and climate change.9 For aquaculture, the whole genome sequence can be combined with other approaches, such as quantitative trait loci (QTL) analysis, genome-wide association studies (GWAS), and expression profiling, to predict genotypic variants associated with phenotypic traits and thereby improve traits in breeding programs.10,11
Fish samples were collected during a 2024 expedition aimed at characterizing genetic resources of aquatic animals from a natural population in South Kalimantan, Indonesia (3°21′43.0″S, 114°42′08.3″E). Specimens were captured using bubu traps and held in a pond with 60 cm water depth at 27–28°C for three days to reduce stress. Prior to tissue sampling for DNA, fish were anesthetized following a published procedure12: they were placed in a 35-liter bucket with 7 cm of water at 28°C, cooled with liquid ice to 21°C, and then clove oil was added at 160 mg/L. Tissue samples were collected when the fish showed minimal movement after anesthesia. A 10 mg tissue sample was collected from an individual measuring 32.5 cm in length and weighing 277 g, preserved in DNA shield solution, and transported to the laboratory for sequencing. High-quality DNA was extracted using the Quick-DNA high molecular weight (HMW) MagBead kit (Zymo Research) with overnight proteinase K digestion. Before sequencing, the DNA extract was quantified using a Qubit fluorometer with an Equalbit 1x dsDNA HS assay kit.
Whole genome sequencing was performed on an Oxford Nanopore Technologies (ONT) PromethION platform. Genomic DNA (1500 ng in 48 µL nuclease-free water) was incubated at 20°C for 30 min, followed by incubation at 65°C for 5 min. Ligation-based sequencing was performed using the Ligation Sequencing DNA V14 workflow kit (SQK-LSK114). Basecalling was done with Dorado v0.9.1 using the dna_r10.4.1_e8.2_400bps_sup@v5.0.0 model, with a minimum Q score of 10 and trimming of adapters and barcodes. The quality of the sequencing data was checked using NanoPlot.13
Genome assembly was performed using Flye 2.9.5,14 and scaffolding was conducted with RagTag15 2.1.0, guided by the reference genome of Clarias gariepinus (GCF_024256425.1). Genome size was estimated using Jellyfish16 version 2.3.1 and further processed with GenomeScope 2.0 v2.0.1. Assembly statistics were calculated using assembly-stats version 1.0.1. The completeness of the assembly was estimated using Benchmarking Universal Single-Copy Orthologs (BUSCO) version 5.8.2 with miniprot.17–19
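The assembly steps above can be sketched as a set of command lines. This is a minimal illustration only, not the authors' actual invocation: the file names, thread count, Flye read mode (`--nano-hq`), Jellyfish hash size, and output locations are all assumptions, and the function merely builds the commands without running them.

```python
from pathlib import Path


def assembly_commands(reads, reference, outdir, threads=16, kmer=21):
    """Build (but do not run) illustrative commands for the steps in the text.

    All paths and parameter values are hypothetical placeholders; only the
    tool names and the 21-mer size come from the article.
    """
    out = Path(outdir)
    flye_dir = out / "flye"
    ragtag_dir = out / "ragtag"
    return {
        # De novo long-read assembly (read mode is an assumption).
        "flye": ["flye", "--nano-hq", reads,
                 "--out-dir", str(flye_dir), "--threads", str(threads)],
        # Reference-guided scaffolding against the C. gariepinus genome.
        "ragtag": ["ragtag.py", "scaffold", reference,
                   str(flye_dir / "assembly.fasta"),
                   "-o", str(ragtag_dir), "-t", str(threads)],
        # Canonical k-mer counting, then a histogram for GenomeScope.
        "jellyfish_count": ["jellyfish", "count", "-C", "-m", str(kmer),
                            "-s", "1G", "-t", str(threads),
                            "-o", str(out / "reads.jf"), reads],
        "jellyfish_histo": ["jellyfish", "histo", "-t", str(threads),
                            "-o", str(out / "reads.histo"),
                            str(out / "reads.jf")],
        # Completeness assessment of the scaffolded assembly.
        "busco": ["busco", "-i", str(ragtag_dir / "ragtag.scaffold.fasta"),
                  "-l", "actinopterygii_odb10", "-m", "genome",
                  "-o", "busco_assembly", "-c", str(threads)],
    }


if __name__ == "__main__":
    for name, cmd in assembly_commands("reads.fastq", "reference.fna", "out").items():
        print(name, "->", " ".join(cmd))
```

Each entry could be passed to `subprocess.run(cmd, check=True)` once the placeholder paths are replaced with real files.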
Repetitive elements within the genome assembly were identified using RepeatModeler v2.0.6 in conjunction with RepeatMasker v4.1.7 (http://www.repeatmasker.org). Prior to annotation, these repetitive regions were soft-masked to minimize interference. Structural genome annotation, encompassing gene prediction, was conducted using the GALBA pipeline,20 which employs miniprot17 and AUGUSTUS21 and integrates protein data from closely related species as extrinsic evidence. Specifically, protein data from Clarias gariepinus (GCF_024256425.1), Ictalurus furcatus (GCF_023375685.1), Ictalurus punctatus (GCF_001660625.3), and Tachysurus fulvidraco (GCF_022655615.1) were used. Functional annotation of the resulting gene predictions was then performed using the ‘funannotate annotate’ command from the Funannotate pipeline (https://funannotate.readthedocs.io/en/latest/install.html), incorporating InterProScan 5,22 eggNOG-mapper,23 and SignalP24 version 5.0 to assign gene names and predict protein functions. Finally, the completeness of the genome annotation was evaluated using BUSCO v5.8.2.19
This research was approved by the Ethics Commission for Animal Husbandry and Use, National Research and Innovation Agency (Approval No. 174/KE.02/SK/07/2024). All animal-related procedures were conducted in accordance with institutional guidelines and complied with the ARRIVE 2.0 reporting standards, the checklists for which are available at https://doi.org/10.6084/m9.figshare.29612615.v1.25
Sequencing produced a total of 27,388,841,658 bases from 4,440,560 reads, with 99.8% of bases meeting the designated quality standards. The read with the highest mean basecall quality score (46.4) was 133 bases long, while the longest read reached 6,129,425 bases with a mean basecall quality score of 14.5 (Table 1). The draft genome assembly, as illustrated in the snail graph (https://doi.org/10.5281/zenodo.17429587),26 comprises 5,790 scaffolds totaling 1.1 gigabases, with the longest scaffold spanning 54 megabases.
| Metric | Value |
|---|---|
| Total bases | 27,388,841,658 |
| Mean read length (bases) | 6,167.9 |
| Mean read quality | 20.2 |
| Median read length (bases) | 4,772.0 |
| Median read quality | 23.6 |
| Number of reads | 4,440,560 |
| Read length N50 (bases) | 8,687.0 |
| STDEV read length (bases) | 5,796.8 |
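As a rough guide to how deeply the genome was sampled, the sequencing yield can be divided by the assembly size. The sketch below uses the reported total bases and the rounded 1.1 Gb assembly size; treating the assembly size as the true genome size is an assumption, so the result is only an approximate mean depth.

```python
def mean_depth(total_bases, genome_size_bp):
    """Approximate mean sequencing depth as yield divided by genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return total_bases / genome_size_bp


TOTAL_BASES = 27_388_841_658   # reported sequencing yield
ASSEMBLY_SIZE_BP = 1.1e9       # rounded assembly size (assumed genome size)

depth = mean_depth(TOTAL_BASES, ASSEMBLY_SIZE_BP)
print(f"Approximate mean depth: {depth:.1f}x")
```

Under these assumptions the depth comes out at roughly 25x.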
The N50 and N90 values, which measure assembly contiguity, are 33.7 Mb and 20.6 Mb, respectively. The base composition showed 39.5% GC content and 60.5% AT content, whereas the N content (gaps) remained minimal at 0.04%, indicating a highly contiguous and well-assembled genome. Using the Actinopterygii ortholog database, which is based on 3,640 universal genes, the assembly demonstrated 98.8% completeness with a low percentage of missing BUSCOs (1.15%), suggesting that most expected genes are present. The genome size of this species is similar to that of a related species, Clarias gariepinus, which has a genome size of 969.62 Mb and contig N50 of 33.71 Mb.27 Genome characterization based on 21-mers shows a heterozygosity rate of 0.78%, while the homozygosity rate was 99.12%.
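For readers unfamiliar with these contiguity metrics, Nx is the length of the sequence at which the cumulative length of sequences, sorted from longest to shortest, first reaches x% of the total. The following is a minimal sketch using made-up toy lengths rather than the actual scaffold lengths; the function name `nx` is illustrative.

```python
def nx(lengths, percent):
    """Return the Nx statistic (e.g. percent=50 for N50) of sequence lengths.

    Sequences are sorted from longest to shortest and summed until the
    running total reaches `percent` of the combined length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * percent:
            return length


# Toy scaffold lengths (hypothetical values, total = 300).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print("N50:", nx(toy_lengths, 50))
print("N90:", nx(toy_lengths, 90))
```

Because N90 requires covering more of the assembly, it can never exceed N50, which matches the reported 20.6 Mb versus 33.7 Mb.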
Repeats annotation
Repeat annotation analysis revealed that approximately 48.17% of the genome (529,073,132 bp) consisted of repetitive elements (Table 2). Among these, retroelements accounted for 13.30% of the genome, spanning over 146 million base pairs across 492,154 elements. This category includes SINEs, which comprise 1.93% of the genome, and LINEs, which occupy 5.33%. The LINEs were mainly composed of L2/CR1/Rex elements (4.01%), followed by the R1/LOA/Jockey, RTE/Bov-B, and L1/CIN4 subfamilies. LTR elements covered the largest share among retroelements, comprising 6.04% of the genome, largely represented by Gypsy/DIRS1 (2.55%) and retroviral elements (1.05%), along with smaller contributions from BEL/Pao and Ty1/Copia. Notably, some retroelement families such as CRE/SLACS were not detected.
DNA transposons represented the largest category of repeats, both in number and genomic coverage, with 924,291 elements occupying 18.56% of the genome (203.9 million base pairs). Within this group, the TC1-IS630-Pogo family was predominant, covering 11.38% of the genome. Other notable contributors included hobo-Activator (3.40%), PiggyBac (0.41%), Tourist/Harbinger (0.64%), and MULE-MuDR (0.01%), while some families, such as En-Spm, showed no representation. Additionally, rolling-circle transposons comprising 30,262 elements and 0.56% of the genome were identified. A substantial portion of the genome (12.13%) contained unclassified elements, amounting to 986,211 entries. These may represent novel, divergent, or currently uncategorized repeat families.
Other repetitive elements included small RNA-related sequences (1.23%), simple repeats (e.g. microsatellites, 3.17%), low-complexity regions (0.31%), and satellite DNA (0.05%). Overall, interspersed repeats alone account for 44.03% of the genome (483.6 million base pairs), underscoring the genomic complexity and abundance of repetitive sequences, especially DNA transposons and retroelements.
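The repeat percentages and base-pair totals reported above can be cross-checked with simple arithmetic. The sketch below derives an implied genome size from the total repeat content and converts percentages back into megabases; the implied size is a back-calculation from rounded percentages, not a reported value, so small discrepancies are expected.

```python
# Values taken from the text.
REPEAT_BP = 529_073_132   # total repetitive sequence
REPEAT_PCT = 48.17        # share of the genome that is repetitive

# Back-calculated genome size (an approximation from rounded percentages).
implied_genome_bp = REPEAT_BP / (REPEAT_PCT / 100)


def pct_to_mb(pct, genome_bp):
    """Convert a genome percentage into megabases."""
    return pct / 100 * genome_bp / 1e6


# Retroelement subclasses should add up to the reported 13.30%.
retroelements = {"SINEs": 1.93, "LINEs": 5.33, "LTR elements": 6.04}
retro_total_pct = sum(retroelements.values())

dna_transposon_mb = pct_to_mb(18.56, implied_genome_bp)
interspersed_mb = pct_to_mb(44.03, implied_genome_bp)

print(f"Implied genome size: {implied_genome_bp / 1e6:.1f} Mb")
print(f"Retroelement subclasses sum to {retro_total_pct:.2f}%")
print(f"DNA transposons: {dna_transposon_mb:.1f} Mb")
print(f"Interspersed repeats: {interspersed_mb:.1f} Mb")
```

Under this back-calculation the figures agree with the reported 1.1 Gb assembly, the 203.9 Mb of DNA transposons, and the 483.6 Mb of interspersed repeats to within rounding.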
Structural and functional annotation
The genome annotation of Prophagorus nieuhofii identified 30,099 protein-coding genes, a count comparable to that of Ictalurus punctatus (approximately 31,040 genes), and 37,734 predicted transcripts. Relative to the other four catfish species examined (Table 3), P. nieuhofii exhibits a compact gene architecture and reduced splicing complexity. This is evidenced by the lowest mean number of transcripts per gene at 1.3 (others range from 1.7 to 2.3) and by alternative splicing detected in only 18.1% of genes, considerably less than the 32.2% to 48.0% observed in the remaining species. Furthermore, P. nieuhofii has a substantially higher proportion of single-exon genes (12.5%), far exceeding the 3.5% to 4.4% found in the other catfish, which possess a more uniformly multi-exonic architecture. Structural measurements confirm this compactness: P. nieuhofii genes have a smaller average locus length (15,993.4 bp vs. 19,155.7 bp to 22,083.0 bp) and fewer distinct exons per gene (8.9 vs. 11.3 to 12.3). Its mean exon size is also the smallest at 180.2 bp (others range from 269.4 bp to 327.1 bp), resulting in a considerably shorter average transcript size (1,812.4 bp).
In terms of genome composition (Table 4), the P. nieuhofii genome has the lowest overall content dedicated to coding and genic regions. Exons constitute only 4% of the genome (48 Mb), which is notably low compared to the 8% to 13% seen elsewhere, but they exhibit the highest GC content at 51% (compared to 45% to 46% in the others). Genes collectively occupy 44% of the genome (481 Mb), marking the smallest genic fraction (others range from 55% to 68%). Introns, totaling 233,110, account for 40% of the genome (434 Mb) with an average length of 1,862 bp, also representing the lowest proportional content in the comparison. The structural and compositional differences collectively indicate that the P. nieuhofii genome exhibits a more compact gene structure and less complexity in transcript diversity compared to the other catfish assemblies studied.
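Two of the summary figures above follow directly from reported counts and can be reproduced as a quick consistency check. The sketch below uses only numbers given in the text; the variable names are illustrative.

```python
# Counts and means reported in the annotation results.
GENES = 30_099
TRANSCRIPTS = 37_734
INTRONS = 233_110
MEAN_INTRON_BP = 1_862

# Mean number of transcripts per gene.
transcripts_per_gene = TRANSCRIPTS / GENES

# Total intronic sequence implied by intron count and mean intron length.
intron_total_mb = INTRONS * MEAN_INTRON_BP / 1e6

print(f"Transcripts per gene: {transcripts_per_gene:.2f}")
print(f"Total intron length: {intron_total_mb:.1f} Mb")
```

The ratio rounds to the reported 1.3 transcripts per gene, and the implied intronic total matches the reported 434 Mb.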
To assess the completeness of the annotation, BUSCO analysis was conducted using the actinopterygii_odb10 lineage dataset. The analysis revealed that 97.7% of the 3,640 expected single-copy orthologs were complete, with 79.3% identified as single-copy and 18.5% as duplicated BUSCOs (https://doi.org/10.5281/zenodo.17429587).26 Only 0.7% were fragmented and 1.6% were missing, indicating a highly complete and well-annotated gene set. The high BUSCO score highlights the robustness of the genome annotation, affirming its appropriateness for subsequent biological and comparative analyses.
The project contains three underlying datasets:
The raw whole genome sequences are available in NCBI’s Sequence Read Archive (SRA): Whole genome sequence of Prophagorus nieuhofii, under accession number SRR34064805 (https://www.ncbi.nlm.nih.gov/sra/?term=SRR34064805). The raw sequences were also deposited and are accessible in the Zenodo repository: Data set for draft genome assembly of the slender walking catfish, Prophagorus nieuhofii (DOI: https://doi.org/10.5281/zenodo.16689652).28
The genome assembly data were deposited and made accessible in Dataverse: Replication data for draft genome assembly of the slender walking catfish, Prophagorus nieuhofii (DOI: https://hdl.handle.net/20.500.12690/RIN/ULQDHU).29 The genome annotation data were deposited and made accessible in the Zenodo repository: Genome annotation of the previously assembled genome of the slender walking catfish, Prophagorus nieuhofii (DOI: https://doi.org/10.5281/zenodo.17422526).30
A snail graph and the BUSCO assessment results for the genome assembly of the slender walking catfish, Prophagorus nieuhofii, have been deposited in Zenodo (DOI: https://doi.org/10.5281/zenodo.17429587).26
All the underlying and extended data of this study are openly available under the terms of Creative Commons Zero v1.0 (CC0 1.0) Universal Public Domain Dedication.
We would like to thank BRIN and LPDP for funding this research through the RIIM funding program. We would also like to thank Integrated Genome Factory (IGF) Faculty of Biology UGM and Yayasan Satriabudi Dharma Setia (YSDS) for providing the sequencing service and the computing facility access.
Are the rationale for sequencing the genome and the species significance clearly described?
Yes
Are the protocols appropriate and is the work technically sound?
Yes
Are sufficient details of the sequencing and extraction, software used, and materials provided to allow replication by others?
Yes
Are the datasets clearly presented in a usable and accessible format, and the assembly and annotation available in an appropriate subject-specific repository?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: genomics, genetics, breeding
Are the rationale for sequencing the genome and the species significance clearly described?
Yes
Are the protocols appropriate and is the work technically sound?
Partly
Are sufficient details of the sequencing and extraction, software used, and materials provided to allow replication by others?
Yes
Are the datasets clearly presented in a usable and accessible format, and the assembly and annotation available in an appropriate subject-specific repository?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Molecular biology, genomics, genome assembly
Are the rationale for sequencing the genome and the species significance clearly described?
Yes
Are the protocols appropriate and is the work technically sound?
Partly
Are sufficient details of the sequencing and extraction, software used, and materials provided to allow replication by others?
Partly
Are the datasets clearly presented in a usable and accessible format, and the assembly and annotation available in an appropriate subject-specific repository?
No
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Evolutionary biology, genomics, population genetics
Version 2 (revision): 10 Nov 25
Version 1: 14 Aug 25