Keywords
Caiman crocodilus, spectacled caiman, genome, assembly, next-generation sequencing, crocodilian, vertebrate genome
The common, or spectacled, caiman Caiman crocodilus is an abundant, widely distributed Neotropical crocodilian exhibiting notable morphological and molecular diversification. As the type species for the Caimaninae subfamily - the sister taxon of the subfamily to which members of the genus Alligator belong - C. crocodilus occupies a key position in our understanding of crocodilian and archosaur genetics and evolution. The species also accounts for by far the largest share of crocodilian hides on the global market, with the C. crocodilus hide trade alone valued at about US$86.5 million per year. Thus, the genome sequence of C. crocodilus can potentially be of considerable use for both basic and applied research. We obtained 239,911,946 paired-end reads comprising approximately 72 Gbp using Illumina™ sequencing of tissue sampled from a single Caiman crocodilus individual. These reads were assembled de novo and progressively aligned against the genomes of increasingly related crocodilians; Liftoff was used to annotate the draft C. crocodilus genome assembly based on an Alligator mississippiensis (a confamilial species) annotation. The draft C. crocodilus genome assembly and sequence reads have been deposited with the National Center for Biotechnology Information with accession numbers JAGPOW000000000.1 for the assembly and SRR22317059 for the sequence read archive, under BioProject PRJNA716363.
The key comments both Reviewers highlighted are the need to (i) make the underlying data more readily available by including them in more established and standard repositories, and (ii) include additional analyses characterizing the draft assembly and annotation results. Briefly, in response to comments by all the reviewers, we report further summary statistics that allow readers to put our genome assembly in context, including aspects of the annotation requested by the reviewers based on the annotation submitted to Genbank.
We also now include NCBI accession numbers for the sequence read archive (SRA) and draft assembly. The draft annotation submitted to GenBank has been in processing there for some time, so we had not yet been issued an accession number at the time of submitting the present revision. Nevertheless, we note that our annotation submission has passed all automated checks on NCBI’s end.
See the authors' detailed response to the review by Marc Tollis
See the authors' detailed response to the review by Steven Salzberg
The common, or spectacled, caiman, Caiman crocodilus, is one of the most widely distributed and abundant crocodilian species, ranging continuously from Mexico to Argentina (Busack and Pandya 2001; US Fish and Wildlife Service 2018). A generalist predator, C. crocodilus is remarkably adaptable, occupying a wide range of habitats from urban areas to seasonal savannahs to tropical rainforests (Medem 1981, 1983), and has recently been introduced to Cuba, Puerto Rico and Florida, where it is considered an invasive species (US Fish and Wildlife Service 2018). This broad distribution and diversity of habitats have facilitated considerable intraspecific diversification within C. crocodilus; a recent analysis by Roberto et al. (2020) identified between seven and ten lineages within C. crocodilus across differing biogeographic regions and watersheds throughout Central and South America. Within-species diversity is also morphologically apparent, with skull shape in particular exhibiting systematic patterns of regional differentiation (Medem 1955; Gans 1980; Medem 1981, 1983; Ayarzaguena 1984; Escobedo-Galván et al. 2015). These intraspecific patterns of cranial shape variation within C. crocodilus have been shown to parallel patterns of interspecific cranial diversity found in extant crocodilians (Okamoto et al. 2015).
Additionally, C. crocodilus is a species of commercial importance, chiefly in the leather industry. While the hides of C. crocodilus contain osteoderms that render the manufacturing process more difficult than for other crocodilians, a majority of the approximately 1.5 million crocodilian skins traded globally come from C. crocodilus (Brazaitis et al. 1998; Caldwell 2015). As with other crocodilians, most legal hides come from commercial farming operations, and the market for caiman hides is estimated to be over US $85 million (Caldwell 2015). Wild populations of C. crocodilus are also hunted for meat and even fishing bait (Da Silveira and Thorbjarnarson 1999; Brum et al. 2015; Pimenta et al. 2018) and provide ecosystem services including nutrient cycling and biological control (Valencia-Aguilar et al. 2013; Marley et al. 2019). Due to its role as an apex predator, C. crocodilus exhibits considerable bioaccumulation, with genotoxic analyses demonstrating molecular signatures of pollution on the C. crocodilus genome (Oliveira et al. 2021).
Thus, a draft genome sequence for C. crocodilus can not only assist with improved husbandry, ecotoxicology and wildlife management, but also has the potential to provide insight into evolutionary processes driving intraspecific diversification in continental systems more broadly.
Such a genome sequence can further propel both basic and applied research beyond C. crocodilus. At present, five other crocodilian genome sequence assemblies are available - two each in the genera Alligator (the American and Chinese alligator, A. mississippiensis and A. sinensis, respectively) and Crocodylus (the Saltwater and Cuban crocodile, Cr. porosus and Cr. rhombifer, respectively), and one - the Gharial Gavialis gangeticus - in the genus Gavialis. Beyond their utility to economic (e.g., Miles et al. 2009) and conservation (e.g., Vashistha et al. 2020; Yang et al. 2023) activities, crocodilian genome assemblies have facilitated the investigation of such basic research questions as the evolution of temperature-dependent sex determination (e.g., Rice et al. 2017), the rate and nature of archosaur genome evolution (especially as determined in comparison with avian genomes - e.g., St. John et al. 2012; Green et al. 2014; Brittain et al. 2021), and the genetic basis of key evolutionary adaptations in amniotes, including, among others, immune responses (e.g., Wan et al. 2013; López-Pérez et al. 2022; Merchant et al. 2024), morphogenesis (e.g., Kusumi et al. 2013; Wu et al. 2018; Morris and Abzhanov 2021) and globin expression (e.g., Wan et al. 2013; Hoffmann et al. 2018; Natarajan et al. 2023). A C. crocodilus genome sequence could therefore provide a useful complement to these broader comparative genomic studies, which routinely use genomes from the genus Alligator, by including genomic data from a widely distributed, living representative of Alligator’s sister taxon.
DNA was extracted from a tissue sample belonging to a single Caiman crocodilus museum specimen (UF-FLMNH 171438) using the DNeasy™ kit from Qiagen (Hilden, Germany). DNA was quantitated using Thermofisher’s (Waltham, MA, USA) Picogreen™ kit (for a final Picogreen concentration of 77.78 ng/µL). Tecan’s (Männedorf, Switzerland) NuGEN Celero™ kit was then used to construct a paired-end library, which was subsequently sequenced on a single Illumina (San Diego, CA, USA) NovaSeq S4 lane. This yielded 239,911,946 paired-end reads of 2 × 150 bp each. Nucleic acid isolation, quantitation, library generation and raw-read sequencing were performed at the University of Minnesota Genomics Center.
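As a quick consistency check, the read count above can be converted into total sequenced bases and a nominal sequencing depth. The sketch below is only illustrative arithmetic: it assumes the reported figure counts read pairs (two 150 bp mates each), and it uses the draft assembly length reported in the Results as a stand-in for genome size. The depth value is derived here and is not a figure reported by the authors.

```python
# Figures taken from the text; the depth calculation itself is an illustration.
READ_PAIRS = 239_911_946            # paired-end reads, assumed to mean read pairs
READ_LENGTH_BP = 150                # each mate is 150 bp (2 x 150 bp)
ASSEMBLY_LENGTH_BP = 2_341_057_913  # draft assembly length reported in the Results


def total_bases(read_pairs: int, read_length_bp: int) -> int:
    """Total sequenced bases for a paired-end run (two mates per pair)."""
    return read_pairs * 2 * read_length_bp


def nominal_depth(bases: int, genome_length_bp: int) -> float:
    """Sequenced bases divided by genome length (no filtering or mapping applied)."""
    return bases / genome_length_bp


bases = total_bases(READ_PAIRS, READ_LENGTH_BP)
depth = nominal_depth(bases, ASSEMBLY_LENGTH_BP)
print(f"{bases:,} bases (~{bases / 1e9:.0f} Gbp), nominal depth ~{depth:.1f}x")
```

Under these assumptions the total comes to 71,973,583,800 bases, which matches the approximately 72 Gbp stated in the abstract, and the nominal depth is roughly 30-fold.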
The paired-end reads (available in the Sequence Read Archive at GenBank under accession number SRR22317059) were assembled de novo using the Iterative de Bruijn Graph Assembler (IDBA-UD; Peng et al. 2012). To assess the reliability of our pipeline from sequencing to de novo assembly using IDBA-UD, we repeated the sequencing and assembly using a museum-derived tissue sample from a single Alligator mississippiensis individual (UF-FLMNH 175565). This resulted in 249,325,204 paired-end reads of 2 × 150 bp each. As was the case for the C. crocodilus individual, the reads were then assembled de novo using IDBA-UD, and we used QUAST (Gurevich et al. 2013) to determine that the IDBA-UD assembly of A. mississippiensis captured approximately 94.2% of a recently published A. mississippiensis assembly (GCA_000281125.4; Rice et al. 2017), with an NG50 of 21,172 bp based on de novo assembled contigs alone.
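For readers unfamiliar with the contiguity statistics used here and in the Results, the following is a minimal sketch of how N50 and NG50 are conventionally defined. It is not QUAST's implementation; the function names and the toy contig lengths are hypothetical.

```python
from typing import Iterable, List


def nx50(lengths: Iterable[int], target_total: int) -> int:
    """Length L of the sequence at which the running sum of lengths, taken from
    longest to shortest, first reaches half of target_total (0 if never reached)."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= target_total:
            return length
    return 0


def n50(lengths: Iterable[int]) -> int:
    """N50: the half-way point is measured against the assembly's own total length."""
    as_list: List[int] = list(lengths)
    return nx50(as_list, sum(as_list))


def ng50(lengths: Iterable[int], genome_size: int) -> int:
    """NG50: the half-way point is measured against a reference or estimated genome size."""
    return nx50(lengths, genome_size)


# Toy contig lengths (hypothetical), summing to 300.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))        # half of 300 is reached at the 70-long contig
print(ng50(toy_contigs, 400))  # half of 400 is reached at the 50-long contig
```

Because NG50 is scaled to a reference genome size rather than to the assembly itself, it is the more conservative figure when an assembly is incomplete, which is why it suits the comparison against the published A. mississippiensis assembly.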
We scaffolded the resulting draft C. crocodilus contigs using a two-step procedure. First, we scaffolded the caiman’s contigs against a Crocodylus porosus assembly (GCF_001723895.1; Ghosh et al. 2020) using RagTag (Alonge et al. 2019). We then re-scaffolded the resulting contigs/scaffolds against the confamilial Alligator mississippiensis assembly (GCA_000281125.4), again using RagTag.
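The two-step procedure can be sketched as a pair of chained command lines. This is only an outline under stated assumptions, not the authors' actual invocation: it assumes RagTag's `ragtag.py scaffold` subcommand with its `-o` output-directory option, all file and directory names are hypothetical, and the name of the scaffolded FASTA written by the first step (`ragtag.scaffold.fasta` here) is an assumption that may differ between RagTag versions. The code only builds and prints the commands; it does not run them.

```python
import shlex
from pathlib import Path
from typing import List


def ragtag_scaffold_cmd(reference: str, query: str, out_dir: str) -> List[str]:
    """Build one reference-guided scaffolding command (assumed RagTag CLI layout)."""
    return ["ragtag.py", "scaffold", "-o", out_dir, reference, query]


def two_step_commands(contigs: str, first_ref: str, second_ref: str,
                      work_dir: str = "ragtag_out") -> List[List[str]]:
    """Scaffold against a more distant reference first, then re-scaffold the
    result against the closer (confamilial) reference."""
    step1_dir = str(Path(work_dir) / "step1_crocodylus")
    step2_dir = str(Path(work_dir) / "step2_alligator")
    # Assumed output filename of step 1; check the RagTag version in use.
    step1_out = str(Path(step1_dir) / "ragtag.scaffold.fasta")
    return [
        ragtag_scaffold_cmd(first_ref, contigs, step1_dir),
        ragtag_scaffold_cmd(second_ref, step1_out, step2_dir),
    ]


# Hypothetical file names standing in for the contigs and the two references.
commands = two_step_commands("caiman_contigs.fa", "crocodylus_ref.fa", "alligator_ref.fa")
for cmd in commands:
    print(shlex.join(cmd))
```

Ordering the references from more distant to more closely related means the final scaffold layout is decided by the closest available genome, while the first pass can already join contigs that the second reference then refines.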
Contaminants, mitochondrial DNA, vectors, adapters, and sequences shorter than 200 bp that were flagged by NCBI were manually removed using seqkit (Shen et al. 2016) and custom scripts (available at http://github.com/kewok/ncbi_scrubber). The genome assembly has been deposited in GenBank with accession number JAGPOW000000000.1.
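The kind of filtering described above can be illustrated with a small, dependency-free sketch. It is a plain-Python stand-in, not the authors' seqkit commands or their ncbi_scrubber scripts: the function names, the toy records and the list of flagged identifiers are all hypothetical.

```python
from typing import FrozenSet, Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Parse FASTA text line by line into (header, sequence) records."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_records(records: Iterable[Record], min_length: int = 200,
                   exclude_ids: FrozenSet[str] = frozenset()) -> Iterator[Record]:
    """Keep records of at least min_length bp whose ID is not in exclude_ids."""
    for header, seq in records:
        fields = header.split()
        seq_id = fields[0] if fields else ""
        if len(seq) >= min_length and seq_id not in exclude_ids:
            yield header, seq


# Toy input: one good scaffold, one short contig, one flagged (e.g. mitochondrial) sequence.
toy_fasta = [
    ">scaffold_1 toy", "ACGT" * 40, "ACGT" * 25,   # 260 bp across two lines
    ">contig_2", "ACGT" * 10,                      # 40 bp, too short
    ">contig_3", "G" * 300,                        # long enough, but flagged
]
kept = list(filter_records(read_fasta(toy_fasta), min_length=200,
                           exclude_ids=frozenset({"contig_3"})))
print([header.split()[0] for header, _ in kept])
```

In practice the set of identifiers to exclude would come from the contamination report returned by NCBI during submission, which is why the exclusion list is passed in as data here.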
The resulting scaffold (10.5281/zenodo.4755063) was then masked using RepeatMasker (Smit et al. 2015) relying on the HMMER database (Finn et al. 2011) and with “alligator” specified as species. Liftoff (Shumate and Salzberg 2020) was then used to generate a draft annotation based on the masked assembly using the annotations associated with A. mississippiensis (GCA_000281125.4; Rice et al. 2017) as a reference.
table2asn_GFF (National Center for Biotechnology Information 2020) was used to generate a Sequin file (National Center for Biotechnology Information (US) 2014), and features flagged as errors were manually removed using custom scripts (available at https://github.com/kewok/ncbi_scrubber); as of December 2024 the draft annotation is available at 10.5281/zenodo.4755063.
Our assembly yielded a draft genome sequence of length 2,341,057,913 bp with 465,471 scaffolds and 723,636 contigs. Our draft C. crocodilus genome assembly has a scaffold N50 of 70,464,410 bp, or approximately 70.5 Mbp (Telatin et al. 2021). For context, in other crocodilian assemblies, scaffold N50s of approximately 478.2 Kbp, 2.2 Mbp, 96.1 Mbp, 84.4 Mbp and 255.1 Mbp are reported for the Cuban crocodile (GCA_038503035.1; Meredith et al. 2024), the Chinese alligator (GCF_000455745.1; Wan et al. 2013), the gharial (Green et al. 2014), the Saltwater crocodile (GCF_001723895.1; Rice et al. 2017), and the American alligator (GCF_030867095.1), respectively. Among other reptile reference genome assemblies, the scaffold N50 we report is comparable in value to those reported for the reference genome assemblies of the common mock viper (Psammodynastes pulverulentus; GCA_024509165.1), the rock pigeon (Columba livia; GCF_036013475.1) and the Asian water monitor (Varanus salvator; GCA_023646645.1).
A QUAST analysis of contigs longer than 3,000 bp against the reference A. mississippiensis assembly GCA_000281125.4_ASM28112v4 identified 211 local misassemblies and 22 misassemblies (of which 14 are contig translocations, 6 are scaffold relocations and 2 are scaffold translocations). The total length of the misassembled contigs is 4,572,832 bp.
We further used BUSCO (Simão et al. 2015) to evaluate the gene completeness of our C. crocodilus draft genome, querying against the sauropsida_odb10 database. This assessment yielded 7,224 out of 7,480 complete BUSCOs (for a completeness score of 96.5%), of which 7,176 were single-copy complete BUSCOs and 48 were duplicated BUSCOs.
A total of 297,374 gene features were predicted for the annotation. Using AGAT (Dainat 2020), we identified 18,836 functional transcripts, 20,020 mRNAs and 15,981 coding sequences with average lengths of 37,890 bp, 39,524 bp and 1,063 bp, respectively. We identified 115,941 exons (average length 562 bp), with an average of 7.3 exons per coding sequence, and 99,960 introns (average length 4,334 bp) in the coding sequences. We further used HMMER (Eddy 2011) to determine the number of Pfam protein families database hits for our annotation, finding 9,983 hits at the E=0.00001 sequence reporting threshold. The number of hits was comparable at the E=0.001 (10,255 hits) and E=0.0000001 (9,745 hits) levels. Finally, we conducted a reciprocal BLAST hit analysis against an annotation for the Saltwater crocodile Crocodylus porosus (GCF_001723895.1_CroPor_comp1), a non-alligatorid crocodilian for which an annotation is presently available. Briefly, this Cr. porosus annotation contains 28,663 coding sequences with an average length of 1,527 bp and 19,538 genes and pseudogenes. Using proteinortho (Lechner et al. 2011), our reciprocal BLAST hit analysis found 7,894 orthologous groups across the annotations.
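The idea behind a reciprocal best hit analysis can be sketched in a few lines. This is a deliberately simplified illustration and not proteinortho's algorithm, which applies additional criteria when forming orthologous groups; the function names, sequence identifiers and bit scores below are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, str, float]  # (query ID, subject ID, bit score)


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """For each query, keep the subject with the highest bit score."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit], b_vs_a: Iterable[Hit]) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy search results in both directions (hypothetical IDs and scores).
caiman_vs_croc = [("cai1", "cro1", 500.0), ("cai1", "cro2", 200.0),
                  ("cai2", "cro2", 300.0), ("cai3", "cro1", 100.0)]
croc_vs_caiman = [("cro1", "cai1", 480.0), ("cro2", "cai2", 310.0),
                  ("cro2", "cai1", 150.0)]
pairs = reciprocal_best_hits(caiman_vs_croc, croc_vs_caiman)
print(pairs)
```

In the toy data, cai3 finds cro1 as its best hit, but cro1's best hit is cai1, so that pair is not reciprocal and is excluded. Requiring agreement in both directions is what makes the resulting pairs plausible ortholog candidates rather than merely similar sequences.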
Here we have described the first draft assembly and annotation of the C. crocodilus genome. We feel these data can assist natural resource management, ecotoxicology and agriculture, as well as research into broader questions about the interplay between microevolutionary and macroevolutionary processes across broad biogeographic scales. In addition to potentially facilitating both basic and applied research into C. crocodilus biology, our C. crocodilus genome sequence expands the available crocodilian genome sequences to include the subfamily Caimaninae, the extant sister group to Alligator and a major crocodilian lineage hitherto unrepresented among assembled genome sequences. Our assembly can thus provide a useful resource not only for crocodilian genomics, but also for archosaur, reptile and amniote comparative genomics more broadly.
The draft C. crocodilus genome assembly and sequence data have been deposited with the National Center for Biotechnology Information with accession numbers JAGPOW000000000.1 for the assembly and SRR22317059 for the sequence read archive, under BioProject PRJNA716363. At present, the draft annotation is in processing at the National Center for Biotechnology Information and is available for review at doi.org/10.5281/zenodo.4755063.
We are especially indebted to Dr. P. S. Soltis, T. A. Lott and the Genetic Resources Repository at the University of Florida - Florida Natural History Museum (UF-FLNHM) for generously providing us with tissue samples. We would like to thank the University of Minnesota Genomics Center (Minneapolis, MN, USA) for their guidance and for isolating DNA from museum samples, and for performing library preparation and raw sequencing. We wish to thank the Minnesota Supercomputing Institute (MSI) at the University of Minnesota and the Department of Chemistry at the University of St. Thomas for providing critical computational resources that contributed to the research results reported within this paper. Finally, we are very grateful to Dr. S. Pirro and Dr. M. Kieras at Iridian Genomes (Bethesda, MD, USA) for valuable insight on scaffolding the draft assemblies, as well as two Reviewers for comments that significantly improved the manuscript.
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: genomics, bioinformatics
Are the rationale for sequencing the genome and the species significance clearly described?
Yes
Are the protocols appropriate and is the work technically sound?
Partly
Are sufficient details of the sequencing and extraction, software used, and materials provided to allow replication by others?
Yes
Are the datasets clearly presented in a usable and accessible format, and the assembly and annotation available in an appropriate subject-specific repository?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Comparative genomics, phylogenetics, vertebrates
Are the rationale for sequencing the genome and the species significance clearly described?
Yes
Are the protocols appropriate and is the work technically sound?
Partly
Are sufficient details of the sequencing and extraction, software used, and materials provided to allow replication by others?
Yes
Are the datasets clearly presented in a usable and accessible format, and the assembly and annotation available in an appropriate subject-specific repository?
No
Alongside their report, reviewers assign a status to the article:
Version | Invited Reviewer 1 | Invited Reviewer 2 |
---|---|---|
Version 2 (revision), 15 Jan 25 | read | |
Version 1, 02 Dec 21 | read | read |