Keywords
Genome assembly, reference genome, transcriptome, Aves, mitogenome
The brown thornbill (Acanthiza pusilla) is a songbird endemic to eastern Australia with five recognised subspecies. The most notable is the King Island brown thornbill (Acanthiza pusilla magnirostris), of which fewer than 100 individuals remain and which, based on expert elicitation, is the most likely Australian bird to become extinct in the next 20 years. We sequenced PacBio HiFi reads of the brown thornbill to generate a high-quality reference genome 1.25 Gb in size with a contig N50 of 20.1 Mb. Additionally, we sequenced mRNA from three tissues to generate a global transcriptome to aid genome annotation. This reference genome provides an important resource against which additional genomic data, to be produced in the near future, can be aligned.
The brown thornbill (Acanthiza pusilla) is a small species of songbird within the Acanthizidae family endemic to eastern and south-eastern Australia, including Tasmania (Higgins & Peter, 2002). There are five subspecies recognised within the brown thornbill including the Critically Endangered King Island brown thornbill (Acanthiza pusilla magnirostris). This taxon is considered the most likely Australian bird to become extinct within the next 20 years, based on expert elicitation (Geyle et al., 2018). Whilst the nominate brown thornbill is of least conservation concern, there are thought to be fewer than 100 King Island brown thornbills occurring on King Island (area 1,098 km²), in the Bass Strait (Bell, Webb, Holdsworth, & Baker, 2023). Whilst surveys are ongoing, the King Island brown thornbill is understood to be restricted to patches of mature eucalypt forest on King Island, where it primarily forages in the canopy and in the crevices of bark on tree trunks (Bell et al., 2023).
The generation of a reference genome and associated transcriptomic data is vital for informing genetic management of the King Island subspecies and provides a reference against which genetic data produced in the near future can be aligned. The genome is also the first for the genus Acanthiza, contributing to global efforts to sequence life on Earth (Lewin et al., 2022).
To facilitate detailed genomic research on this species, we sequenced DNA with PacBio HiFi long reads to generate a high-quality reference assembly and sequenced RNA from three tissues to provide transcriptomic resources and assist in genome annotation for the brown thornbill.
A single wild male brown thornbill (B974_KIBT) from Tasmania was captured using a mist net and euthanised for genome and transcriptome generation under Australian National University Animal Experimental Ethics Committee program of wildlife authorisation approval number #A2021/33 (approval date 13/07/2021) and Tasmanian Scientific Permit #TFA23010. Every effort was made to reduce suffering of animals, including (i) collecting the minimum number of animals required for the study (one); (ii) pre-arranging animal euthanasia with a qualified veterinarian; (iii) collection of the animal from as close to the location of the veterinarian as practically possible to minimise transportation time; and (iv) transportation in a soft, dark material ‘bird bag’ to minimise stress during transportation. Tissue samples were dissected and either flash frozen at -80°C or preserved in RNAlater before being frozen at -80°C. High molecular weight (HMW) DNA was then extracted from kidney tissue using the Nanobind Tissue Big DNA Kit v1.0 (Circulomics). A Qubit fluorometer was used to assess the concentration of DNA with the Qubit dsDNA BR assay kit (Thermo Fisher Scientific). Total RNA was extracted from liver, brain and gonads using the RNeasy Tissue Kit (Qiagen) with RNase-free DNase I set (Qiagen). RNA quality was determined using the NanoDrop (Thermo Fisher Scientific) and the RNA integrity number (RIN) was determined using the Bioanalyzer RNA 6000 Nano kit (Agilent 2100).
HMW DNA was sent for Pacific Biosciences High Fidelity (PacBio HiFi) library preparation with the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences) and sequencing on one single molecule real-time (SMRT) cell of the PacBio Revio machine at the Australian Genome Research Facility (St Lucia, Australia). Total RNA from the liver, brain and gonads was sequenced as 150 bp paired-end (PE) reads using an Illumina NovaSeq X with Illumina Stranded mRNA library preparation at the Ramaciotti Centre for Genomics (University of New South Wales, Kensington, Australia).
The genome assembly was conducted on Galaxy Australia (The Galaxy Community, 2022) public server usegalaxy.org.au (Afgan et al., 2016) running the Genome assembly with ‘hifiasm’ (RRID:SCR_021069) on Galaxy Australia workflow v2.1 (Price & Farquharson, 2022). Briefly, Picard (http://broadinstitute.github.io/picard) (Galaxy version 2.18.2.2; RRID:SCR_006525) SamToFastq, samtools (Danecek et al., 2021; Li et al., 2009) (Galaxy version 2.0.3; RRID:SCR_002105) flagstat and FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc) (Galaxy version 0.72; RRID:SCR_014583) were used to convert BAM files to FASTQ and quality check the reads for input to hifiasm (Cheng, Concepcion, Feng, Zhang, & Li, 2021; Cheng et al., 2022). Hifiasm (Galaxy version 2.1) was run on Galaxy Australia to assemble the genome. Basic genome assembly statistics were calculated using the stats.sh script in BBMap v37.98 (sourceforge.net/projects/bbmap/) (RRID:SCR_016965). Genome completeness was determined using Benchmarking Universal Single-Copy Orthologues (BUSCO; RRID:SCR_015008) v5.4.6 (Simao, Waterhouse, Ioannidis, Kriventseva, & Zdobnov, 2015) with both the vertebrata_odb10 (n = 3354) and aves_odb10 (n = 8338) lineages on Galaxy Australia. Genome completeness and base accuracy were also determined using Merqury v1.3 (RRID:SCR_022964) (Rhie, Walenz, Koren, & Phillippy, 2020), implemented in the Genome assessment post assembly workflow on Galaxy Australia (Price, 2023). Repetitive elements of the genome were identified, classified and masked on a Pawsey Supercomputing Centre Nimbus cloud machine (256 GB RAM, 64 vCPU, 3 TB storage) by building a database using RepeatModeler v2.0.1 (RRID:SCR_015027) (Flynn et al., 2020); repeats were then masked using RepeatMasker v4.0.9 (RRID:SCR_012954) (Smit, Hubley, & Green, 2013-2015) with the -nolow parameter to avoid masking low complexity repeats.
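The contig statistics reported for an assembly (N50 and L50) follow a simple definition that can be sketched in a few lines. The function below is an illustrative sketch only and is not the stats.sh implementation; the function name and the toy contig lengths are hypothetical and do not come from the brown thornbill assembly.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths in bp.

    N50 is the length of the contig at which the running total of
    lengths, sorted from longest to shortest, first reaches half of
    the assembly size. L50 is the number of contigs needed to get there.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("contig_lengths must not be empty")


# Toy example with hypothetical contig lengths (total = 300 bp).
toy_contigs = [100, 80, 60, 40, 20]
n50, l50 = n50_l50(toy_contigs)
print(f"N50 = {n50} bp, L50 = {l50}")
```

A larger N50 together with a smaller L50 indicates that more of the assembly is contained in fewer, longer contigs.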
The contig representing the mitochondrial genome was identified from the reference genome assembly using MitoHiFi v2 (Allio et al., 2020; Uliano-Silva et al., 2023) and visualised using Proksee (Grant et al., 2023). MitoHiFi identified the yellow thornbill (Acanthiza nana) as the most taxonomically closely related publicly available mitochondrial genome (KY994614.1), which was then used as the reference to search for the brown thornbill mitochondrial genome.
Transcriptome assembly was performed on the University of Sydney’s High Performance Computer, Artemis. Raw transcriptome reads were quality assessed pre- and post-trimming with FastQC v0.11.8 (RRID:SCR_014583). Reads were quality trimmed using Trimmomatic v0.39 (RRID:SCR_011848) (Bolger, Lohse, & Usadel, 2014) with the parameters ILLUMINACLIP:2:30:10 (TruSeq3-PE adapters), SLIDINGWINDOW:4:5, LEADING:5, TRAILING:5 and MINLEN:25. The repeat masked genome was indexed and trimmed reads aligned using the --dta parameter with hisat2 v2.1.0 (RRID:SCR_015530) (Kim, Paggi, Park, Bennett, & Salzberg, 2019). Resulting SAM files were converted to BAM format and sorted using samtools v1.9 (Danecek et al., 2021; Li et al., 2009). Stringtie v2.1.6 (RRID:SCR_016323) (Pertea et al., 2015) was used to generate a GTF for each transcriptome. Stringtie v2.1.6 with the --merge parameter merged transcripts into a global transcriptome, retaining only transcripts with a fragments per kilobase of exon per million mapped fragments (FPKM) > 0.1 and length > 30. CPC2 v2019-11-19 (Kang et al., 2017) was used to predict coding potential and only transcripts predicted to be coding were retained. TransDecoder v2.0.1 (https://github.com/TransDecoder/TransDecoder) (RRID:SCR_017647) was used to predict open reading frames in the global transcriptome with a minimum transcript length of 20. Transcriptome completeness was assessed using BUSCO v5.4.6 (Simao et al., 2015) with the vertebrata_odb10 (n = 3354) and aves_odb10 (n = 8338) lineages on Galaxy Australia.
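To make the expression filter concrete, the sketch below computes FPKM from its standard definition and applies the two thresholds stated above (FPKM > 0.1 and length > 30). It illustrates the arithmetic only and is not how StringTie estimates abundance internally; the function names and example counts are hypothetical.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total mapped fragments must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def keep_transcript(fragments, transcript_length_bp, total_mapped_fragments,
                    min_fpkm=0.1, min_length=30):
    """Apply the thresholds described in the text (both strict inequalities)."""
    if transcript_length_bp <= min_length:
        return False
    value = fpkm(fragments, transcript_length_bp, total_mapped_fragments)
    return value > min_fpkm


# Hypothetical example: 10 fragments on a 1,000 bp transcript
# in a library of 1,000,000 mapped fragments.
print(fpkm(10, 1000, 1_000_000))
print(keep_transcript(10, 1000, 1_000_000))
```

Normalising by both transcript length and library size is what allows a single threshold to be applied across the three tissues.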
Genome annotation was performed using FgenesH++ v7.2.2 (Softberry; RRID:SCR_018928 (Solovyev, Kosarev, Seledsov, & Vorobyev, 2006)) using the longest open reading frame as predicted from the global transcriptome, non-mammalian settings and optimised parameters supplied with the American crow (Corvus brachyrhynchos) gene finding matrix. BUSCO v5.4.6 (Simao et al., 2015) in protein mode was run on Galaxy Australia to assess the completeness of the annotation with the vertebrata_odb10 (n = 3354) and aves_odb10 (n = 8338) lineages. The ‘genestats’ script (https://github.com/darencard/GenomeAnnotation) was used to obtain the average number of exons and introns and the average exon and intron length.
The hifiasm assembly of the brown thornbill from PacBio HiFi data resulted in a genome 1.25 Gb in size consisting of 1,000 contigs, sequenced to a depth of 43x. The longest contig in the assembly is 97.7 Mb and the assembly has an N50 of 20.1 Mb and L50 of 17 (Table 1). The genome is also highly complete, with 96.9% complete Aves BUSCOs present in the assembly (Table 1). Merqury analysis also indicated a high-quality genome with QV > 59 and 87.1% complete k-mers. The mitochondrial genome is 16,862 bp and contains 37 genes: 13 protein-coding genes, 22 tRNAs and 2 rRNAs (Figure 1). Repeat masking identified 19.06% of the genome as repeats (Table 2), which is in a similar range to other bird species (Zhang et al., 2014).
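The consensus quality value (QV) is a Phred-scaled score, so it can be converted to an approximate per-base error probability with QV = -10 log10(p). The short sketch below shows this conversion; the function names are illustrative and are not part of Merqury.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in (0, 1]")
    return -10 * math.log10(error_rate)


# QV 59 is the lower bound reported in the text.
p = qv_to_error_rate(59)
print(f"error probability at QV 59: {p:.3g}")
print(f"roughly one error per {1 / p:,.0f} bases")
```

Under this conversion, a QV above 59 corresponds to fewer than about 1.3 errors per million bases of consensus sequence.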
All individual tissues had alignment rates greater than 85% against the repeat masked reference genome (liver: 93.52%, brain: 91.75% and gonads: 89.24%). A total of 45,082 transcripts were predicted to have coding potential and 12,549 longest open reading frame transcripts were used as input for genome annotation with FgenesH++. A total of 29,706 genes were predicted by the FgenesH++ annotation software, with the annotation containing 73.9% complete aves_odb10 BUSCOs (Table 3). Genes contained an average of 7.42 exons and 6.42 introns (Table 3).
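The reported averages differ by exactly one (7.42 exons versus 6.42 introns), which is what one would expect if each gene model with n exons contributes n - 1 introns. The sketch below illustrates that relationship under the assumption of one transcript model per gene with at least one exon; it is not the ‘genestats’ script, and the exon counts are hypothetical.

```python
def mean_exons_and_introns(exons_per_gene):
    """Return (mean exons, mean introns) given the exon count of each gene.

    Assumes one transcript model per gene, so a gene with n exons
    has n - 1 introns (and a single-exon gene has none).
    """
    if not exons_per_gene:
        raise ValueError("exons_per_gene must not be empty")
    n_genes = len(exons_per_gene)
    mean_exons = sum(exons_per_gene) / n_genes
    mean_introns = sum(max(n - 1, 0) for n in exons_per_gene) / n_genes
    return mean_exons, mean_introns


# Hypothetical exon counts for three gene models.
toy_exon_counts = [1, 5, 9]
print(mean_exons_and_introns(toy_exon_counts))
```

When every gene has at least one exon, the mean intron count is always the mean exon count minus one, which matches the values in the text.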
Luke W. Silver
Roles: Data Curation, Formal Analysis, Investigation, Software, Methodology, Writing – Original Draft Preparation
Ross Crates
Roles: Conceptualization, Funding Acquisition, Data Collection, Administration, Writing – Original Draft Preparation
Dejan Stojanovic
Roles: Data collection, Writing – Review & Editing
Catherine M. Young
Roles: Data collection, Writing – Review & Editing
Katherine Belov
Roles: Conceptualization, Funding Acquisition, Supervision, Writing – Review & Editing
Katherine A. Farquharson
Roles: Methodology, Supervision, Writing – Review & Editing
Rob Heinsohn
Roles: Administration, Supervision, Writing – Review & Editing
Carolyn J. Hogg
Roles: Conceptualization, Funding Acquisition, Project Administration, Supervision, Writing – Review & Editing
The raw PacBio HiFi and transcriptome data are publicly available through the Bioplatforms Australia Threatened Species Initiative: https://data.bioplatforms.com/organization/threatened-species. The assembled genome, global transcriptome and annotation generated in this study are available on Amazon Web Services Australasian Genomes Open Data Store: https://awgg-lab.github.io/australasiangenomes/genomes.html.
Raw genome and transcriptome sequences are also available from:
NCBI’s Sequence Read Archive (SRA): Raw RNA data for the generation of transcriptome. SRR26937195 (https://www.ncbi.nlm.nih.gov/biosample/38393458) (Silver et al., 2024)
NCBI’s Sequence Read Archive (SRA): Raw RNA data for the generation of transcriptome. SRR26937196 (https://www.ncbi.nlm.nih.gov/biosample/38393457) (Silver et al., 2024)
NCBI’s Sequence Read Archive (SRA): Raw RNA data for the generation of transcriptome. SRR26937197 (https://www.ncbi.nlm.nih.gov/biosample/38393456) (Silver et al., 2024)
NCBI’s Sequence Read Archive (SRA): Raw DNA data for the generation of genome. SRR26937198 (https://www.ncbi.nlm.nih.gov/biosample/38393455) (Silver et al., 2024)
The data produced as part of this study are stored on NCBI under BioProject PRJNA1044448. Databases of molecular data on the NCBI Web site include such examples as nucleotide sequences (GenBank), protein sequences, macromolecular structures, molecular variation, gene expression, and mapping data. They are designed to provide and encourage access within the scientific community to sources of current and comprehensive information. Therefore, NCBI itself places no restrictions on the use or distribution of the data contained therein.
Figshare: Checklist for ARRIVE.pdf for Genomic and transcriptomic resources for the brown thornbill (Acanthiza pusilla) to support the conservation of a critically endangered subspecies, https://doi.org/10.6084/m9.figshare.25396282.v1
The authors would like to thank the Tasmanian Museum and Art Gallery (TMAG) for dissection, storage and shipping of the samples used to generate the genome and transcriptomes. Computational resources were provided by the Australian FGENESH++ Service provided by the Australian BioCommons and the Pawsey Supercomputing Research Centre with funding from the Australian Government and the Government of Western Australia; Galaxy Australia, a service provided by the Australian BioCommons and its partners; and the University of Sydney’s High Performance Computing facility Artemis provided by the Sydney Informatics Hub. The authors wish to acknowledge the use of the services and facilities of the Ramaciotti Centre for Genomics, UNSW and of the Australian Genome Research Facility.
Are the rationale for sequencing the genome and the species significance clearly described?
Yes
Are the protocols appropriate and is the work technically sound?
Yes
Are sufficient details of the sequencing and extraction, software used, and materials provided to allow replication by others?
Yes
Are the datasets clearly presented in a usable and accessible format, and the assembly and annotation available in an appropriate subject-specific repository?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Genome assembly, population genomics
Version 1: 23 Apr 24