Keywords
genome assembly, groundhog, woodchuck, genome annotation
This article is included in the Genomics and Genetics gateway.
Groundhogs (Marmota monax), also known as woodchucks, belong to the same family of ground squirrels as the alpine marmot, Marmota marmota. Groundhogs are found throughout the eastern United States and across much of Canada. They are small, ground-dwelling rodents that weigh ~4 kg as adults.
The woodchuck is of interest to biomedical science as a model for hepatitis B virus (HBV) infection in humans, due to endemic infections of woodchucks with woodchuck hepatitis virus (WHV), which is genetically similar to human HBV and causes a similar course of infection [1]. Unlike some animal models of hepatocellular carcinoma (HCC) that require immunocompromised animals, woodchucks can develop HCC spontaneously after WHV infection. This propensity makes the woodchuck a promising model of HBV-induced hepatocellular carcinoma in humans, which in turn motivated our efforts to sequence, assemble, and annotate its genome.
DNA was collected from a healthy, wild-caught adult male woodchuck (WC2) captured in 2016 near Ithaca, New York by Northeastern Wildlife, Inc. The gDNA was isolated from the left medial lobe of the liver from animal WC2. All DNA used for sequencing came from the same animal.
We generated 3.17 billion paired, 150-bp Illumina reads, for a total of 951 Gbp or approximately 390X genome coverage. We generated 32 million reads using Pacific Biosciences sequencing technology, of which 2.59 million were at least 10,000 bp long. The long PacBio reads contained 42.0 Gbp and had an N50 length of 16,554 bp. We also generated 6.4 million Oxford Nanopore (ONT) reads, of which 1.57 million were at least 10,000 bp long. The long ONT reads totaled 22.2 Gbp and had an N50 length of 13,815 bp. We then assembled the Illumina reads, the PacBio reads of at least 10 kb, and the ONT reads of at least 10 kb using MaSuRCA v3.2.7 [2].
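The yield and coverage figures above follow from simple arithmetic. The sketch below reproduces the 951 Gbp total under the assumption that "3.17 billion paired reads" means 3.17 billion read pairs (two 150-bp reads each), and then back-calculates the genome size implied by the quoted ~390X coverage. The source does not state which genome size estimate was used, so that value is an inference, and the variable and function names are illustrative only.

```python
# Rough check of the Illumina yield and coverage figures quoted in the text.
READ_PAIRS = 3.17e9       # assumption: "3.17 billion paired reads" = read pairs
READS_PER_PAIR = 2
READ_LENGTH_BP = 150

total_bp = READ_PAIRS * READS_PER_PAIR * READ_LENGTH_BP
total_gbp = total_bp / 1e9


def coverage(total_bases, genome_size_bp):
    """Average sequencing depth: sequenced bases divided by genome size."""
    return total_bases / genome_size_bp


# Genome size implied by the quoted ~390X coverage (an inference, not a
# value stated in the text).
QUOTED_COVERAGE = 390
implied_genome_size_gbp = total_gbp / QUOTED_COVERAGE

print(f"Illumina yield: {total_gbp:.0f} Gbp")
print(f"Genome size implied by {QUOTED_COVERAGE}X: {implied_genome_size_gbp:.2f} Gbp")
```

The same `coverage` helper can be applied to the long-read totals (42.0 Gbp of PacBio and 22.2 Gbp of ONT data) once a genome size estimate has been chosen.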
The resulting assembly, Woodchuck_1.0, consists of 8,860 contigs containing 2,737,034,741 bp, with a contig N50 of 1,094,236 bp. We compared our assembly to a recently published assembly of another woodchuck from the same species, GenBank accession GCA_901343595.1 [3]. That assembly (MONAX5) was generated entirely from Illumina reads, and it has a total length of 2,552,052,516 bp in 48,534 scaffolds, with a scaffold N50 of 892 kb and a contig N50 of 74,495 bp. The earlier assembly is thus ~185 Mbp shorter than Woodchuck_1.0.
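For readers unfamiliar with the N50 statistic used to compare the two assemblies, the following is a minimal sketch of the standard definition: the length L such that sequences of length at least L together contain at least half of all bases. The function name `n50` and the toy contig lengths are illustrative and are not taken from either assembly. The last lines recompute the size difference between the two assemblies from the totals given above.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches half of the
    combined length.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


# Toy example (made-up contig lengths, total 20): 6 + 5 = 11 >= 10.
toy_n50 = n50([2, 3, 4, 5, 6])

# Difference in total length between the two assemblies, from the text.
WOODCHUCK_1_0_BP = 2_737_034_741
MONAX5_BP = 2_552_052_516
size_difference_bp = WOODCHUCK_1_0_BP - MONAX5_BP

print(toy_n50)
print(f"{size_difference_bp / 1e6:.0f} Mbp")
```

Applied to the contig lengths of an assembly FASTA file, the same function would give the contig N50; applied to scaffold lengths, the scaffold N50.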
We aligned all contigs and scaffolds between the two assemblies and found that 3,791 scaffolds in MONAX5 were contained within longer contigs in Woodchuck_1.0, with an average identity of 99.24%. In contrast, only 84 contigs from Woodchuck_1.0 were contained in MONAX5 scaffolds, consistent with the much larger contig sizes in our assembly.
We mapped the annotation from MONAX5 to Woodchuck_1.0 using Liftoff [4]. To assign functions to the mapped transcripts, we aligned them to transcripts annotated in the Alpine marmot (M. marmota, GenBank accession GCA_001458135.1 [5]). This yielded 20,559 protein-coding genes with 28,135 transcripts (including alternative splice variants). Of these genes, 10,664 were assigned functions based on near-identical matches with the Alpine marmot annotation, and the rest were labeled as hypothetical proteins. The average transcript contains 7.9 exons.
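A summary statistic such as the mean number of exons per transcript can be derived directly from an annotation file. The sketch below assumes the annotation is in GFF3 format with `exon` features that reference their transcript through the `Parent` attribute. The function name and the toy records are hypothetical; this is not the procedure the authors report using.

```python
from collections import Counter


def mean_exons_per_transcript(gff3_lines):
    """Average number of exon features per transcript in GFF3 records.

    Transcripts are identified by the Parent attribute of each exon, so
    transcripts without any exon feature are not counted.
    """
    exon_counts = Counter()
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue
        attributes = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        # An exon shared by several transcripts lists them comma-separated.
        for parent in attributes.get("Parent", "").split(","):
            if parent:
                exon_counts[parent] += 1
    if not exon_counts:
        raise ValueError("no exon features with a Parent attribute found")
    return sum(exon_counts.values()) / len(exon_counts)


# Toy annotation (made-up records): tx1 has 3 exons, tx2 has 1 exon.
toy_gff3 = [
    "##gff-version 3",
    "ctg1\tdemo\tgene\t1\t900\t.\t+\t.\tID=gene1",
    "ctg1\tdemo\tmRNA\t1\t900\t.\t+\t.\tID=tx1;Parent=gene1",
    "ctg1\tdemo\texon\t1\t100\t.\t+\t.\tID=e1;Parent=tx1",
    "ctg1\tdemo\texon\t300\t400\t.\t+\t.\tID=e2;Parent=tx1",
    "ctg1\tdemo\texon\t800\t900\t.\t+\t.\tID=e3;Parent=tx1",
    "ctg1\tdemo\tmRNA\t1\t400\t.\t+\t.\tID=tx2;Parent=gene1",
    "ctg1\tdemo\texon\t1\t100\t.\t+\t.\tID=e4;Parent=tx2",
]
toy_mean = mean_exons_per_transcript(toy_gff3)
print(toy_mean)
```

To run it on a real annotation, the lines of an opened GFF3 file can be passed in place of `toy_gff3`.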
Data from Marmota monax is available at NCBI under BioProject PRJNA587092, including the assembly with annotation at GenBank accession WJEC00000000, and the read data in the Sequence Read Archive under the same BioProject. The assembly and annotation are also available at ftp://ftp.ccb.jhu.edu/pub/data/Groundhog.