Research Article
Revised

De novo whole-genome assembly of a wild type yeast isolate using nanopore sequencing

[version 2; peer review: 2 approved, 2 approved with reservations]
PUBLISHED 03 Aug 2018

This article is included in the Nanopore Analysis gateway.

Abstract

Background: The introduction of the MinION sequencing device by Oxford Nanopore Technologies may greatly accelerate whole genome sequencing. Nanopore sequence data offers great potential for de novo assembly of complex genomes without using other technologies. Furthermore, Nanopore data combined with other sequencing technologies is highly useful for accurate annotation of all genes in the genome. In this manuscript we used nanopore sequencing as a tool to classify yeast strains.
Methods: We compared various technical and software developments for the nanopore sequencing protocol, showing that the R9 chemistry is, as predicted, higher in quality than R7.3 chemistry. The R9 chemistry is an essential improvement for assembly of the extremely AT-rich mitochondrial genome. We double corrected assemblies from four different assemblers with PILON and assessed sequence correctness before and after PILON correction with a set of 290 Fungi genes using BUSCO.
Results: In this study, we used this new technology to sequence and de novo assemble the genome of a recently isolated ethanologenic yeast strain, and compared the results with those obtained by classical Illumina short read sequencing. This strain was originally named Candida vartiovaarae (Torulopsis vartiovaarae) based on ribosomal RNA sequencing. We show that the assembly using nanopore data is much more contiguous than the assembly using short read data. We also compared various technical and software developments for the nanopore sequencing protocol, showing that nanopore-derived assemblies provide the highest contiguity.
Conclusions: The mitochondrial and chromosomal genome sequences showed that our strain is clearly distinct from other yeast taxa and most closely related to published Cyberlindnera species. In conclusion, MinION-mediated long read sequencing can be used for high quality de novo assembly of new eukaryotic microbial genomes.

Keywords

Nanopore sequencing, de novo genome assembly, wild type yeasts, ethanologenic, Candida, Cyberlindnera

Revised Amendments from Version 1

We have generated more data with the most current library prep chemistry to increase the confidence in the highly heterozygous assembly results. Data used for assembly was filtered more stringently to facilitate the rather difficult assembly process, aiming to decrease assembly fragmentation and increase contiguity. We have included a set of three additional assemblers (TULIP, Miniasm and Smartdenovo) and compared assembler results on contiguity and completeness using statistical measurements and BUSCO gene prediction software before and after contig correction using Illumina data. The BUSCO analysis is used to highlight the biological relevance of the assemblies as well as to indicate the heterozygous diploid properties of our strain.
 
This study has been adjusted guided by the suggestions of the peer reviewers. Hans Jansen generated new data and visualized mitochondrial coverage for all used chemistries. Michael Liem performed all assemblies, assembly correction and assembly comparison, gene prediction analysis and contig comparisons including all rewriting of the manuscript and generated and/or edited figures 1, 3, 4, 5 and 6. Hence the author list order has been changed accordingly.
 
Figure changes – Fig. 1) Tick spacing has been adjusted. Fig. 3) Published mitochondrial reference is used; visualization of the new dataset and GC-content are added. Fig. 4) New figure. Fig. 5) Comparison to S. cerevisiae is removed and comparisons are based on newly generated TULIP assemblies. Fig. 6) New figure.
 
Finally, data characteristics, such as number of reads, number of bases, coverage and GC-content, are added to the previous table.

See the authors' detailed response to the review by Mile Šikić
See the authors' detailed response to the review by Hayan Lee
See the authors' detailed response to the review by Jean-Marc Aury and Istace Benjamin
See the authors' detailed response to the review by Christina A. Cuomo

Introduction

With the development of robust second generation bioethanol processes, next to the use of highly engineered Saccharomyces cerevisiae strains1,2, non-classical ethanologenic yeasts are also being considered as production organisms3,4. In particular, the ability to use both C6 and C5 C-sources and resistance to feedstock-derived inhibitors have been identified as important for the industrial applicability of different production hosts3. In our previous studies we have identified a novel ethanologenic yeast, Wickerhamomyces anomala, as a potential candidate3. Based on this research, a further screen for alternative yeast species was initiated (Punt and Omer, unpublished study). Here we describe the isolation and genomic characterization of one of these new isolates, which was typed as Candida vartiovaarae based on ribosomal RNA analysis.

With the arrival of next generation sequencing and the assemblers that can use this type of sequencing data, whole genome shotgun sequencing of completely novel organisms has become affordable and accessible. As a result, a wealth of genomic information has become available to the scientific community leading to many important discoveries. While generating whole draft genomes has become accessible, these genomes are often fragmented due to the nature of these short read technologies5. Assembling short read data into large contigs proved to be difficult because the short reads do not contain the information to span repeated structures in the genome. Approaches to sequence the ends of larger fragments partially mitigated this problem6.

The new long read platforms from Pacific Biosciences and Oxford Nanopore Technologies made it possible to obtain reads that span many kilobases7. Assemblies using this type of data are often more contiguous than assemblies based on short read data8,9.

We have employed the Oxford Nanopore Technologies MinION device to sequence genomic DNA from the isolated Candida vartiovaarae strain. The same DNA was also used to prepare a paired end library for sequencing on the Illumina HiSeq2500. The sequence data were used in various assemblers to obtain the best assemblies.

Materials and methods

Strain selection and cultivation conditions

In our previous research3, a screening approach was developed to select for potential ethanologens using selective growth on industrial feedstock hydrolysates. Based on this approach, a previously identified microflora from grass silage was screened for growth on different hydrolysates from both woody and cereal residues. From this microflora, a strain was isolated (DDNA#1) after selection on a growth medium consisting of 10% acid-pretreated corn stover hydrolysate, which was shown to be most restrictive in growth due to the presence of relatively high amounts of furanic inhibitors.

DNA purification

Cells were grown at 30°C on plates with YNB (without amino acids) medium supplemented with 0.5% glucose. Cells were scraped from plates and resuspended in 5 ml TE. High MW chromosomal DNA was isolated from yeast isolate DDNA#1 and Saccharomyces cerevisiae S288C using a Qiagen Genomic-tip 100/G column, according to the manufacturer’s instructions.

Pulsed field gel electrophoresis

In order to determine the size of intact chromosomes of DDNA#1, a BioRad CHEF Genomic DNA Plug Kit was used. Briefly, yeast cells were treated with lyticase and the resulting spheroplasts were embedded in low melting point agarose. After incubation with RNase A and Proteinase K, the agarose plugs were thoroughly washed in TE. The DNA in the agarose plugs was separated on a 0.88% agarose gel in 1xTAE buffer on a Bio-Rad CHEF DRII system. The DNA was separated in four subsequent 12 hour runs at 3V/cm; run one and two used a constant switching time of 500 seconds, and in run three and four the switching time increased from 60 seconds to 120 seconds. The gel was afterwards stained with ethidium bromide and imaged.

Genome size estimation and heterozygosity

A k-mer count analysis was done using Jellyfish10 v2.2.6 on the Illumina data. From each read pair, only the first read was used, truncated to 100 bp to avoid the lower quality part of the read. The second read was omitted from this analysis to avoid counting overlapping k-mers. Different k-mer sizes were used, ranging from k=17 to k=23. After converting the k-mer counts into a histogram format, this file was analyzed using the Genomescope11 tool, available at http://qb.cshl.edu/genomescope/ and https://github.com/schatzlab/genomescope.
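To illustrate what the counting and histogram steps compute, a minimal Python sketch follows. It is not Jellyfish itself: the function names and the toy reads are hypothetical, and counting canonical k-mers (a k-mer and its reverse complement treated as one) is an assumption, since the counting mode is not stated above.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k, max_read_length=100):
    """Count canonical k-mers in reads truncated to max_read_length bases."""
    counts = Counter()
    for read in reads:
        read = read.upper()[:max_read_length]
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def histogram(counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return dict(sorted(Counter(counts.values()).items()))


# Toy data (hypothetical); real input would be the first reads of each pair.
toy_reads = ["ACGTAC", "ACGTAC", "TTTTTT"]
toy_histogram = histogram(count_kmers(toy_reads, k=3))
```

The resulting multiplicity histogram is the kind of input that a profiling tool such as Genomescope models to estimate genome size and heterozygosity.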

Illumina library preparation, sequencing and quality control

High molecular weight DNA from both DDNA#1 and Saccharomyces cerevisiae S288C was sheared using a nebulizer (Life Technologies). The sheared DNA was used to make genomic DNA libraries using the Truseq DNA sample preparation kit, according to the manufacturer’s instructions (Illumina Inc.). In the size selection step, a band of 330–350 bp was cut out of the gel to obtain an insert length of ~270 bp. From the resulting libraries, 4.5 million fragments were sequenced in paired end reads with a read length of 150 nt on an Illumina HiSeq2500, according to the manufacturer’s instructions. HiSeq control software (HCS) v2.2.38 and real time analysis (RTA) software v1.18.61 were used. To ensure data integrity we visualized read quality distributions with FastQC12 v0.11.7 and merged overlapping paired end reads, including trimming of low quality regions, using FLASH13 v1.2.11. Only trimmed and merged reads were used as input data for both Spades14 assemblies and assembly polishing.

MinION library preparation, sequencing and quality control

The genomic DNA was sequenced using nanopore sequencing technology. First the DNA was sequenced on R7.3 flow cells. Subsequently, multiple R9 and R9.4 flow cells were used to sequence the DNA. For R7.3 sequencing runs, we prepared the library using the SQK-MAP006 kit from Oxford Nanopore Technologies. In short, high molecular weight DNA was sheared with a g-TUBE (Covaris) to an average fragment length of 20 kbp. The sheared DNA was repaired using the FFPE Repair Mix, according to the manufacturer’s instructions (New England Biolabs). After cleaning the DNA with bead extraction, using a ratio of 0.4:1 Ampure XP beads (Beckman Coulter) to DNA, the DNA ends were polished and an A overhang was added with the NEBNext End Prep Module (New England Biolabs). Then, prior to ligation, the DNA was again cleaned by extraction using a ratio of 1:1 Ampure XP beads to DNA. The adaptor and hairpin adapter were ligated using Blunt/TA Ligase Master Mix (New England Biolabs). The final library was prepared by cleaning the ligation mix using MyOne C1 beads (Invitrogen).

To prepare 2D libraries for R9 sequencing runs, we used the SQK-NSK007 kit from Oxford Nanopore Technologies. The procedure to prepare a library with this kit is largely the same as with the SQK-MAP006 kit. 1D library preparation was done with the SQK-RAD001 kit from Oxford Nanopore Technologies, which tags high molecular weight DNA using a transposase. The final library was prepared by ligation of the sequencing adapters to the tagmented fragments using the Blunt/TA Ligase Master Mix (New England Biolabs).

The prepared libraries were loaded on the MinION flow cell, which was docked on the MinION device. The MinKNOW software (v0.50.2.15 for SQK-MAP006 libraries and v1.0.5 for SQK-NSK007 and SQK-RAD001 libraries) was used to control the sequencing process and the read files were uploaded to the cloud based Metrichor EPI2ME platform for base calling. Base called reads were downloaded in fastq format. We filtered the data on quality and length, keeping only reads with an average per-read error rate of at most 10% and a length of at least 10 kbp. Only reads that met these filtering thresholds were used for assemblies and post-assembly error correction.
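The filtering criteria can be sketched in Python as follows. How the per-read error rate was actually computed is not stated, so this sketch assumes it is the mean of the per-base error probabilities derived from Phred+33 quality characters in the fastq file; the function names are hypothetical.

```python
def mean_error_rate(quality_string, offset=33):
    """Mean per-base error probability from a Phred-encoded quality string.

    Assumes Phred+33 encoding: Q = ord(char) - 33 and p = 10 ** (-Q / 10).
    """
    probabilities = [10 ** (-(ord(char) - offset) / 10) for char in quality_string]
    return sum(probabilities) / len(probabilities)


def keep_read(sequence, quality_string, max_error=0.10, min_length=10_000):
    """Apply the two thresholds: length >= 10 kbp and mean error rate <= 10%."""
    if len(sequence) < min_length:
        return False
    return mean_error_rate(quality_string) <= max_error


# Toy reads (hypothetical): 'I' encodes Q40, '#' encodes Q2.
long_good = keep_read("A" * 10_000, "I" * 10_000)
short_good = keep_read("A" * 9_999, "I" * 9_999)
long_poor = keep_read("A" * 10_000, "#" * 10_000)
```

Averaging error probabilities rather than Phred scores is a deliberate choice in this sketch, because the mean of Phred scores underweights low-quality bases.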

Genome assembly and assembly correction

The sequence data from the Illumina platform were assembled using Spades v3.6.0. We followed a two-branch assembly strategy, using either exclusively Illumina data or a hybrid approach combining both Illumina and nanopore data sets.

Four different assemblers were used to generate contigs exclusively based on nanopore data: Canu15 v1.3, Miniasm16 v0.2, TULIP17 v0.4 and Smartdenovo18 v1.07. These assemblers perform all vs. all alignments on filtered nanopore data to generate the final contigs, with the exception of TULIP, which aligns reads to a set of random 1,000 bp seed sequences comprising 0.5 times the estimated ~12 Mbp genome size. Contigs of all assemblers except Canu were post-assembly corrected using Racon19; Canu contigs were excluded because Canu contains an integrated self-correction procedure prior to assembly. To obtain optimum sequence correctness, the resulting contigs of these four assemblers were polished with Illumina data using PILON20 v1.18 in two iterations.
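As a worked example of the seed set size, the short Python sketch below assumes that "0.5 times the estimated genome size" refers to the total length of all seeds combined; that reading, and the function name, are assumptions rather than a description of TULIP's internals.

```python
def seed_count(genome_size_bp, seed_length_bp=1_000, genome_fraction=0.5):
    """Number of seeds whose combined length equals genome_fraction of the genome."""
    return round(genome_size_bp * genome_fraction / seed_length_bp)


# ~12 Mbp estimated genome size, 1,000 bp seeds, 0.5x of the genome in seeds.
n_seeds = seed_count(12_000_000)
```

Under that assumption the seed set would contain about 6,000 sequences.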

The sequencing data, including the final assembly, has been submitted to the European Nucleotide Archive and can be accessed at http://www.ebi.ac.uk/ena/data/view/PRJEB19912.

Genome assembly assessment based on gene prediction

As successful sequence polishing plausibly improves the accuracy of gene prediction, we assessed both assembly quality and PILON correction effects using BUSCO21 v3.0.2. We assessed our nanopore exclusive assemblies both before and after PILON correction using the lineage database fungi_odb9, which contains 290 genes. BUSCO assessments of the Spades contigs are compared with the post-PILON assessments of the nanopore derived contigs, since Spades contigs are based on Illumina data and do not require a post-assembly PILON correction. BUSCO classifies each gene in a genomic assembly as partial, single copy, double copy, or completely absent.

Full genome comparison

From 26S ribosomal RNA sequences available in the nucleotide database, Chen et al.22 have constructed a phylogenetic tree. From that phylogenetic tree we have observed that the closest relative for which whole genome sequences are available is Cyberlindnera jadinii. To compare our draft genome assembly to this yeast species, we retrieved assemblies of two Cyberlindnera jadinii strains, namely NBRC 0988 (GenBank accession number, DG000077.1) and CBS1600 (GenBank accession number, CDQK00000000.1). We also used Saccharomyces cerevisiae S288C (GenBank accession number, GCA_000146045.2) in this comparison. We aligned those assemblies to the corrected draft assembly of our strain using MUMmer’s alignment generator (NUCmer23 v3.1). NUCmer’s output was filtered and the filtered results were passed to MUMmerplot, generating full-genome visualizations for the pairs of different yeast species. Since Spades assembly lengths are roughly twice the estimated genome size, we additionally evaluated alignments between Spades hybrid and TULIP contigs. Alignments were performed using BWA-mem24 v0.7.15 with -x ont2d settings and visualized using the genome viewer Tablet25 v1.17.08.17.

Read mapping to mitochondrial genome

Reads generated on the Illumina platform were aligned to the published Candida vartiovaarae mitochondrial genome (Genbank accession number, KC993190.1) using Bowtie226 v2.2.5. Reads generated on the MinION platform were aligned using Minimap227 v2.3-r546-dirty. Resulting bam files were sorted and viewed in IGV viewer v2.3.

Results and discussion

Pure cultures of candidate ethanologenic yeasts

From a screen on 10% acid-pretreated corn stover hydrolysate, about 70 individual clones were obtained, only five of which were able to grow well on purely synthetic YNB-based medium. To determine the taxonomic status of these clones, chromosomal DNA was isolated and used for PCR amplification of the ribosomal ITS sequence using ITS-specific primers28 (ITS1 and ITS4).

BLAST analysis of the ITS sequences of all five isolates revealed 100% identity to Candida vartiovaarae (Torulopsis vartiovaarae; NCBI accession number KY102493).

All five isolates were grown on different C-sources and showed growth on glucose, mannose, cellobiose, xylose and glycerol, while growth on L-arabinose was variable. No significant growth was found on galactose and rhamnose. Good growth (on glucose) occurred between 20–30°C, at pH3-7 (optimum 25°C, pH4-5). Based on the results, we concluded that all five isolates originated from a single source in the grass silage sample. Subsequent experiments were therefore carried out with a single isolate now named DDNA#1.

Pulsed field gel electrophoresis

As a further means to validate our assembled contigs and determine if they match the actual chromosome length, we have separated the chromosomes on an agarose gel using pulsed field gel electrophoresis. The gel image in Figure 1 shows five bands that represent the chromosomes of this yeast strain. The smallest band has a length that corresponds to the length of the mitochondrial genome (33 kbp). Additional fragments of 450, 1200, and 1500 kbp are also found. The intensity of the band that runs above the 2200 kbp marker band suggests that it actually contains more than one distinct fragment. To make the genome size fit to the estimate derived from the assembly and k-mer analysis (~12.5 Mbp), three ~3 Mbp chromosomes should be postulated. The uncertainty in chromosome size estimate based on pulsed field electrophoresis gels is high because of the large chromosome size and the fact that it is difficult to determine if more than one fragment is present in the gel at a given position. Our conclusion that the top band represents three or more chromosomes is in agreement with the genome sequences of two related C. jadinii strains, namely CBS1600 and NBRC 0988.
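The arithmetic behind the postulated three ~3 Mbp chromosomes can be written out in a few lines of Python. The band sizes and the ~12.5 Mbp estimate are taken from the text; leaving the 33 kbp mitochondrial band out of the nuclear total is an assumption of this sketch.

```python
# Sizes in kbp, as read from the gel and the genome size estimate in the text.
ESTIMATED_GENOME_KBP = 12_500
resolved_nuclear_bands_kbp = [450, 1_200, 1_500]  # mitochondrial band (33 kbp) excluded

# Sequence that must reside in the unresolved top band.
remaining_kbp = ESTIMATED_GENOME_KBP - sum(resolved_nuclear_bands_kbp)

# Average size if the top band holds three co-migrating chromosomes.
per_chromosome_kbp = remaining_kbp / 3
```

Under these assumptions roughly 9.35 Mbp is unaccounted for by the resolved bands, which corresponds to about 3.1 Mbp per chromosome if three chromosomes co-migrate in the top band.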


Figure 1. Pulsed field gel electrophoresis of Candida vartiovaarae DDNA#1 chromosomes.

In lane 1, the chromosomes of Saccharomyces cerevisiae were loaded as a marker. Sizes of the chromosomes in the marker lane are indicated. In lane 2, the chromosomes of Candida vartiovaarae DDNA#1 were loaded.

Genome size estimation and heterozygosity

The Illumina sequence data of our DDNA#1 isolate were submitted to the Genomescope software package to analyze the k-mer count distribution, using k-mer size = 19 at an average coverage of 28.0x (Figure 2). The ‘haploid’ genome is predicted to contribute to the most abundant fraction, which corresponds with the second peak (dotted line) in the plot (Figure 2). The first peak corresponds to sequence occurring exactly half as frequently as the main peak, so these k-mers plausibly derive from positions where the two haplotypes differ. Due to the nature of k-mer counting, this peak often appears higher than the main peak, because a single SNP will affect all k-mers overlapping that position. The first two peaks contain about 10 Mbp of sequence. Additional peaks at higher coverage indicate duplications and repetitive DNA that are quite abundant, but correspond with less sequence than the second peak. Genomescope estimated a haploid genome size of between 12.00 and 12.01 Mbp. Additionally, Genomescope revealed 3.6% heterozygosity across the entire genome, indicating that the genome of C. vartiovaarae is strongly heterozygous (Table 1). A likely possibility is that areas in the genome are replicated and slightly diverged in sequence. This could also explain why we see a large tail of repeated k-mers (Figure 2). It could also explain why our assembly still remained fragmented despite the relatively large amount of nanopore data that was used in the assembly.
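The statement that a single SNP affects all k-mers overlapping that position can be checked with a small Python sketch. The two toy haplotypes and the function names are hypothetical and serve only to illustrate the counting argument.

```python
def kmers_covering(position, sequence_length, k):
    """Number of k-mers in a sequence that overlap a given 0-based position."""
    first_start = max(0, position - k + 1)
    last_start = min(position, sequence_length - k)
    return max(0, last_start - first_start + 1)


def differing_kmers(haplotype_a, haplotype_b, k):
    """Count k-mer start positions at which two equal-length haplotypes differ."""
    return sum(
        haplotype_a[i:i + k] != haplotype_b[i:i + k]
        for i in range(len(haplotype_a) - k + 1)
    )


# Two toy 100 bp haplotypes that differ by one SNP at position 50 (G -> T).
haplotype_a = "ACGT" * 25
haplotype_b = haplotype_a[:50] + "T" + haplotype_a[51:]

affected = differing_kmers(haplotype_a, haplotype_b, k=19)
expected = kmers_covering(50, len(haplotype_a), k=19)
```

With k = 19, one isolated SNP changes 19 k-mers on each haplotype, so every heterozygous site contributes many half-coverage k-mers to the first peak.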


Figure 2. Genome size estimation generated by Genomescope, providing a k-mer analysis (k = 19, from Jellyfish) to estimate haploid genome size, fraction of heterozygosity and coverage.

Genomescope attempts to find k-mer count peaks, low and high coverage peaks indicating hetero- and homozygosity. (A) We find ~13× and ~28× coverage for hetero- and homozygous fractions in our dataset. Exact peak positions are determined with a log transformation. Evaluating the slope between coverage points reveals the peak positions indicating hetero- and homozygosity, for lower and higher coverage, respectively.

Table 1. Most important metrics from Genomescope.

k = 19; k-mer coverage = 28.0

Property                      Min           Max
Heterozygosity (%)            3.64          3.65
Genome Haploid Length (bp)    11,995,570    12,010,675
Genome Repeat Length (bp)     2,179,917     2,182,662
Genome Unique Length (bp)     9,815,653     9,828,014
Model Fit (%)                 98.26         98.89
Read Error Rate (%)           0.13          0.13

Illumina and MinION de novo genome assembly

We took six approaches to assemble the genome of DDNA#1, five assemblies based on sequencing data from a single platform (either Illumina or nanopore) and one hybrid assembly. The first approach used reads exclusively produced by the Illumina platform. After merging paired end reads we obtained ~1.7 Gbp of ~240 bp reads. Contigs generated by Spades remained short and the overall assembly was heavily fragmented. The N50 of this assembly was only ~4.3 kbp, its longest contig ~35 kbp. Spades generated 10,121 contigs and the entire assembly length was nearly twice the estimated ~12 Mbp haploid genome size. We also assembled Saccharomyces cerevisiae S288C using a similar short read dataset that was made and sequenced in parallel. Here we obtained an assembly that consisted of 768 contigs with a longer N50 of 124 kbp.

Comparing the Illumina-only assemblies of Saccharomyces cerevisiae and DDNA#1 highlights that Spades clearly struggles to reconstruct the genome of our isolate, possibly due to complex SNP arrangements. From these results we conclude that, even under high coverage conditions, ~240 bp reads do not provide sufficient power to resolve complex SNP distributions for highly heterozygous genomes. This illustrates the necessity of increased read length to fully reconstruct complex genomic structures such as those found in DDNA#1.

Secondly, we used Spades to generate a hybrid assembly that takes both Illumina and nanopore data as input. We used ~1.7 Gbp of Illumina data and ~208 Mbp of nanopore data. This hybrid approach resulted in an N50 of ~379 kbp, a longest contig of ~1.1 Mbp, and a total of 653 contigs. Although the assembly is still relatively fragmented, it is interesting that its length is similar to that of the assembly exclusively based on Illumina data. The improvement of assembly statistics strongly indicates the positive effect of longer reads in resolving complicated genomes.

Hereafter, the four remaining approaches are all based on data solely generated by the Oxford Nanopore Technologies platform. Assembly lengths in particular are fairly similar between all four assemblies and all approximate the estimated ~12 Mbp haploid genome size. However, Miniasm, TULIP and Smartdenovo outperform Canu on N50, number of contigs and longest contig (Table 2). The lengths of the longest contigs from both Smartdenovo and TULIP (~2.8 Mbp) correspond to the ~3 Mbp chromosomes suggested by pulsed field gel electrophoresis on intact chromosomal DNA (Figure 1). This suggests that both Smartdenovo and TULIP were able to fully reconstruct one of the three largest chromosomes of our isolate. Smartdenovo yields the lowest number of contigs, which is mainly due to a filtering step that removes very short contigs (the shortest contig lengths are 1,716 bp and 73,332 bp for TULIP and Smartdenovo, respectively). TULIP, however, generates the highest contiguity, with N25 and N50 both around 1.6 Mbp, compared to 1.4 Mbp and 900 kbp, respectively, for Smartdenovo. Hence, based on contiguity, we prefer to take the TULIP result as the final assembly.
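For readers unfamiliar with the contiguity metrics used here, the following Python sketch computes N25, N50 and N75 in their usual form: the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches the given fraction of the total. The sketch assumes the total is the assembly length rather than the estimated genome size, and the toy contig lengths are hypothetical.

```python
def nx(contig_lengths, fraction):
    """Return the Nx value, e.g. fraction=0.5 for N50.

    Contigs are sorted from longest to shortest; the result is the length
    of the contig at which the running total first reaches
    fraction * total assembly length.
    """
    ordered = sorted(contig_lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    threshold = fraction * sum(ordered)
    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= threshold:
            return length


# Toy assembly (hypothetical lengths in kbp), total 290.
toy_contigs = [80, 70, 50, 40, 30, 20]
n25 = nx(toy_contigs, 0.25)
n50 = nx(toy_contigs, 0.50)
n75 = nx(toy_contigs, 0.75)
```

An N25 that is close to the N50, as reported for TULIP, indicates that several of the longest contigs are of similar size.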

Table 2. Data characteristics and assembly statistics.

Assemblers            Canu          Miniasm       TULIP         Smartdenovo   Spades hybrid      Spades
Data type             ONT           ONT           ONT           ONT           ONT and Illumina   Illumina
Reads (#)             11,344        11,344        11,344        11,344        11,344             8,628,787
Coverage (x)          17            17            17            17            17                 135
GC content (%)        46            46            46            46            46                 47
Bases (#)             208,357,153   208,357,153   208,357,153   208,357,153   208,357,153        1,688,824,952
Contigs (#)           34            25            28            20            653                10,121
Assembly length (bp)  11,968,989    12,072,133    11,325,084    11,732,656    22,772,746         22,356,011
Genome size (Mbp)     12.5          12.5          12.5          12.5          12.5               12.5
N25 (bp)              959,647       1,361,451     1,591,600     1,429,838     824,043            7,876
N50 (bp)              805,206       1,020,131     1,586,208     902,730       379,588            4,318
N75 (bp)              456,000       506,710       619,623       456,270       200,675            2,041
Max length (bp)       1,430,409     1,569,347     2,792,203     2,800,024     1,101,756          34,707
Mean length (bp)      352,029       482,885       404,467       586,632       34,874             2,208
Min length (bp)       4,727         8,316         1,716         73,332        128                128

It is clear from these results that assemblies based on exclusively nanopore data achieve the most contiguous assemblies, as has been shown previously8,9.
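The coverage and mean contig length values in Table 2 follow directly from the other rows, which the Python sketch below reproduces. It assumes that coverage was computed as sequenced bases divided by the 12.5 Mbp genome size listed in the table; the variable and function names are hypothetical.

```python
GENOME_SIZE_BP = 12_500_000  # "Genome size (Mbp)" row of Table 2

# Sequenced bases per platform, taken from the "Bases (#)" row.
bases = {"ONT": 208_357_153, "Illumina": 1_688_824_952}


def coverage(total_bases, genome_size_bp=GENOME_SIZE_BP):
    """Average depth: sequenced bases divided by genome size."""
    return total_bases / genome_size_bp


def mean_contig_length(assembly_length_bp, contig_count):
    """Mean contig length: assembly length divided by number of contigs."""
    return assembly_length_bp / contig_count


ont_coverage = round(coverage(bases["ONT"]))
illumina_coverage = round(coverage(bases["Illumina"]))
canu_mean_length = round(mean_contig_length(11_968_989, 34))
```

The same check for the Illumina-only Spades assembly supports reading the flattened contig count "10.121" as 10,121 contigs, in agreement with the text.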

We also used the nanopore datasets made with the R7.3 and R9 chemistry separately in the Canu assembler. The most notable difference between these assemblies is found in the mitochondrial genome. Only 16 kbp of this 33 kbp genome could be assembled with the R7.3 data, whereas the R9 assembly contained a complete mitochondrial genome (Genbank accession number, KC993190.1). The mitochondrial genome has a very low GC content (21%) and in the extragenic regions more A and T homopolymers are found. Very few R7.3 reads mapped to this region, but in the R9 dataset there are many more reads that represent this region (Figure 3). It has been shown that the R7.3 data especially has a bias against A and T homopolymers. Although this bias is still not fully absent29,30, it is reduced for R9 chemistry, indicating a technical improvement and suggesting improved genomic reconstruction even for low complexity regions. Both after long read self-correction using Canu and after post-assembly correction using Racon, the contig sequences still contain errors15. We have used PILON and the complementary Illumina data from this strain to correct the assembled contigs twice. Homopolymer stretches are particularly difficult to base call accurately due to their low complexity, and their lengths are usually underestimated. PILON correction leads to a minor assembly length increase, since the corrected homopolymer lengths add to the final assembly size.
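GC content and homopolymer stretches, the two sequence properties discussed above, can be measured with a short Python sketch. The function names and the toy sequence are hypothetical; this is an illustration and not part of the analysis pipeline described in the Methods.

```python
from itertools import groupby


def gc_content(sequence):
    """Fraction of G and C bases in a sequence."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def homopolymer_runs(sequence, min_length=4):
    """List (base, start, length) for every run of identical bases >= min_length."""
    runs = []
    start = 0
    for base, group in groupby(sequence.upper()):
        length = len(list(group))
        if length >= min_length:
            runs.append((base, start, length))
        start += length
    return runs


# Toy AT-rich sequence (hypothetical) with one A run and one T run.
toy_sequence = "GCAAAAATTTTTTCGATAT"
toy_gc = gc_content(toy_sequence)
toy_runs = homopolymer_runs(toy_sequence)
```

If a base caller reports such runs one or two bases too short, restoring their true length during polishing lengthens the contig slightly, which is consistent with the small size increase seen after PILON correction.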


Figure 3. Coverage plot of the Candida vartiovaarae DDNA#1 mitochondrial genome.

Reads from both the Illumina and the nanopore platform were aligned to the Candida vartiovaarae mitochondrial genome (Genbank accession number, KC993190.1) to show the difference in coverage between the different platforms and chemistry versions.

Genome assembly assessment based on gene prediction

BUSCO identifies the majority of genes from the fungi_odb9 database in the nanopore derived assemblies. Before PILON correction, the number of single copy genes identified ranges from 145 to 188, between 45 and 57 genes are partially recognized, and 53 to 92 genes are classified as absent (Figure 4). After PILON correction nearly all genes are identified as single copies in the results from all four assemblers, giving support for the suggestion (based on genome size) that these assemblers yielded haploid genomes. Interestingly, gene identification on Spades contigs, particularly for our hybrid assembly, identified 269 genes as double copy genes. Together with assembly lengths of twice the estimated genome size, these results strongly suggest that Spades was able to separately assemble both haplotypes forming a diploid genome under hybrid conditions. Only 100 and 67 genes are identified as double and single copy genes, respectively, for the Illumina exclusive assembly, again indicating the necessity of long read data to maximally reconstruct highly heterozygous genomes.
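To put the reported gene counts in proportion, the Python sketch below expresses them as percentages of the 290 genes in the lineage set. The counts are taken from the text above; the function name and dictionary layout are hypothetical.

```python
TOTAL_GENES = 290  # size of the fungal lineage set used with BUSCO


def percent(count, total=TOTAL_GENES):
    """Share of the lineage set, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Counts reported in the text for the two Spades assemblies.
spades_hybrid_double = percent(269)
illumina_only = {"double": percent(100), "single": percent(67)}
illumina_only_complete = percent(100 + 67)
```

On these numbers, about 93% of the genes are found in two copies in the hybrid assembly, whereas fewer than 60% are found as complete single or double copy genes in the Illumina-only assembly.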


Figure 4. BUSCO genomic assembly assessment using the fungi_odb9 database.

Shown on the X-axis are the assemblers used in this study, including a hybrid assembly approach performed by Spades. Shown on the Y-axis are the fungi_odb9 gene counts identified by BUSCO. Dark and light coloring shades indicate before and after PILON correction per classification type, respectively.

Genome comparison

We have compared the assembled contigs of our C. vartiovaarae isolate DDNA#1 to yeast genome sequences that are already deposited in the nucleotide database. Comparison of our yeast strain with the well characterized S. cerevisiae assembly showed negligible genomic similarity. From 26S ribosomal RNA sequences available in the nucleotide database, Chen et al.22 have constructed a phylogenetic tree. The closest relatives for which whole genome sequences are available are C. jadinii strains CBS1600 and NBRC 0988. An initial comparison between CBS1600 and NBRC 0988 revealed that these two strains show high homology (Figure 5A). The genomic similarity between our strain and C. jadinii strains CBS1600 and NBRC 0988 is much lower (Figure 5B and Figure 5C, respectively). Comparison of the assemblies exclusively based on nanopore data with the Spades hybrid assembly strongly suggests that our strain is diploid, at least to a partial extent. At nearly every position on >90% of the TULIP assembly length a Spades hybrid contig is aligned. Figure 6 shows the longest and the third longest TULIP contig (~2.8 and ~1.6 Mbp, respectively) with all Spades hybrid contigs aligned to them. For TULIP contigs sorted on length we observe this double coverage behavior for contigs down to ~84 kbp. Shorter TULIP contigs tend to be less consistently double covered or even lack coverage by a Spades hybrid contig altogether. In conclusion, these data show that wild type yeast strains are very heterogeneous, despite a high similarity based on ribosomal RNA ITS sequences. Therefore, the data suggest that nanopore sequencing is an essential new tool to classify yeast strains.


Figure 5. Full genome comparisons between different yeast species.

Dashed lines indicate contigs (start and stop positions) and the area between dashed lines indicates the contig size. Blue and yellow dots are hits in reverse and forward orientation, respectively. Diagonal lines indicate sequence and synteny conservation across species. (A) Comparison between Cyberlindnera jadinii strains NBRC 0988 (vertical axis) and CBS1600 (horizontal axis) with 8 kbp as minimal hit length. (B) Comparison between Candida vartiovaarae isolate DDNA#1 (vertical axis) and Cyberlindnera jadinii strain CBS1600 (horizontal axis) with 100 bp as minimal hit length. (C) Comparison between Candida vartiovaarae isolate DDNA#1 (vertical axis) and Cyberlindnera jadinii strain NBRC 0988 (horizontal axis) with 100 bp as minimal hit length.


Figure 6. Tablet visualization of Spades hybrid contigs aligned to TULIP contigs.

The Spades hybrid contigs aligned against the longest TULIP contig (~2.8 Mbp) and the third longest TULIP contig (~1.6 Mbp). White horizontal lines indicate coverage boundaries and show that most regions on the TULIP contigs are covered twice. Alignment gaps arise because the heavily fragmented Spades hybrid contigs are aligned to contiguous TULIP contigs. Visualization is based on coverage overview settings in Tablet.

how to cite this article
Liem M, Jansen HJ, Dirks RP et al. De novo whole-genome assembly of a wild type yeast isolate using nanopore sequencing [version 2; peer review: 2 approved, 2 approved with reservations]. F1000Research 2018, 6:618 (https://doi.org/10.12688/f1000research.11146.2)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Key to reviewer statuses:
Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.
Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.
Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.
Version 2
PUBLISHED 03 Aug 2018
Revised
Reviewer Report 25 Oct 2018
Jean-Marc Aury, Genoscope, Institut de biologie François-Jacob, Commissariat à l’Energie Atomique (CEA), Université Paris-Saclay, Evry, F-91057, France 
Istace Benjamin, Genoscope, Institut de biologie François-Jacob, Commissariat à l’Energie Atomique (CEA), Université Paris-Saclay, Evry, F-91057, France 
Approved with Reservations
The authors submitted a revised version of their manuscript and took into account my main concerns. However, I still have some minor issues to share.

1) The Tulip assembler requires a set of seeds as input, but ... Continue reading
HOW TO CITE THIS REPORT
Aury JM and Benjamin I. Reviewer Report For: De novo whole-genome assembly of a wild type yeast isolate using nanopore sequencing [version 2; peer review: 2 approved, 2 approved with reservations]. F1000Research 2018, 6:618 (https://doi.org/10.5256/f1000research.16938.r36802)
Reviewer Report 24 Aug 2018
Mile Šikić, Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia 
Approved
The authors have addressed my concerns sufficiently to ... Continue reading
HOW TO CITE THIS REPORT
Šikić M. Reviewer Report For: De novo whole-genome assembly of a wild type yeast isolate using nanopore sequencing [version 2; peer review: 2 approved, 2 approved with reservations]. F1000Research 2018, 6:618 (https://doi.org/10.5256/f1000research.16938.r36799)
Reviewer Report 07 Aug 2018
Christina A. Cuomo, Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA 
Approved
The authors have addressed the comments in my initial review and the genome assemblies are now public. I have a few minor comments on this revised version. In Table 1, I would recommend removing the top row and instead in ... Continue reading
HOW TO CITE THIS REPORT
Cuomo CA. Reviewer Report For: De novo whole-genome assembly of a wild type yeast isolate using nanopore sequencing [version 2; peer review: 2 approved, 2 approved with reservations]. F1000Research 2018, 6:618 (https://doi.org/10.5256/f1000research.16938.r36800)
Version 1
PUBLISHED 03 May 2017
Reviewer Report 27 Jul 2017
Hayan Lee, Department of Genetics, School of Medicine, Stanford University, California, CA, USA 
Approved with Reservations
Jansen et al. used Oxford Nanopore Technology with other short read sequencing technology, HiSeq 2500, to perform high-quality de novo genome assembly and classify yeast strain isolates, Candida vartiovaarae DDNA#1 from Saccharomyces cerevisiae S288C and Cyberlindnera jadinii CBS1600/NBRC 0988. They also exploited two versions ... Continue reading
HOW TO CITE THIS REPORT
Lee H. Reviewer Report For: De novo whole-genome assembly of a wild type yeast isolate using nanopore sequencing [version 2; peer review: 2 approved, 2 approved with reservations]. F1000Research 2018, 6:618 (https://doi.org/10.5256/f1000research.12025.r23807)
  • Author Response 03 Aug 2018
    Michael Liem, Institute of Biology, Leiden University, Leiden, 2300 RA, The Netherlands
    03 Aug 2018
    Author Response
    Using similar short read data, N50 of DDNA#1 is 2.2kbp and that of S277C was 124Kbp. Probably authors want to perform repeat analysis for both strains to further study what ... Continue reading
Reviewer Report 17 Jul 2017
Christina A. Cuomo, Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA 
Approved with Reservations
This report by Jansen et al. describes a comparison of de novo assemblies generated using Illumina or Oxford Nanopore sequence for the yeast Candida vartiovaarae. The sequenced isolate was collected from a screen for new ethanologenic yeast species. Genomic DNA was ... Continue reading
HOW TO CITE THIS REPORT
Cuomo CA. Reviewer Report For: De novo whole-genome assembly of a wild type yeast isolate using nanopore sequencing [version 2; peer review: 2 approved, 2 approved with reservations]. F1000Research 2018, 6:618 (https://doi.org/10.5256/f1000research.12025.r24005)
  • Author Response 03 Aug 2018
    Michael Liem, Institute of Biology, Leiden University, Leiden, 2300 RA, The Netherlands
    03 Aug 2018
    Author Response
    1. In addition to genome size, the major factors that can influence the outcome of a de novo assembly are the repetitive sequence content, GC content, and level of heterozygosity. ... Continue reading
Reviewer Report 07 Jul 2017
Jean-Marc Aury, Genoscope, Institut de biologie François-Jacob, Commissariat à l’Energie Atomique (CEA), Université Paris-Saclay, Evry, F-91057, France 
Istace Benjamin, Genoscope, Institut de biologie François-Jacob, Commissariat à l’Energie Atomique (CEA), Université Paris-Saclay, Evry, F-91057, France 
Approved with Reservations
We read the manuscript by Jansen et al. titled “De novo whole-genome assembly of a wild type yeast isolate using Nanopore sequencing” with great interest. Authors describe their strategy to sequence and assemble a yeast strain using different methodologies: a ... Continue reading
HOW TO CITE THIS REPORT
Aury JM and Benjamin I. Reviewer Report For: De novo whole-genome assembly of a wild type yeast isolate using nanopore sequencing [version 2; peer review: 2 approved, 2 approved with reservations]. F1000Research 2018, 6:618 (https://doi.org/10.5256/f1000research.12025.r23808)
  • Author Response 03 Aug 2018
    Michael Liem, Institute of Biology, Leiden University, Leiden, 2300 RA, The Netherlands
    03 Aug 2018
    Author Response
    1. It seems that the high level of polymorphism complicates the de novo assembly. If some regions are heterozygous, it should lead to a higher than expected assembly size. We ... Continue reading
Reviewer Report 27 Jun 2017
Mile Šikić, Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia 
Approved with Reservations
The authors presented de novo whole-genome assembly of a wild type yeast isolate using nanopore sequencing. They tried three different approaches to assemble the genome: using Illumina reads only, using both Illumina and nanopore reads in a hybrid approach, and ... Continue reading
HOW TO CITE THIS REPORT
Šikić M. Reviewer Report For: De novo whole-genome assembly of a wild type yeast isolate using nanopore sequencing [version 2; peer review: 2 approved, 2 approved with reservations]. F1000Research 2018, 6:618 (https://doi.org/10.5256/f1000research.12025.r23377)
  • Author Response 03 Aug 2018
    Michael Liem, Institute of Biology, Leiden University, Leiden, 2300 RA, The Netherlands
    03 Aug 2018
    Author Response
    Using minimap+ miniasm assembler in combination with Racon consensus tool and PILON
    • Thank you for your suggestion, this strategy is now included in our study.
     
    Using Canu ... Continue reading