F1000Research

2046-1402

F1000 Research Limited

London, UK

10.12688/f1000research.11859.2

Data Note

Articles

Genomics

Plant Genomes & Evolution

Draft genome sequencing of the sugarcane hybrid SP80-3280

[version 2; peer review: 2 approved]

Diego Mauricio Riaño-Pachón (a, 1, 2), https://orcid.org/0000-0001-9803-3465. Roles: Conceptualization, Formal Analysis, Methodology, Resources, Writing – Original Draft Preparation, Writing – Review & Editing

Lucia Mattiello (2, 3). Roles: Conceptualization, Methodology, Resources, Writing – Review & Editing

1 Current address: Laboratory of Regulatory Systems Biology, Department of Biochemistry, Institute of Chemistry, University of São Paulo, São Paulo, SP, Brazil

2 Brazilian Bioethanol Science and Technology Laboratory (CTBE), Brazilian Center for Research in Energy and Materials (CNPEM), Campinas, SP, Brazil

3 Current address: Functional Genome Laboratory, Department of Genetics, Evolution and Bioagents, Institute of Biology, State University of Campinas, Campinas, SP, Brazil

a diriano@gmail.com

No competing interests were disclosed.

3 7 2017

2017

861

29 6 2017

2017

This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Sugarcane commercial cultivar SP80-3280 has been used as a model for genomic analyses in Brazil. Here we present a draft genome sequence employing Illumina TruSeq Synthetic Long reads. The dataset is available from NCBI BioProject with accession PRJNA272769.

Keywords: sugarcane, long reads, polyploid, genomics

Centro Nacional de Processamento de Alto Desempenho em São Paulo

UNICAMP/FINEP-MCT

Fundação de Amparo à Pesquisa do Estado de São Paulo

2012/23345-0

Brazilian Bioethanol Science and Technology Laboratory

This work was supported by institutional funds from CTBE/CNPEM to DMRP and a Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) grant to LM (2012/23345-0). The research was developed with support from CENAPAD-SP (Centro Nacional de Processamento de Alto Desempenho em São Paulo), project UNICAMP/FINEP-MCT.

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Revised Amendments from Version 1

We fixed some spelling mistakes and added information and links about the genome annotation of sugarcane cultivar SP80-3280.

Introduction

Sugarcane is an economically important crop used as a source of sugar, ethanol and electricity ¹. Sugarcane has a haploid genome of ~1 Gbp; however, modern sugarcane cultivars are polyploids derived from interspecific hybridization between Saccharum officinarum L. and S. spontaneum L., reaching up to 130 chromosomes distributed among ~12 homo(eo)logous groups ^{2,3}, with a total genome size reaching 10 Gbp ⁴. This complex genome structure has hampered genome sequencing, assembly and annotation. Partial genomic sequences are available ^{5–8}, as well as transcriptome sequences ^{9–11}, but there are no whole genome assemblies available to date. Here we used the Illumina TruSeq Synthetic Long Read sequencing technology to survey the genome of the polyploid cultivar SP80-3280. The generated long reads, their assembly and the genome annotation have been made public and will provide useful information for functional genomics studies.

Materials and methods

Leaf rolls of greenhouse-grown, two-month-old plants of sugarcane cultivar SP80-3280 (provided by Centro de Tecnologia Canavieira, Piracicaba, São Paulo) were collected and immediately frozen in liquid nitrogen. The plant tissue was ground to a fine powder, and high molecular weight DNA was extracted from 100 mg of fresh frozen tissue using CTAB (Sigma-Aldrich, USA) and chloroform:isoamyl alcohol (Sigma-Aldrich, USA) as previously described ¹². A total of 6 µg of DNA was sent to Illumina (CA, USA) for sequencing using TruSeq Synthetic Long Read technology ¹³ through their FastTrack Sequencing Service. Sequencing was performed on an Illumina HiSeq2000 system using paired-end chemistry.

Nine long read libraries, each yielding approximately 600 Mbp, were generated, giving an estimated coverage of between 4x and 5x of the monoploid genome. A total of 1,378,917 reads longer than 1.5 kbp, amounting to 5,642,855,018 bases, were generated. The underlying 1,966,604,928 short reads amount to 393,320,985,600 bp, which translates to an estimated coverage of 393x of the haploid genome. The maximum read length was 20,918 bp, with 36% of the reads being longer than 4.5 kbp. Possible contaminants were removed by comparing the long reads against NCBI’s nucleotide database using BLAST ¹⁴ and keeping only those with best hits against Viridiplantae, resulting in 1,224,061 long reads useful for assembly. Prior to assembly, long reads originating from the mitochondrial (NC_008360.1) and chloroplast (NC_005878.2) genomes were excluded using mirabait (http://mira-assembler.sourceforge.net/).

Reads longer than 1.5 kbp were assembled using Celera’s WGS Assembler v8.2 ¹⁵ with parameters similar to those previously described ¹³ (‘unitiger=bogart, merSize=31, ovlMinLen=100’), except that the error parameters (ovlErrorRate, cnsErrorRate, cgwErrorRate, utgGraphErrorRate, utgGraphErrorLimit, utgMergeErrorRate and utgMergeErrorLimit) were left at their default settings.
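As a rough check on the read statistics reported above, the coverage arithmetic can be reproduced in a few lines. The sketch below is illustrative only and is not part of the authors' pipeline. It assumes a haploid genome size of exactly 1 Gbp (the text gives ~1 Gbp), and all names in it are hypothetical.

```python
# Figures quoted in the text.
HAPLOID_GENOME_BP = 1_000_000_000   # assumption: exactly 1 Gbp (text says ~1 Gbp)
SHORT_READS = 1_966_604_928
SHORT_READ_BASES = 393_320_985_600
LONG_READS = 1_378_917
LONG_READ_BASES = 5_642_855_018
LONG_READS_KEPT = 1_224_061         # long reads retained after the BLAST filter


def fold_coverage(total_bases, genome_size_bp):
    """Fold coverage is the number of sequenced bases divided by genome size."""
    return total_bases / genome_size_bp


# Short-read coverage of the haploid genome.
short_read_coverage = fold_coverage(SHORT_READ_BASES, HAPLOID_GENOME_BP)

# Average number of bases per counted short read.
bases_per_short_read = SHORT_READ_BASES / SHORT_READS

# Mean length of the synthetic long reads.
mean_long_read_bp = LONG_READ_BASES / LONG_READS

# Fraction of long reads kept after contaminant filtering.
fraction_kept = LONG_READS_KEPT / LONG_READS

print(f"short-read coverage: {short_read_coverage:.0f}x")
print(f"bases per short read: {bases_per_short_read:.0f}")
print(f"mean long-read length: {mean_long_read_bp:.0f} bp")
print(f"long reads kept after filtering: {fraction_kept:.1%}")
```

Under the 1 Gbp assumption the short-read figure reproduces the 393x stated in the text; a different genome size estimate would scale the coverage proportionally.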
A non-redundant assembly was created using CD-HIT ¹⁶, merging 100% identical sequences and sub-sequences. RNA-Seq data previously generated by our group ¹⁷ for the same cultivar were exploited for gene prediction using BRAKER1 ¹⁸ and PASA ¹⁹, as well as sugarcane transcript data (ESTs) and Sorghum bicolor proteins using Exonerate ²⁰. All gene evidence was integrated with EVidenceModeler ²¹ to generate a high-quality gene prediction set, leading to 153,078 predicted protein-coding genes.
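To make the redundancy-removal step concrete, the idea of merging 100% identical sequences and sub-sequences can be sketched as follows. This is a naive illustration, not CD-HIT's actual algorithm, which uses a much faster greedy clustering with short-word filtering. The sketch assumes forward-strand comparison only, and the function name and toy sequences are hypothetical.

```python
def remove_redundant(seqs):
    """Drop sequences that are identical to, or fully contained in,
    a longer (or equally long) sequence that has already been kept.

    seqs: dict mapping sequence name -> sequence string.
    Assumption: only the forward strand is compared.
    """
    kept = {}
    # Process longest first, so containers are seen before what they contain.
    # Ties are broken by name to make the result deterministic.
    ordered = sorted(seqs.items(), key=lambda item: (-len(item[1]), item[0]))
    for name, seq in ordered:
        if any(seq in other for other in kept.values()):
            continue  # exact duplicate or exact sub-sequence
        kept[name] = seq
    return kept


# Toy example with made-up contigs.
contigs = {
    "a": "ACGTACGT",
    "b": "CGTA",      # sub-sequence of "a"
    "c": "ACGTACGT",  # identical to "a"
    "d": "TTTT",
}
non_redundant = remove_redundant(contigs)
print(sorted(non_redundant))
```

This quadratic scan is only practical for small inputs; for an assembly of this size a dedicated tool such as CD-HIT is the appropriate choice.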

Data availability

Raw sequencing data are available at NCBI SRA; the long reads with accession number SRX845504, and the underlying short reads with accessions SRX853961 to SRX853969. The SP80-3280 assembly is available with accession number GCA_002018215.1. All data can be found under the BioProject PRJNA272769. Genome annotation is available from https://figshare.com/projects/Sugarcane_SP80-3280_draft_genome_annotation/22327
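For scripted retrieval it can help to expand the short-read accession range into individual identifiers. The helper below is a minimal sketch, assuming the range SRX853961 to SRX853969 is contiguous, as the wording above suggests; the function name is hypothetical and the code does not contact NCBI.

```python
import re


def expand_accession_range(first, last):
    """Expand a contiguous accession range such as SRX853961..SRX853969."""
    pattern = r"([A-Z]+)(\d+)"
    m_first = re.fullmatch(pattern, first)
    m_last = re.fullmatch(pattern, last)
    if not m_first or not m_last or m_first.group(1) != m_last.group(1):
        raise ValueError("accessions must share the same alphabetic prefix")
    prefix = m_first.group(1)
    width = len(m_first.group(2))          # keep zero padding, if any
    start, end = int(m_first.group(2)), int(m_last.group(2))
    if end < start:
        raise ValueError("last accession precedes first accession")
    return [f"{prefix}{n:0{width}d}" for n in range(start, end + 1)]


# Assumption: the range quoted in the text is contiguous.
short_read_experiments = expand_accession_range("SRX853961", "SRX853969")
print(len(short_read_experiments), short_read_experiments[0], short_read_experiments[-1])
```

The nine identifiers produced are consistent with the nine long read libraries described in the methods.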

Acknowledgements

The authors are grateful to Larissa Prado da Cruz (CTBE/CNPEM) for assistance with molecular biology procedures.

References

1. Long, Karp, Buckeridge: Feedstocks for Biofuels and Bioenergy. In Bioenergy & Sustainability: bridging the gaps. (eds. Souza GM, Victoria RL, Joly CA & Verdade LM), UNESCO. 2015;302–347.

2. Grivet, Arruda: Sugarcane genomics: depicting the complex genome of an important tropical crop. Curr Opin Plant Biol. 2002;5(2):122–127. PMID: 11856607. DOI: 10.1016/S1369-5266(02)00234-0

3. D’Hont: Unraveling the genome structure of polyploids using FISH and GISH; examples of sugarcane and banana. Cytogenet Genome Res. 2005;109(1–3):27–33. PMID: 15753555. DOI: 10.1159/000082378

4. Le Cunff, Garsmeur, Raboin: Diploid/polyploid syntenic shuttle mapping and haplotype-specific chromosome walking toward a rust resistance gene (Bru1) in highly polyploid sugarcane (2n approximately 12x approximately 115). Genetics. 2008;180(1):649–660. PMID: 18757946. DOI: 10.1534/genetics.108.091355. PMCID: PMC2535714

5. Miller, Dilley, Harkins: Initial genome sequencing of the sugarcane CP 96-1252 complex hybrid [version 1; referees: 1 approved]. F1000Res. 2017;6:688. DOI: 10.12688/f1000research.11629.1

6. Grativol, Regulski, Bertalan: Sugarcane genome sequencing by methylation filtration provides tools for genomic research in the genus Saccharum. Plant J. 2014;79(1):162–172. PMID: 24773339. DOI: 10.1111/tpj.12539. PMCID: PMC4458261

7. Okura, de Souza, de Siqueira Tada: BAC-Pool Sequencing and Assembly of 19 Mb of the Complex Sugarcane Genome. Front Plant Sci. 2016;7:342. PMID: 27047520. DOI: 10.3389/fpls.2016.00342. PMCID: PMC4804495

8. de Setta, Monteiro-Vitorello, Metcalfe: Building the sugarcane genome for biotechnology and identifying evolutionary trends. BMC Genomics. 2014;15(1):540. PMID: 24984568. DOI: 10.1186/1471-2164-15-540. PMCID: PMC4122759

9. Mattiello, Riaño-Pachón, Martins: Physiological and transcriptional analyses of developmental stages along sugarcane leaf. BMC Plant Biol. 2015;15:300. PMID: 26714767. DOI: 10.1186/s12870-015-0694-z. PMCID: PMC4696237

10. Hoang, Furtado, Mason: A survey of the complex transcriptome from the highly polyploid sugarcane genome using full-length isoform sequencing and de novo assembly from short read sequencing. BMC Genomics. 2017;18(1):395. PMID: 28532419. DOI: 10.1186/s12864-017-3757-8. PMCID: PMC5440902

11. Belesini, Carvalho FMS, Telles: De novo transcriptome assembly of sugarcane leaves submitted to prolonged water-deficit stress. Genet Mol Res. 2017;16(2). PMID: 28549198. DOI: 10.4238/gmr16028845

12. Porebski, Bailey, Baum: Modification of a CTAB DNA extraction protocol for plants containing high polysaccharide and polyphenol components. Plant Mol Biol Rep. 1997;15(1):8–15. DOI: 10.1007/BF02772108

13. McCoy, Taylor, Blauwkamp: Illumina TruSeq synthetic long-reads empower de novo assembly and resolve complex, highly-repetitive transposable elements. PLoS One. 2014;9(9):e106689. PMID: 25188499. DOI: 10.1371/journal.pone.0106689. PMCID: PMC4154752

14. Altschul, Gish, Miller: Basic local alignment search tool. J Mol Biol. 1990;215(3):403–410. PMID: 2231712. DOI: 10.1016/S0022-2836(05)80360-2

15. Myers, Sutton, Delcher: A Whole-Genome Assembly of Drosophila. Science. 2000;287(5461):2196–2204. PMID: 10731133. DOI: 10.1126/science.287.5461.2196

16. Niu, Zhu: CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150–3152. PMID: 23060610. DOI: 10.1093/bioinformatics/bts565. PMCID: PMC3516142

17. Riaño-Pachón, Mattiello, Cruz: Surveying the complex polyploid sugarcane genome sequence using synthetic long reads. Technical Memorandum Centro Nacional de Pesquisa em Energia e Materiais. 2016. DOI: 10.13140/RG.2.1.3468.0565

18. Hoff, Lange, Lomsadze: BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation with GeneMark-ET and AUGUSTUS. Bioinformatics. 2016;32(5):767–9. PMID: 26559507. DOI: 10.1093/bioinformatics/btv661

19. Haas, Delcher, Mount: Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 2003;31(19):5654–66. PMID: 14500829. DOI: 10.1093/nar/gkg770. PMCID: PMC206470

20. Slater, Birney: Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics. 2005;6:31. PMID: 15713233. DOI: 10.1186/1471-2105-6-31. PMCID: PMC553969

21. Haas, Salzberg, Zhu: Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 2008;9(1):R7. PMID: 18190707. DOI: 10.1186/gb-2008-9-1-r7. PMCID: PMC2395244

10.5256/f1000research.13012.r23980

Reviewer response for version 2

Chakravarthi Mohan (1), Referee, https://orcid.org/0000-0002-4494-7699

1 Department of Genetics and Evolution, Federal University of São Carlos, São Carlos, Brazil

Competing interests: No competing interests were disclosed.

1 8 2017

2017

This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

recommendation

approve

No further comments.

Are sufficient details of methods and materials provided to allow replication by others?

Yes

Is the rationale for creating the dataset(s) clearly described?

Yes

Are the datasets clearly presented in a useable and accessible format?

Yes

Are the protocols appropriate and is the work technically sound?

Yes

Reviewer Expertise:

Sugarcane genetic engineering, transcriptomics

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

10.5256/f1000research.12814.r23667

Reviewer response for version 1

Chakravarthi Mohan (1), Referee, https://orcid.org/0000-0002-4494-7699

1 Department of Genetics and Evolution, Federal University of São Carlos, São Carlos, Brazil

Competing interests: No competing interests were disclosed.

21 6 2017

2017

recommendation

approve

The data note entitled 'Draft genome sequencing of the sugarcane hybrid SP80-3280' is perhaps the first report describing the whole genome of sugarcane, a complex polyploid, and its availability in NCBI will be a boon to sugarcane researchers.

The study is well planned, executed and well drafted. The data presented here would be particularly useful for functional genomic studies in sugarcane.

Are sufficient details of methods and materials provided to allow replication by others?

Yes

Is the rationale for creating the dataset(s) clearly described?

Yes

Are the datasets clearly presented in a useable and accessible format?

Yes

Are the protocols appropriate and is the work technically sound?

Yes

Reviewer Expertise:

Sugarcane genetic engineering, transcriptomics

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Diego Mauricio Riaño-Pachón, University of São Paulo, Brazil

Competing interests: No competing interests were disclosed.

23 6 2017

Dear Dr. Mohan,

thank you for your review of our data note. In version 2 of the note we have added links for the genome annotation in addition to the genome assembly.

Best regards,

Diego

10.5256/f1000research.12814.r23398

Reviewer response for version 1

Jason Miller (1), Referee, https://orcid.org/0000-0002-6912-2925

1 J. Craig Venter Institute, Rockville, MD, USA

Competing interests: No competing interests were disclosed.

15 6 2017

2017

recommendation

approve

Summary:

The Data Note, "Draft genome sequencing of the sugarcane hybrid SP80-3280", describes a sugarcane genome assembly that is available at NCBI. The TruSeq method was applied to a monoploid sugarcane cultivar to generate a 1.2 gigabase assembly with an 8433 contig N50 according to GenBank. This is the first sugarcane genome assembly, so it will be of interest to the field. This data note is especially useful because it describes the sequence filtering by size, BLAST, mirabait, and CD-HIT prior to release.

Suggestions:

The sentence, “there are not whole genome assemblies available”, probably should say “there are no whole genome assemblies available”. The text could be made clearer by presenting all the statistics for the underlying short reads before getting to the synthetic long read stats, and by specifying that the BLAST filter was applied to the long reads. I would appreciate a reference for Celera Assembler, but that is just me.

Are sufficient details of methods and materials provided to allow replication by others?

Yes

Is the rationale for creating the dataset(s) clearly described?

Yes

Are the datasets clearly presented in a useable and accessible format?

Yes

Are the protocols appropriate and is the work technically sound?

Yes

Reviewer Expertise:

Genome assembly

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Diego Mauricio Riaño-Pachón, University of São Paulo, Brazil

Competing interests: No competing interests were disclosed.

23 6 2017

Dear Dr. Miller,

thank you very much for your review of our data note. We have followed your main suggestions, and they are available as version 2 of the data note.

Best regards,

Diego