Keywords
RNA-seq, transcriptome, Computational genomics, chickpea, Cicer arietinum, re-validation, Intermediate assembly data, Big Data, Bioproject
RNA-seq, transcriptome, Computational genomics, chickpea, Cicer arietinum, re-validation, Intermediate assembly data, Big Data, Bioproject
The revised manuscript has the following changes based on the referee suggestions:
1. Changed title.
2. Analyzed the ribulose bisphosphate carboxylase small chain genes to ascertain the nucleotide homology
with corresponding transcripts.
3. Major re-writing of the Methods section.
4. Minor corrections based on referee inputs.
See the author's detailed response to the review by Björn Voß
The lack of reproducibility of results in biology is a contentious subject3,4. In computational studies, the exact replication of the output of most computer programs is difficult as most non-trivial algorithms are nondeterministic. The problem is compounded by recent technological advances generating "Big Data" involving multiple programs and pipelines5,6. However, inferences based on these results should not be subject to the same, or ideally any, unpredictability. The availability of software used at each stage and the intermediate data generated is key in enabling debugging and tracking the veracity of results by subsequent researchers7.
Chickpea (Cicer arietinum L.) is an important pulse crop having numerous nutritional and health benefits8. Several online resources exist for chickpea genomes and transcriptomes (http://www.cicer.info/databases.php, http://www.nipgr.res.in/ctdb.html, http://gigadb.org/dataset/100076, http://nipgr.res.in/CGAP/9). Interestingly, the 68th United Nations General Assembly has declared 2016 as the International Year of Pulses (IYP).
The transcriptome of chickpea has been sequenced1 using RNA-seq10,11. In contrast to other traditional methods like RNA:DNA hybridization12 and short sequence-based approaches13, RNA-seq can detect transcripts with very low expression levels by increasing sequencing depth. YeATS is a work-flow for analyzing RNA-seq data2, and was used to detect a second homolog of a polyphenol oxidase gene and ~130 genes in the large gallate 1-β-glucosyltransferase in walnut14. YeATS analysis of RNA-seq data from 20 different tissues of walnut in California unravelled detailed, tissue-specific information of ~400 transcripts encoded by a large family of resistance (R) genes and elucidated the biodiversity and possible plant–microbe interactions15.
In the current work, errors arising from the assembly step as identified by YeATS are shown to have a bearing on the transcript quantification. The chickpea transcriptome (http://www.nipgr.res.in/ctdb.html1) and its quantification in different tissues provided by an online interface is analyzed. This demonstrates that transcripts which are tagged with high counts predominantly encode multiple genes. While some studies provide the assembled sequences16,17, the author could not find the relevant data, even after personal communication18. It is also proposed that the availability of intermediate assembly sequences be made mandatory, in line with the recent initiative Global Open Data (GODAN:http://www.godan.info/).
The transcriptome of the Cicer arietinum (transHybrid.fasta, ICC4958; Desi chickpea) “represents optimized de novo hybrid assembly of 454 and short-read sequence data. About 2 million 454 reads were assembled using Newbler v2.3 followed by hybrid assembly with 53 409 transcripts generated by optimized shortread data assembly reported previously1 using TGICL program” (http://www.nipgr.res.in/ctdb.html1). Quantification of transcripts in different tissues is provided by an online interface. The chickpea genome (Cicer_arietinum_GA_v1.0.fa) is obtained from http://gigadb.org/dataset/10007619.
There are three classes of errors. Type I error occurs when a single transcript has multiple ORFs with significant matches to different genes. In a Type II error a single gene is broken into two separate transcripts. For a type III error, a single transcript has multiple ORFs, but they all map to the same gene. Multiple sequence alignment was done using MAFFT (v7.123b)20, and sequence alignment figures were generated using the ENDscript server21. ORFs were obtained using the ‘getorf’ utility in the EMBOSS suite22.
The focus of this study are highly expressed transcripts, which result from have Type I errors. The top five transcripts from different tissues are extracted from the online interface, and the three longest ORF from each transcript BLAST’ed to the ’nr’ BLAST database. These ORFs can overlap on different strands, with the assumption that the homology results will select a unique ORF. The transcripts having more than one unique ORF with a significant match (E-value=1E-8, BLAST bitscore=~75) are noted.
In order to analyze Type II errors, all transcripts need to be assigned to ORFs. YeATS first excluded transcripts that did not align to the chickpea genome (using BLAST), since transcriptomes can have contamination’s15,23. These are not removed in some studies before making inferences16, which is worrying since there are mis-annotation of certain genes (a transcript, CSTC018609, encoding a heat shock protein from Cladosporium cladosporioides has been annotated as a saffron gene)23. While using the full ’nr’ BLAST database is the best option, a smaller BLAST database of protein peptides (plantpep.fasta:1M seqeunces) using ~30 organisms (list.plants) from the Ensembl genome was created to reduce computational times24.
The chickpea transcriptome (transHybrid.fasta:n=347601) was first mapped to the chickpea genome (Cicer_arietinum_GA_v1.0.fa) obtained from http://gigadb.org/dataset/100076 using BLAST (E-value=1E-8, BLAST bitscore=~75). There were 60 unmapped transcripts, some of which are mitochondrial transcripts (list.mito in Dataset1), some are contamination (list.meta in Dataset1), and the rest have no match in the complete BLAST ‘nt’ database (list.nomatchinNT in Dataset 1). The metagenomic transcripts are removed from further processing.
Subsequently, the three longest ORFs from each transcript (list.3ORFS:n=104K in Dataset 1) was BLAST’ed25 to a subset of plant proteins created from the Ensembl24 database (E-value=1E-8, BLAST bitscore=~75).
There are ~1300 transcripts encoding more than one significant peptide (see list.duplicate in Dataset 1). The top five highly expressed transcripts (HET) from five tissues flower bud (FB), mature leaf (ML), root (RT), shoot (SH), young plant (YP) were obtained from http://www.nipgr.res.in/ctdb.html (tissues.txt in Dataset1). The number of transcripts encoding multiple genes, as found from ’list.duplicate’, were are follows FB:4, ML:5, RT:3, SH:4 YP:5, indicating an over-representation of merged transcripts in HET.
The top five HET in the root, and the genes encoded by them are listed in Table 1. The ORFs of TC00004 encoded in the reverse direction map to a partial retinoblastoma-binding-like protein (~500 out of 549 amino acids) and the complete senescence-associated protein (157 amino acids) (Figure 1). TC00004 is computed to be highly transcribed (ranked in the top five) in all five tissues studied here (tissues.txt in Dataset 1). The top ranking transcript, TC00002, has two ORFs - ORF.11 and ORF.38 (TC00002.orf in Dataset 1) - which align to an ATP synthase subunit beta (327 aa) and senescence-associated protein (615 aa), respectively. This transcript is highly fragmented and encodes on both strands in an overlapping manner.
The RBLP ORFs are fragmented, and combine to ~500 aa. TC00004 is computed to be highly transcribed (ranked in the top five) in all five tissues studied here.
TrLen: length of transcript, PLen: length of the full protein, OLen: length of ORF, RPKM: Reads Per Kilobase of transcript per Million mapped reads. Three out of the five transcripts have Type I errors, where two different transcripts are merged.
When two genes are adjacent to each other in the genome, and are both transcribed it is possible that an assembler merges these into one transcript. Also, it is possible that these loci are under the same transcriptional control and the expression counts are correct. Thus, high levels of expression of one gene should correlate with high expression level of the other, assuming a proper normalization. However, the over-representation of such merged transcripts in HET suggests that there might be some errors in counting.
There are transcripts that encode fragmented ORFs which map to the same protein. YeATS has a merging algorithm that identifies overlapping amino acid sequences in transcripts. For example, TC13991 (47 aa) and TC23009 (110 aa) have an overlapping amino acid sequence "ASNGGRVHC", and both map to a ribulose bisphosphate carboxylase small chain (RBCSC, 180 aa) family protein (Figure 2). There are three genes coding for RBCSC in chickpea. RBCSC-A and RBCSC-B map to chromosome 2 (CA2 in Cicer_arietinum_GA_v1.0.fa), and share a high degree of homology. Furthermore, there are two transcripts that map to RBCSC-A and B (TC23009 and TC01487). However, the high degree of homology of these genes make it difficult to assign each exactly to the two genomic regions (Table 2 and Table 3). Another gene RBCSC-C maps to SCAFFOLD2575 in the genome and is not considered further. The TC01487 ORF has slightly more homology to the RBCSC-A ORF, although the nucleotide sequences show that it more similar to RBCSC-B. The high sequence homology notwithstanding, TC23009 and transcripts encode significantly different amino acid sequences (40% identity), and there is an extra intron in TC23009 (Table 3). The N-terminal fragment TC13991 is slightly biased towards RBCSC-A both through nucleotide sequences and ORFs. Therefore, since the amino acid sequences from the end of TC13991 ORF and the beginning of TC23009 ORF overlap, TC23009 can be assigned to RBCSC-A. However, the expression counts of TC13991 and TC23009 are not in agreement (Table 4). TC13991 has a count of 17.90, while TC23009 has a count of 1403.8. The large difference indicates an erroneous normalization algorithm, since there is only one expressed transcript of this gene (the ORF of TC13991 has no other significant match among other transcripts, E-value=1E-8, BLAST bitscore=~75). TC23009 is ranked fifth among HET in the shoot, while all better ranked transcripts as transcripts having Type I errors. Even if TC13991 were to correspond to RBCSC-B, the absence of the N-terminal of RBCSC-A that has a high level of expression is a concern. Furthermore, there is a missing 22 aa stretch in the C-terminal peptide "IGFDNVRQVQCISFIAHTPKEF", which has no match in the transcripts. A similar scenario with fragmented transcripts and a missing fragment was detected by YeATS in a polyphenol oxidase gene in walnut14.
Also, their counts are significantly different - TC13991 has a count of 17.90, while TC23009 has a count of 1403.8, indicating an erroneous normalization algorithm. The ORF of TC13991 does not have other significant matches among other transcripts. Furthermore, the C-terminal peptide "IGFDNVRQVQCISFIAHTPKEF" has no matches in the transcripts.
Two (RBCSC-A and RBCSC-B) of them map to chromosome 2, and the third (RBCSC-C) to SCAFFOLD2575. The transcripts mapping to RBCSC-A and RBCSC-B are TC13991, TC23009 and TC01487. TC13991 and TC23009 can be merged into one transcript based on an overlapping amino acid sequence "ASNGGRVHC". There is no other transcript homologous to TC13991 - either through nucleotide or peptide BLAST. TRS: transcript, FR: Fragment.
For example, TC23009 has the exact homologies to both genomic regions, and has an intron (of length 131 nt in RBCSC-A and 137 nt in RBCSC-B). TC01487 does not have an intron. However, as mentioned previously, TC13991 and TC23009 can be merged based on a overlapping sequence. GS: start in genomic sequence of ORF in chromosome 2, GE: end in genomic sequence of ORF in chromosome 2, BBS = BLAST bitscore.
Although TC13991 and TC23009 map to the same gene, there is a huge discrepancy in their counts.
Gene | Transcript | Length (nt) | Shoot | Root | Mature leaf | Flower bud | Young pod |
---|---|---|---|---|---|---|---|
RBCSC-A | TC13991 TC23009 | 201 332 | 17.9 1403.8 | 0 0 | 0 441.5 | 0 99.5 | 0 153.8 |
RBCSC-B | TC01487 | 617 | 49.2 | 0 | 23.2 | 0 | 3.7 |
RBCSC-C | TC10770 | 538 | 17.9 | 0 | 0 | 0 | 11 |
There are ~3000 transcripts which encode more than one ORFs mapping to the same peptide (list.splitORF in Dataset 1). TC01688 is one such transcript having ORF.70 and ORF.89 (see TC01688.orf in Dataset 1) mapping to an aspartyl protease family protein (TAIRid:AT1G05840.1) with BLAST bitscores 250 and 285, respectively. Merging ORF.70 and ORF.89 (inserting ‘ZZZ’) results in an increased BLAST bitscore of 507 (Figure 3). This should have minimal effects on the counts, unless Type I or Type II also occur simultaneously.
The BLAST bitscore of the merged ORF increases to 507. Errors like this should have minimal effects on the counts, unless Type I or Type II also occur simultaneously.
In the current work, assembler errors have been categorized into three types. These errors have been analyzed for the chickpea transcriptome sequence, and anomalies in the quantification have been detected. The availability of assembled transcriptome sequence has enabled such analysis. It is proposed that sequences of assembled transcriptomes be linked to the Bioproject. Such initiatives have been adopted in the Global Open Data (http://www.godan.info/pages/statement-purpose) to “to make agricultural and nutritionally relevant data available, accessible, and usable for unrestricted use worldwide”.
F1000Research: Dataset 1. Raw data for YeATS on chickpea transcriptome, 10.5256/f1000research.9667.d13681626
I gratefully acknowledge Mridul Bhattacharjee and Nitin Salaye for logistic support.
Views | Downloads | |
---|---|---|
F1000Research | - | - |
PubMed Central
Data from PMC are received and updated monthly.
|
- | - |
Competing Interests: No competing interests were disclosed.
Competing Interests: No competing interests were disclosed.
Competing Interests: No competing interests were disclosed.
Alongside their report, reviewers assign a status to the article:
Invited Reviewers | ||
---|---|---|
1 | 2 | |
Version 2 (revision) 06 Dec 16 |
read | |
Version 1 27 Sep 16 |
read | read |
Click here to access the data.
Spreadsheet data files may not format correctly if your computer is using different default delimiters (symbols used to separate values into separate cells) - a spreadsheet created in one region is sometimes misinterpreted by computers in other regions. You can change the regional settings on your computer so that the spreadsheet can be interpreted correctly.
Provide sufficient details of any financial or non-financial competing interests to enable users to assess whether your comments might lead a reasonable person to question your impartiality. Consider the following examples, but note that this is not an exhaustive list:
Sign up for content alerts and receive a weekly or monthly email with all newly published articles
Already registered? Sign in
The email address should be the one you originally registered with F1000.
You registered with F1000 via Google, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Google account password, please click here.
You registered with F1000 via Facebook, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Facebook account password, please click here.
If your email address is registered with us, we will email you instructions to reset your password.
If you think you should have received this email but it has not arrived, please check your spam filters and/or contact for further assistance.
Comments on this article Comments (1)