ALL Metrics
-
Views
-
Downloads
Get PDF
Get XML
Cite
Export
Track
Research Note

RNA-seq assembler artifacts can bias expression counts and differential expression analysis - application of YeATS on the chickpea transcriptome

[version 1; peer review: 2 not approved]
PUBLISHED 27 Sep 2016
Author details Author details
OPEN PEER REVIEW
REVIEWER STATUS

Abstract

Background: The unprecedented volume of genomic and transcriptomic data analyzed by software pipelines makes verification of inferences based on such data, albeit theoretically possible, a challenging proposition. The availability of intermediate data can immensely aid re-validation efforts. One such example is the transcriptome, assembled from raw RNA-seq reads, which is frequently used for annotation and quantification of genes transcribed. The quality of the assembled transcripts influences the accuracy of inferences based on them.
Method: Here the publicly available transcriptome from Cicer arietinum (ICC4958; Desi chickpea, http://www.nipgr.res.in/ctdb.html) was analyzed using YeATS.
Results and Conclusion: The analysis revealed that a majority of the highly expressed transcripts (HET) encoded multiple genes, strongly indicating that the counts may have been biased by the merging of different transcripts. TC00004 is ranked in the top five HET for all five tissues analyzed here, and encodes both a retinoblastoma-binding-like protein (E-value=0) and a senescence-associated protein (E-value= 5e-108). Fragmented transcripts are another source of error. The ribulose bisphosphate carboxylase small chain (RBCSC) protein is split into two transcripts with an overlapping amino acid sequence ”ASNGGRVHC”, TC13991 and TC23009, with length 201 and 332 nucleotides and expression counts 17.90 and 1403.8, respectively.
The huge difference in counts indicates an erroneous normalization algorithm in determining counts. It is well known that RBCSC is highly expressed and expectedly TC23009 ranks fifth among HETs in the shoot. Furthermore, some transcripts are split into open reading frames that map to the same protein, although this should not have any significant bearing on the counts. It is proposed that studies analyzing differential expression based on the transcriptome should consider these artifacts, and providing intermediate assembled transcriptomes should be mandatory, possibly with a link to the raw sequence data (Bioproject).

Keywords

RNA-seq, transcriptome, Computational genomics, chickpea, Cicer arietinum, re-validation, Intermediate assembly data, Big Data, Bioproject

Introduction

The lack of reproducibility of results in biology is a contentious subject1,2. In computational studies, the exact replication of the output of most computer programs is difficult as most non-trivial algorithms use heuristics. The problem is compounded by recent technological advances generating "Big Data" involving multiple programs and pipelines3,4. However, inferences based on these results should not be subject to the same, or ideally any, unpredictability. The availability of software used at each stage and the intermediate data generated is key in enabling debugging and tracking the veracity of results by subsequent researchers5.

Chickpea (Cicer arietinum L.) is an important pulse crop having numerous nutritional and health benefits6. Several online resources exist for chickpea genomes and transcriptomes (http://www.cicer.info/databases.php, http://www.nipgr.res.in/ctdb.html, http://gigadb.org/dataset/100076, http://nipgr.res.in/CGAP/7). Interestingly, the 68th United Nations General Assembly has declared 2016 as the International Year of Pulses (IYP).

The RNA-seq8,9 derived transcriptome of chickpea has also been sequenced10. In contrast to other traditional methods like RNA:DNA hybridization11 and short sequence-based approaches12, RNA-seq detects transcripts with very low expression levels. YeATS is a work-flow for analyzing RNA-seq data13, and was used to detect a second homolog of a polyphenol oxidase gene and ~130 genes in the large gallate 1-β- glucosyltransferase in walnut14. YeATS analysis of RNA-seq data from 20 different tissues of walnut in California unravelled detailed, tissue-specific information of ~400 transcripts encoded by a large family of resistance (R) genes and elucidated the biodiversity and possible plant–microbe interactions15.

In the current work, errors arising from the assembly step as identified by YeATS are shown to have a bearing on the transcript quantification. The chickpea transcriptome (http://www.nipgr.res.in/ctdb.html10) and its quantification in different tissues provided by an online interface is analyzed. This demonstrates that transcripts which are tagged with high counts predominantly encode multiple genes. While some studies provide the assembled sequences16,17, the author could not find the relevant data, even after personal communication18. It is also proposed that the availability of intermediate assembly sequences be made mandatory, in line with the recent initiative Global Open Data (GODAN: http://www.godan.info/).

Materials and methods

The transcriptome of the Cicer arietinum (transHybrid.fasta, ICC4958; Desi chickpea) “represents optimized de novo hybrid assembly of 454 and short-read sequence data. About 2 million 454 reads were assembled using Newbler v2.3 followed by hybrid assembly with 53 409 transcripts generated by optimized short-read data assembly reported previously10 using TGICL program” (http://www.nipgr.res.in/ctdb.html10). Quantification of transcripts in different tissues is provided by an online interface. The chickpea genome (Cicer_arietinum_GA_v1.0.fa) is obtained from http://gigadb.org/dataset/10007619.

YeATS13 analyzed the post assembly transcripts, and first excluded transcripts that did not align to the genome. A BLAST database of protein peptides (plantpep.fasta:1M sequences) using ~30 organisms (list.plants) from the Ensembl genome was created20. The three longest ORFs were obtained using the ‘getorf’ utility in the EMBOSS suite21. These ORFs are BLAST’ed22 to the 'plantpep.fasta’. We identify three classes of errors. Type I error occurs when a single transcript has multiple ORFs with significant matches to different genes. In a Type II error a single gene is broken into two separate transcripts. For a type III error, a single transcript has multiple ORFs, but they all map to the same gene. Multiple sequence alignment was done using MAFFT (v7.123b)23, and sequence alignment figures were generated using the ENDscript server24.

Results and discussion

The chickpea transcriptome (transHybrid.fasta:n=3476010) was first mapped to the chickpea genome (Cicer_arietinum_GA_v1.0.fa) obtained from http://gigadb.org/dataset/100076. There were 60 unmapped transcripts, some of which are mitochondrial transcripts (list.mito in Dataset1), some are metagenomic contamination (list.meta in Dataset1), and the rest have no match in the complete BLAST ‘nt’ database (list.nomatchinNT in Dataset1). The metagenomic transcripts are removed from further processing.

Subsequently, each transcript was split into three ORFs (list.3ORFS:n=104K in Dataset1), each of which was BLAST’ed22 to a subset of plant proteins created from the Ensembl20 database (see Methods).

Type I error: multiple ORFs mapping to different proteins

There are ~1300 transcripts encoding more than one significant peptide (see list.duplicate in Dataset1). The top five highly expressed transcripts (HET) from five tissues - flower bud (FB), mature leaf (ML), root (RT), shoot (SH), young plant (YP) - were obtained from http://www.nipgr.res.in/ctdb.html (tissues.txt in Dataset1). The number of transcripts encoding multiple genes, as found from ’list.duplicate’, were are follows - FB:4, ML:5, RT:3, SH:4 YP:5, indicating an over-representation of merged transcripts in HET.

The top five HET in the root, and the genes encoded by them are listed in Table 1. The ORFs of TC00004 encoded in the reverse direction map to a partial retinoblastoma-binding-like protein (~500 out of 549 amino acids) and the complete senescence-associated protein (157 amino acids) (Figure 1). TC00004 is computed to have highly transcribed (ranked in the top five) in all five tissues studied here (tissues.txt in Dataset1). The top ranking transcript, TC00002, has two ORFs - ORF.11 and ORF.38 (TC00002.orf in Dataset1) - which align to an ATP synthase subunit beta (327 aa) and senescence-associated protein (615 aa), respectively. This transcript is highly fragmented and encodes on both strands in an overlapping manner.

Table 1. Five highly expressed transcripts in the root of Chickpea: These are obtained from the online interface http://www.nipgr.res.in/ctdb.html.

TrLen: length of transcript, PLen: length of the full protein, OLen: length of ORF, RPKM: Reads Per Kilobase of transcript per Million mapped reads. Three out of the five transcripts have Type I errors, where two different transcripts are merged.

TrIDTrLenPLenOLenAccessionDescriptionE-valueRPKM
TC00002894615
327
79
70
XP_013443004.1
CEG35068.1
senescence-associated [M. truncatula]
ATP synthase subunit beta [P. halstedii]
1e-28
3e-36
25543
TC000043040549
157
353
193
XP_003628041.1
XP_013443005.1
retinoblastoma-binding-like [M. truncatula]
senescence-associated [M. truncatula]
0
5e-108
10290
TC228211303303323XP_004488666.1chitinase 2-like [C. arietinum]06162
TC07055652115121AII99866.1seed storage/ltp family [C. arietinum]5e-744998
TC001203859378
138
163
108
CDY45505.1
YP_173374.1
BnaCnng12640D [B. napus]
hypothetical NitaMp027 [N. tabacum]
1e-109
1e-20
4681
2ba61660-0fad-4c62-a5c8-274d38fc5468_figure1.gif

Figure 1. Type I error: different ORFs from the same transcript map to the different protein: TC00004 (3040 nt long) has ORFs encoded in the reverse direction (3040– >1) that map to different proteins: retinoblastoma-binding-like protein (RBLP: 549 aa) and the complete senescence-associated protein (SAP: 157 aa).

The RBLP ORFs are fragmented, and combine to ~500 aa. TC00004 is computed to be highly transcribed (ranked in the top five) in all five tissues studied here.

Since the merged transcripts are proximally located in the genome, it is possible that these loci are under the same transcriptional control and the expression counts are correct. However, the over-representation of such merged transcripts in HET suggests that there might be some errors in counting.

Type II error: fragmented ORFs of the same protein encoded by different transcripts

There are transcripts that encode fragmented ORFs which map to the same protein. YeATS has a merging algorithm that identifies overlapping amino acid sequences in transcripts. For example, TC13991 (35 aa) and TC23009 (110 aa) have an overlapping amino acid sequence "ASNGGRVHC", and both map to a ribulose bisphosphate carboxylase small chain (RBCSC, 180 aa) family protein (Figure 2). TC13991 has a count of 17.90, while TC23009 has a count of 1403.8. The large difference indicates an erroneous normalization algorithm, since there is only one expressed transcript of this gene (the ORF of TC13991 has no other significant match among other transcripts), although there are other genomic variants. TC23009 is ranked fifth among HET in the shoot, while all better ranked transcripts are transcripts having Type I errors. Furthermore, there is a missing 35 aa stretch in the C-terminal peptide "IGFDNVRQVQCISFIAHTPKEF", which has no match in the transcripts. A similar scenario with fragmented transcripts and a missing fragment was detected by YeATS in a polyphenol oxidase gene in walnut14.

2ba61660-0fad-4c62-a5c8-274d38fc5468_figure2.gif

Figure 2. Type II error: different transcripts that encode fragmented ORFs which map to the same protein: TC13991 and TC23009 have an overlapping amino acid sequence "ASNGGRVHC", and should have been ideally merged.

Also, their counts are significantly different - TC13991 has a count of 17.90, while TC23009 has a count of 1403.8, indicating an erroneous normalization algorithm. The ORF of TC13991 does not have other significant matches among other transcripts. Furthermore, the C-terminal peptide "IGFDNVRQVQCISFIAHTPKEF" has no matches in the transcripts.

Type III error: multiple ORFs mapping to the same protein

There are ~3000 transcripts which encode more than one ORFs mapping to the same peptide (list.splitORF in Dataset1). TC01688 is one such transcript having ORF.70 and ORF.89 (see TC01688.orf in Dataset1) mapping to an aspartyl protease family protein (TAIRid:AT1G05840.1) with BLAST bitscores 250 and 285, respectively. Merging ORF.70 and ORF.89 (inserting ‘ZZZ’) results in an increased BLAST bitscore of 507 (Figure 3). This should have minimal effects on the counts, unless Type I or Type II also occur simultaneously.

2ba61660-0fad-4c62-a5c8-274d38fc5468_figure3.gif

Figure 3. Type III error: multiple ORFs from the same transcript mapping to the same protein: An aspartyl protease (TC01688) has ORF.70 and ORF.89 mapping to TAIRid:AT1G05840.1 with BLAST bitscores 250 and 285, respectively.

The BLAST bitscore of the merged ORF increases to 507. Errors like this should have minimal effects on the counts, unless Type I or Type II also occur simultaneously.

Dataset 1.Raw data for YeATS on chickpea transcriptome.
A description of each file is provided in the text file ’Dataset Description’

Conclusions

In the current work, assembler errors have been categorized into three types. These errors have been analyzed for the chickpea transcriptome sequence, and anomalies in the quantification have been detected. The availability of assembled transcriptome sequence has enabled such analysis. It is proposed that sequences of assembled transcriptomes be linked to the Bioproject. Such initiatives have been adopted in the Global Open Data (http://www.godan.info/pages/statement-purpose) to “to make agricultural and nutritionally relevant data available, accessible, and usable for unrestricted use worldwide”.

Data availability

F1000Research: Dataset 1. Raw data for YeATS on chickpea transcriptome, 10.5256/f1000research.9667.d13681625

Comments on this article Comments (0)

Version 2
VERSION 2 PUBLISHED 27 Sep 2016
Comment
Author details Author details
Competing interests
Grant information
Copyright
Download
 
Export To
metrics
Views Downloads
F1000Research - -
PubMed Central
Data from PMC are received and updated monthly.
- -
Citations
CITE
how to cite this article
Chakraborty S. RNA-seq assembler artifacts can bias expression counts and differential expression analysis - application of YeATS on the chickpea transcriptome [version 1; peer review: 2 not approved]. F1000Research 2016, 5:2394 (https://doi.org/10.12688/f1000research.9667.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.
track
receive updates on this article
Track an article to receive email alerts on any updates to this article.

Open Peer Review

Current Reviewer Status: ?
Key to Reviewer Statuses VIEW
ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions
Version 1
VERSION 1
PUBLISHED 27 Sep 2016
Views
62
Cite
Reviewer Report 05 Dec 2016
Lilah Toker, University of British Columbia, Vancouver, BC, USA 
Not Approved
VIEWS 62
The article "RNA-seq assembler artifacts can bias expression counts and differential expression analysis application of YeATS on the chickpea transcriptome" by Sandeep Chakraborty describes the use of a workflow developed by the author for identification of assembly artifacts in the ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Toker L. Reviewer Report For: RNA-seq assembler artifacts can bias expression counts and differential expression analysis - application of YeATS on the chickpea transcriptome [version 1; peer review: 2 not approved]. F1000Research 2016, 5:2394 (https://doi.org/10.5256/f1000research.10416.r17521)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Views
74
Cite
Reviewer Report 11 Nov 2016
Björn Voß, University of Stuttgart, Stuttgart, Germany 
Not Approved
VIEWS 74
The article "RNA-seq assembler artifacts can bias expression counts and differential expression analysis - application of YeATS on the chickpea transcriptome" reports putative assembly artifacts in the chickpea transcriptome and discusses possible impacts on the expression counts. The main message ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Voß B. Reviewer Report For: RNA-seq assembler artifacts can bias expression counts and differential expression analysis - application of YeATS on the chickpea transcriptome [version 1; peer review: 2 not approved]. F1000Research 2016, 5:2394 (https://doi.org/10.5256/f1000research.10416.r17160)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 24 Nov 2016
    Sandeep Chakraborty, Celia Engineers, T. T. C Industrial Area, Rabale, Navi Mumbai, India
    24 Nov 2016
    Author Response
    Dear Dr Voß,

    I would like to thank you for taking the time to review this paper in detail, and providing constructive criticism on the overall manuscript. I also ... Continue reading
COMMENTS ON THIS REPORT
  • Author Response 24 Nov 2016
    Sandeep Chakraborty, Celia Engineers, T. T. C Industrial Area, Rabale, Navi Mumbai, India
    24 Nov 2016
    Author Response
    Dear Dr Voß,

    I would like to thank you for taking the time to review this paper in detail, and providing constructive criticism on the overall manuscript. I also ... Continue reading

Comments on this article Comments (0)

Version 2
VERSION 2 PUBLISHED 27 Sep 2016
Comment
Alongside their report, reviewers assign a status to the article:
Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations - A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions
Sign In
If you've forgotten your password, please enter your email address below and we'll send you instructions on how to reset your password.

The email address should be the one you originally registered with F1000.

Email address not valid, please try again

You registered with F1000 via Google, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Google account password, please click here.

You registered with F1000 via Facebook, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Facebook account password, please click here.

Code not correct, please try again
Email us for further assistance.
Server error, please try again.