YeATSAM analysis of the walnut and chickpea transcriptome reveals key genes undetected by current annotation tools

Background: The transcriptome, a treasure trove of gene space information, remains severely under-used by current genome annotation methods. Methods: Here, we present an annotation method in the YeATS suite (YeATSAM), based on information encoded by the transcriptome, that demonstrates artifacts of the assembler, which must be addressed to achieve proper annotation. Results and Discussion: YeATSAM was applied to the transcriptome obtained from twenty walnut tissues and compared to MAKER-P annotation of the recently published walnut genome sequence (WGS). MAKER-P and YeATSAM both failed to annotate several hundred proteins found by the other. Although many of these unannotated proteins have repetitive sequences (possibly transposable elements), other crucial proteins were excluded by each method. An egg cell-secreted protein and a homer protein were undetected by YeATSAM, although these did not produce any transcripts. Importantly, MAKER-P failed to classify key photosynthesis-related proteins, which we show emanated from Trinity assembly artifacts potentially not handled by MAKER-P. Also, no proteins from the large berberine bridge enzyme (BBE) family were annotated by MAKER-P. BBE is implicated in biosynthesis of several alkaloids metabolites, like anti-microbial berberine. As further validation, YeATSAM identified ~1000 genes that are not annotated in the NCBI database by Gnomon. YeATSAM used a RNA-seq derived chickpea ( Cicer arietinum L.) transcriptome assembled using Newbler v2.3. Conclusions: Since the current version of YeATSAM does not have an ab initio module, we suggest a combined annotation scheme using both MAKER-P and YeATSAM to comprehensively and accurately annotate the WGS.

The genome of a particular organism is static in all cells, unlike the dynamic transcriptome, which is the transcription of the gene space into RNA molecules in a fashion responsive to a variety of factors, such as developmental stage, tissue, and external stimuli. RNA-seq, a high-throughput RNA sequencing method, has radically transformed the identification of transcripts and quantification of transcriptional levels (Flintoft, 2008;Wang et al., 2009). It is supported by a diverse set of computational methods for analyzing the resulting data (Chakraborty et al., 2015;Chang et al., 2015;Chu et al., 2013;Fu et al., 2012;Grabherr et al., 2011;Lohse et al., 2012;Mbandi et al., 2015;Schulz et al., 2012;Simpson et al., 2009;Trapnell et al., 2009;Trapnell et al., 2012;Wang et al., 2010;Zerbino & Birney, 2008).
Rapid advances in genome sequencing technologies have generated sequences for a deluge of organisms and species. The task of annotating these sequences has been addressed by several flows. These pipelines are categorized in http://omictools.com/ genome-annotation-category and http://genometools.org/ and reviewed in (Yandell & Ence, 2012). Here, we focus specifically on MAKER-P (Campbell et al., 2014;Holt & Yandell, 2011;Law et al., 2015;Neale et al., 2014), which was used to annotate the recently published walnut genome sequence (WGS) (Martínez-García et al., 2016).
In the current study, the YeATS suite (Chakraborty et al., 2015) was enhanced to include genome annotation capabilities using RNA-seq-derived transcriptomes (YeATS annotation module -YeATSAM). First, the Trinity-assembled transcriptome obtained from twenty different tissues was compared to the WGS, excluding transcripts emanating from extraneous sources. This step incidentally revealed both biodiversity and plant-microbe interactions in walnut tree(s) from Davis, California (Chakraborty et al., 2016a). The WGS-derived transcripts were split into three open reading frames (ORFs), which were subjected to BLAST analysis using a plant proteome database obtained from the Ensembl database (Kersey et al., 2016). Transcripts can contain more than one significant ORF and must be handled differently depending on whether they map to the same or a different protein. The resulting analysis provided the WGS annotation.
Both MAKER-P and YeATSAM failed to annotate several hundred proteins annotated by the other. Many of the proteins had repetitive sequences or domains that, although difficult to detect, do not represent critical proteins during annotation. An egg cellsecreted protein (Sprunck et al., 2012), a copper chaperone (Shin et al., 2012), and a clavata3/ESR-Related protein (Kinoshita et al., 2007) were among the proteins not detected through the YeATSAM flow. Some proteins undetected in the MAKER-P flow are more significant in the context of a plant genome: several photosynthesis-related proteins encoded by the chloroplast (Nelson & Yocum, 2006) and the large family of FAD-binding berberine bridge enzymes (BBE) involved in biosynthesis of antimicrobial benzophenanthridines (Cheney, 1963;Winkler et al., 2008). We posited possible reasons for such exclusions and recommend incorporating both flows for comprehensive enumeration of genes in the WGS.
As further validation, YeATSAM was applied to chickpea (Cicer arietinum L.), an important pulse crop with many nutritional and health benefits (Jukanti et al., 2012). The RNA-seq-derived transcriptome of chickpea has also been sequenced (Garg et al., 2011) and was processed through the YeATSAM pipeline to identify ~1000 proteins that are encoded by these transcripts, but are not annotated in the NCBI database, most of which were annotated using Gnomon (Souvorov et al., 2010).

Methods
The input to YeATSAM is a set of post-assembly transcripts (∅ TRS ) and the walnut genome sequence (WGS) (Figure 1).  (Garg et al., 2011). The dataset 'represents optimized de novo hybrid assembly of 454 and short-read sequence data.' About two million 454 reads were assembled using Newbler v2.3 followed by hybrid assembly with 53409 transcripts generated by optimized short-read data assembly using TGICL, as reported previously (Garg et al., 2011). The set of annotated proteins from chickpea was obtained from the NCBI database (chickpea.pep.fasta, N=34198).
PHYML v3.0 was used to generate phylogenetic trees from alignments (Guindon et al., 2005). Multiple sequence alignment was done using ClustalW (Larkin et al., 2007) and figures were generated using the ENDscript server 2.0 (Robert & Gouet, 2014). The source code written in Perl is provided as Dataset 1 (YeATSAM.tgz). A README is provided inside the top-level directory for installation and running the programs.

Merging ORFs: broken transcripts
There are two instances in which ORFs can be merged to create a longer amino acid sequence. The first scenario occurs when a particular transcript has multiple ORFs that match to the same protein with high significance, indicating that a sequencing or assembly error has broken a contiguous ORF (Chakraborty et al., 2015). In total, 5% of the present transcripts (5,000 of 106,000) had two or more ORFs matching with high significance to the same protein, exactly mirroring the 5% error rates seen in transcripts restricted to the transcriptome from the tissue at the heartwood/sapwood transition zone in black walnut (Chakraborty et al., 2015). While most of these transcripts have repetitive elements, there were other nonrepetitive sequences with this particular problem. C20727_G1_I1 is one example: it has two ORFS, ORF_15 and ORF_36, that match a DNA repair metallo-β-lactamase family protein (Accession number: XP007043420.1) with Evalues=9E-70 and 6E-96, respectively ( Figure 2a). The two ORFs were merged (inserting the sequence 'ZZZ', although the length of the missing fragment is not known) since the Evalue of the combined ORF reduces to 2E-175 and the merged sequence was chosen as representative for the transcript. ORFs are not merged when the combined ORF did not significantly decrease the Evalue and the longer ORF was selected to represent the transcript.
The other scenario occurs when the assembler fails to merge two transcripts into a single one. In this instance, two ORFs emanating from different transcripts with significant overlaps were merged. While the merging of two ORFs was described previously (Chakraborty et al., 2015), we introduced an additional filter to select mergeable ORFs based on whether the E-value obtained by merging the two ORFs is significantly reduced. For example, transcripts C53209_G8_I1 and C53209_G6_I1 both map to the scaffold SUPER472 and their corresponding ORFs can be merged based on the sequence string 'PNRSSLP' (Figure 2b). The individual ORFs and the combined ORFs align to an autophagy-related protein (TAIR ID: AT3G49590.2) with Evalues 2e-106, 8e-63, and 1e-180, respectively. The increased significance of the combined ORF, in addition to other checks, like ensuring that mapping is to the same scaffold, adds further support to the fact that these transcripts should have been contiguous in the final assembled transcriptome.

Transcripts with multiple ORFs
About 3% of transcripts have ORFs that map to different proteins. Some transcripts should not have been merged. C1089_G1_I1 is an interesting example: a 4574 nt transcript that maps to the chloroplast and encodes two genes. One is highly variable and the other is conserved. The two ORFS, ORF_64 (fwd: 1117-2631) and ORF_108 (fwd: 3195 -4271), map to maturase K (TAIR ID: ATCG00040.1) and photosystem II reaction center protein (TAIR ID: ATCG00020.1) with very high significance. Maturase K is a good candidate for barcoding angiosperms because it has highly variable coding sequences (Yu et al., 2011), while the photosystem II reaction center protein is completely conserved (100% similarity with Arabidopsis). Another example is C19241_G1_I1 (4702 nt), split into ORF_68 (fwd: 176-3487) and ORF_115 (reverse: 4509-4096) encoding a damaged DNA binding protein (TAIR ID: AT4G05420.1) and photosystem I subunit K (TAIR ID: AT1G30380.1) with high significance, respectively. These transcripts are split in the YeATSAM flow, resulting in one ORF per transcript. Subsequently, this artifact of the Trinity assembly led to several unannotated proteins in the MAKER-P flow. Proteins unannotated by MAKER-P MAKER-P fails to annotate many key photosystem-related proteins ( Table 1). The transcript C59245_G1_I1 has ORF_43 (fwd: 176-1714) and ORF_70 (fwd: 2212-2496) mapping to photosystem II reaction center protein B (PSBB) and photosystem II reaction center protein H (PSBH), respectively. While MAKER-P does annotate PSBB, it failed to detect PSBH. These proteins map to transcripts encoding two significant ORFs (>1E-10), indicating that failure to handle this might have excluded these proteins. Also, these proteins are encoded by the chloroplast. However, this limitation of MAKER-P is not confined to transcripts emanating from the chloroplast. For example, C48031_ G3_I1 encodes a leucine-rich repeat transmembrane protein kinase (AT5G48940.1) and a metallo-β-lactamase family protein (TAIR ID: AT4G33540.1) and is mapped to scaffold 'SUPER374'. MAKER-P failed to annotate the β-lactamase family protein.
Furthermore, MAKER-P failed to annotate any FAD-binding berberine bridge enzymes (BBE) in the WGS (Kutchan & Dittrich, 1995). These enigmatic enzymes are implicated in the transformation of (S)-reticuline to (S)-scoulerine during   (Table 2  and Table 3). JrBBE1 (C54052_G1_I1) maps to the scaffold JCF7180001213852 and encodes a 564 aa long ORF, which has significant matches to Uniprot:P30986. The closest match of Uniprot:P30986 (with a low significance of 1E-07) to the MAKER-P annotation is 'WALNUT 00019959-RA', a 476 aa long cytokinin dehydrogenase. The sequence alignment of JrBBE genes to Uniprot (P30986) is shown ( Figure 3a).
As with the walnut transcriptome, the chickpea transcriptome (transHybrid.fasta: n=34760) (Garg et al., 2011) was split into three ORFs, each of which was BLAST'ed to the subset of plant proteins created from the Ensembl database. Subsequently, the ORFs with significant homology to this database (n=29263) were BLAST'ed to the set of annotated chickpea proteins in the NCBI database (n=34198). Most of these annotations were done using Gnomon (Souvorov et al., 2010) (http://www.ncbi.nlm.nih.gov/ bioproject/PRJNA190909), which analyzed ~35000 transcripts. There are ~1500 proteins identified by YeATSAM that are absent in the NCBI database (Evalue cutoff 1E-10). Some of these proteins and their corresponding genes in the TAIR database are shown (Table 4). TC00902 is an interesting example with two merged genes: a hydrogen ion-transporting ATP synthase (TAIR ID: ATMG00640.1) and a cytochrome C biogenesis (TAIR ID: ATMG00900.1). While Gnomon identified the cytochrome C biogenesis protein (Genbank: XP_004500083.1), it failed to identify the ATP synthase. Unlike MAKER-P, Gnomon generates transcripts through predictive algorithms and does not take the transcriptome as an input. Notwithstanding, these chickpea genes remain unannotated despite the presence of a straightforward method to detect them from available transcripts.

Future work
Among the ~700 genes not detected by YeATSAM, there are ~500 genes with no matches in the complete 'nr' database. Of these, ~300 have no transcripts (SetA), while the remaining ~200 have matches among the transcripts (SetB). Considering the sensitivity of RNA-seq and the wide coverage of twenty tissues, it is a definite possibility that SetA are pseudogenes. Future work in YeATSAM will focus on methods to distinguish these two classes of genes.

Conclusions
The availability of a RNA-seq-derived transcriptome from a newly sequenced organism like walnut, for which there are related annotated genomes (Arabidopsis, Vitis, etc), immensely simplifies annotation of the genome and influences the choice of annotation software. Here, we introduce a new annotation method in the YeATS suite (YeATS Annotation Module -YeATSAM), which was used to annotate the newly-sequenced walnut genome using a simple workstation. The key differentiating factor in YeATSAM is the splitting of the assembled transcriptome into multiple ORFs (Chakraborty et al., 2015). Transcripts often have more than one significant ORF that must be handled differently depending on whether they map to the same or different proteins. We show that YeATSAM failed to annotate ~700 genes identified by MAKER-P, while identifying ~4000 genes missed by MAKER-P. While most of these genes have repetitive stretches, both methods missed vital genes identified by the other. Since many of the additional genes identified by MAKER-P have no known transcripts, we posit that these were identified using ab initio methods. In the absence of such an ab initio module in YeATSAM, we propose a combined method using both MAKER-P and YeATSAM to annotate the WGS. YeATSAM was also applied to the chickpea transcriptome and identified ~1000 proteins that are not annotated in the NCBI database. This transcriptome was assembled using Newbler v2.3 (Garg et al., 2011) and most of the 34198 chickpea proteins in the NCBI database were annotated using Gnomon, the standard annotation tool (http://www.ncbi.nlm. nih.gov/genome/guide/gnomon.shtml).  Author contributions AMD and SC were involved in the study design. SC developed the software and designed pipeline that enabled the annotations, PJM-G was involved in the validation with the walnut genome sequence. SC wrote the first draft and the rest of the authors were involved in subsequent editing and modifications.

Competing interests
No competing interests were disclosed.

Grant information
AMD wishes to acknowledge grant support from the California Walnut Board.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
This work focuses on a current major challenge in improving genome and transcriptome automated annotation. It also deals with difficulties derived from imperfect de novo assemblies, such as transcripts representing fused and split genes. The increasing affordability to generate sequencing data enhances the demand for more powerful annotation predicting tools and pipelines, although exact annotations will still remain for wet-lab experimentation. This paper compares the YeATSAM tool to previously annotated genomes, in which existing de novo assemblies are used as generated and analyzed with blast, interproscan or similar tools for homology-based annotation. Even though the paper indicates novelty of the method, there are critical points that need modification.
The method reported here -YeATSAM -is not clearly different from the work already reported in a previous paper . This method reported here (identify 3 longest ORFs, then blast to known proteins, then merge or split if needed) looks identical to previously published in F1000 Research -for instance, Figure  1 in the previously published paper is an almost identical replica of the Figure 1 in this paper. The current work does appropriately cite this previous paper. However, if there is a novel algorithm to describe here, it needs to be clearly delineated from this previous work. Otherwise, it should just be cited.
This previous publication also compares the annotation of the walnut genome by YeATS and Maker-P.
The previous paper and this paper both profile walnut transcripts where ORFs were merged and transcripts that match multiple proteins; this paper does use different transcripts to demonstrate the methodology and results. To emphasize the novelty of the present paper, the authors should clarify exactly what this paper offers in addition to the previous paper. In this regard, the paper does go a bit further than the previous one by detailing genes that were unannotated by MAKER-P but found via this method; those genes were not reported previously. If the algorithm has not changed from the previous work, a new focus for this paper is needed, possibly reporting these novel genes such as the BBEs.
The addition of the chickpea genome annotation is barely described -a single short results paragraph. The author also has an existing F1000 research article describing the use of YeATS on chickpea transcripts and describing the detection of missed genes and describing multiple ORFs mapping to different proteins and fragmented ORFs of the same protein . How does this report differ from that one? That one is not cited in this report Data reproducibility and accessibility -the new annotations are not made available for either walnut or chickpea (unless they are the same as the ones provided already in Chakraborty et al. 2015 ). It would be very difficult to replicate this experiment. No parameters or commands are provided to determine how PHYML, ClustalW or ENDscript server were utilized. I confirmed that YeATSAM.zip (listed as YeATSAM.tgz in manuscript) with README is available for download and the links to data are functional. I was unable to install YeATSAM; the installation and usage instructions are very vague.

Specifics:
The joined results of MAKER-P and YeATSAM look promising for improving genome annotations. However, a figure or table describing the total number of genes predicted by each software and the overlap would be very helpful to visualize the results. The report commonly has words like "several" or "many" and the usage of "~" in front of numbers. Numbers should be reported exactly where they are important to the method and results. Examples: "A BLAST database of protein peptides (plantpep.fasta: 1M seqeunces) using ~30 organisms (list.plants)" -also list.plants does not link to anything.
"About 3% of transcripts have ORFs that map to different proteins" Heading "Transcripts with multiple ORFs" -the section above also deals with transcripts with multiple ORFs. This heading could be clarified.
Some revision on the writing would improve readability. Abbreviations are recommended to be properly specified at first use in the manuscript and always in figures. Also, numbers and units should be spaced. In relation to the language, the authors are advised to review the use of scientific English, as well as verb tense consistency.
In Table 1, the main line indicates proteins not annotated by either program while the last line indicates listing of genes predicted by MAKER-P. These two sentences in the same caption lead to confusion. In addition, sizing and description of other figures might be improved. PubMed Abstract Publisher Full Text 2. Chakraborty S: RNA-seq assembler artifacts can bias expression counts and differential expression analysis -case study on the chickpea transcriptome emphasizes importance of freely accessible data for reproducibility.

F1000Research 5 Publisher Full Text
We have read this submission. We believe that we have an appropriate level of expertise to state that we do not consider it to be of an acceptable scientific standard, for reasons outlined above.
No competing interests were disclosed. Competing Interests: In this paper the authors investigate a new annotation method in the YeATS suite (YeATS Annotation Module -YeATSAM), which was used to annotate the newly-sequenced walnut genome using a simple workstation. In YeATSAM the assembled transcriptome is splitting into multiple ORFs. They show that YeATSAM failed to annotate ~700 genes identified by MAKER-P, while identifying ~4000 genes missed by MAKER-P. While most of these genes have repetitive stretches, both methods missed important genes identified by the other. Since many of the additional genes identified by MAKER-P have no known transcripts, the authors suggest that these were identified using ab initio methods. In the absence of such an ab initio module in YeATSAM, they propose a combined method using both MAKER-P and YeATSAM to annotate the WGS. This work is very interesting because the results probe the adequacy of this new annotation method. In general, the presentation is clear and the conclusions are adjusted to the results obtained. The figures and tables are also clear. Some comments are listed below: In the abstract, please change the order in "Results and Conclusions" part, from lines 17 to 21. In the abstract, please change the order in "Results and Conclusions" part, from lines 17 to 21. Consider to mention first "YeATSAM used a […] chickpea transcriptome assembled using Newbler v2.3" and then that "1000 genes were identified, which were not previously annotated by Gnomon annotation tool".
Fourth and fifth paragraphs of Introduction could be changed to the discussion and in the introduction leave some short sentences about this.
En fifth line of Methods section correct "seqeunces".
Please consider to explain further section "future work".
We have read this submission. We believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
No competing interests were disclosed. Competing Interests: