Translate gene sequence into gene ontology terms based on statistical machine translation [ version 1 ; referees : 1 approved with reservations , 2 not approved ]

This paper presents a novel method to predict the functions of amino acid sequences, based on statistical machine translation programs. To build the translation model, we use the “parallel corpus” concept. For instance, an English sentence “I love apples” and its corresponding French sentence “j’adore les pommes” are examples of a parallel corpus. Here we regard an amino acid sequence like “MTMDKSELVQKA” as one language, and treat its functional description, such as “0005737 0006605 0019904” (Gene Ontology terms), as a sentence of another language. We select amino acid sequences and their corresponding functional descriptions in Gene Ontology terms to build the parallel corpus. Then we use a phrase-based translation model to build the “amino acid sequence” to “protein function” translation model. The Bilingual Evaluation Understudy (BLEU) score, an algorithm for measuring the quality of machine-translated text, of the proposed method reaches about 0.6 when neglecting the order of Gene Ontology words. Although its functional prediction performance is still not as accurate as search-based methods, it was able to give the function of amino acid sequences directly and was more efficient.

Corresponding author: Wang Liang (wangliang.f@gmail.com)
How to cite this article: Liang W and Kai Yong Z. Translate gene sequence into gene ontology terms based on statistical machine translation [version 1; referees: 1 approved with reservations, 2 not approved]. F1000Research 2013, 2:231 (doi: 10.12688/f1000research.2-231.v1)
Copyright: © 2013 Liang W and Kai Yong Z. This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Data associated with the article are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).
Grant information: The author(s) declared that no grants were involved in supporting this work.
Competing interests: No competing interests were disclosed.
First published: 01 Nov 2013, 2:231 (doi: 10.12688/f1000research.2-231.v1)


Introduction
Determining the functions of proteins is a central problem in biology. There are many databases, such as RefSeq (http://www.ncbi.nlm.nih.gov/refseq/), that store amino acid sequences and their corresponding functions. However, almost all protein functional prediction methods rely on the identification, characterization, or quantification of sequence similarities between the protein of interest and already annotated proteins 1. Even sequences that are similar do not necessarily have identical functions. A sequence may be similar to many other sequences, so it can be difficult to choose the most appropriate one. In addition, there is no way to deduce function if there are no similar sequences in any available database.
At present, many machine learning methods are being applied to the problems mentioned above. Two examples are Support Vector Machines (SVM), a supervised learning model mainly used for classification and regression analysis, and network-based methods such as protein interactome networks 2,3. In addition, sequence features, structures, the evolution of amino acid sequences, and other characteristics are also used for functional prediction 4,5.
The same sequence may have different meanings in different contexts. Different sequences may also represent the same meaning. These problems are normally called "lexically/structurally ambiguous" in the language translation field. Current machine translation research could effectively avoid these problems by building a translation model between two 'languages'.
Since many amino acid sequences and their corresponding functional descriptions are already publicly available, it is possible to model the translation relationship between them with mature machine translation technology. We address this task in this paper: when presented with a new amino acid sequence, we aim to provide a function-based translation.
To build the model, we need to construct the "parallel corpus" of amino acid sequences and their functional descriptions. The term "parallel corpus" is typically used in linguistic areas to refer to texts that are translations of each other. A parallel corpus contains the data for two languages.
For example, a parallel corpus for English to French translation:

English sentence: I love apples
French sentence: J'adore les pommes

Here, we selected amino acid sequences and their functional descriptions in Gene Ontology form to build the corpus 6. We use the Gene Ontology because it gives unified descriptions of protein functions across all species. The Gene Ontology data can be found on the website www.geneontology.org (http://www.geneontology.org/GO.downloads.annotations.shtml). The corresponding amino acid sequence data can be downloaded from the website www.uniprot.org (http://www.uniprot.org/downloads), which is a central repository of amino acid sequences. We use the database ID to identify a sequence and find its counterparts in the two data sources above.
An example of a parallel corpus for "amino acid sequence" to "function" translation:
Amino acid sequence (amino acid):

MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVV
Function description (Gene ontology term ID):

0006605 0019904 0035308 0042470 0042826 0043085
The basic idea of machine translation is very simple: it tries to infer "rules" from the parallel corpus and build a translation model. The "rules" are mainly word or phrase correspondences between the two languages. Similar sequences will receive similar translations. Although applying translation technology to protein functional prediction seems reasonably simple, as far as we know, no existing program applies machine translation to protein sequence function analysis.

Parallel corpus
Here we build three groups of gene parallel corpora. Group 1 only contains human amino acid data. It has 45,538 amino acid sequences and related Gene Ontology descriptions. Group 2 is a mixed dataset of several model species. It contains about 100,000 amino acid sequences. Group 3 is the largest dataset. It contains more than 2,000,000 amino acid sequences. All of the corpora can be downloaded from www.geneontology.org (Gene Ontology data) and www.uniprot.org (amino acid sequence data). We also provide these Gene Ontology and amino acid sequence data in the 'data' directory of the Supplementary materials.
First, we need to "clean" the corpus. Data cleaning is the process of detecting and correcting (or removing) corrupted or inaccurate records from a dataset. For machine translation, sequences with significant differences in length between the sources and targets should be removed. The empty and overlong sequences should also be deleted. For amino acid data, we deleted long sequences (amino acid sequence length>1000 (amino acids), Gene ontology terms number >100) and empty sequences.
After data cleaning, about 14,000 of the 45,538 human amino acid sequence records remained. Groups 2 and 3 were likewise reduced to about 1/4 of their original size.
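The cleaning rules described above can be sketched as a simple filter. This is a minimal illustration, not the authors' script: the function name `clean_corpus` and the in-memory pair representation (a sequence string plus a list of GO term IDs) are assumptions, while the two thresholds are the ones stated in the text.

```python
MAX_SEQ_LEN = 1000   # maximum amino acid sequence length stated in the text
MAX_GO_TERMS = 100   # maximum number of GO terms per sequence stated in the text


def clean_corpus(pairs):
    """Drop empty or overlong records (hypothetical helper, for illustration).

    `pairs` is an iterable of (sequence, go_terms) tuples, where `sequence`
    is a string of amino acid letters and `go_terms` is a list of GO IDs.
    """
    kept = []
    for sequence, go_terms in pairs:
        sequence = sequence.strip()
        # Remove records where either side of the pair is empty.
        if not sequence or not go_terms:
            continue
        # Remove records that exceed either length limit.
        if len(sequence) > MAX_SEQ_LEN or len(go_terms) > MAX_GO_TERMS:
            continue
        kept.append((sequence, go_terms))
    return kept
```

Only records that pass every check are kept, which is consistent with the large reduction in corpus size reported above.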
Amino acid sequences are not naturally segmented, so we still need to "segment" them.
Text segmentation means dividing a string of written language into its component words. In English and many other languages that use some form of the Latin alphabet, the space is a good approximation of a word delimiter. But in some East Asian languages, such as Chinese, there are no natural delimiters. To process these languages, we first need to segment sentences into sequences of "words". For example, a Chinese sentence:

我爱苹果(Iloveapple)
We need its word sequence form:

我 爱 苹果(I love apple)
Segmentation methods developed for East Asian languages can be applied to amino acid sequences. Here we segment (tokenise) amino acid sequences with an unsupervised segmentation method 9. Its main idea is to estimate the probabilities of all possible words by Expectation Maximization (EM), an iterative method for finding maximum likelihood estimates of parameters in statistical models 10, or by other statistical methods such as n-gram language models 11. Then, for a given sequence, we select the segmentation with the maximum probability.
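The final step, choosing the segmentation with maximum probability once word probabilities are known, can be sketched with dynamic programming. This is an illustration under stated assumptions rather than the authors' implementation: it assumes words are independent (a unigram model), the function name `segment` is hypothetical, and the word probabilities in the usage note are toy values, not estimates from real data.

```python
import math


def segment(sequence, word_prob, max_word_len):
    """Return the most probable segmentation of `sequence` into known words.

    `word_prob` maps candidate words to probabilities; words are assumed
    to be independent, so a segmentation's score is the sum of log
    probabilities of its words.
    """
    n = len(sequence)
    neg_inf = float("-inf")
    best = [neg_inf] * (n + 1)  # best[i]: best log-probability of sequence[:i]
    back = [0] * (n + 1)        # back[i]: start index of the last word ending at i
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            prob = word_prob.get(sequence[start:end])
            if prob is None or best[start] == neg_inf:
                continue
            score = best[start] + math.log(prob)
            if score > best[end]:
                best[end] = score
                back[end] = start
    if best[n] == neg_inf:
        raise ValueError("sequence cannot be segmented with this dictionary")
    # Walk the back-pointers from the end to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(sequence[start:end])
        end = start
    return words[::-1]
```

With the toy dictionary `{"M": 0.1, "T": 0.1, "D": 0.1, "K": 0.1, "MT": 0.2, "MDK": 0.3}` and a maximal word length of 3, `segment("MTMDK", ...)` returns `["MT", "MDK"]`. Lowering the maximal word length to 2 excludes "MDK" and gives `["MT", "M", "D", "K"]`, which shows how strongly the result depends on the maximal word length.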
An example of a parallel corpus for "segmented amino acid sequence" to "function" translation:
Amino acid sequence (amino acid, segmented):

0006605 0019904 0035308 0042470 0042826 0043085
After segmenting, we also delete the obviously mis-aligned sequences from the corpus, that is, pairs whose 'word' number ratio is greater than 9. In the example above, there are 10 amino acid words and 7 Gene Ontology words, so the word number ratio is 10/7.
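The word number ratio filter can be sketched as follows. The text gives the threshold (9) and one example (10/7) but does not say whether the ratio is also checked in the other direction, so this sketch assumes a symmetric check; the function name `is_misaligned` is hypothetical.

```python
def is_misaligned(source_words, target_words, max_ratio=9.0):
    """Flag a pair whose word counts differ by more than `max_ratio`.

    Assumption: the ratio is taken as larger count over smaller count,
    so the check is symmetric in source and target.
    """
    n_source, n_target = len(source_words), len(target_words)
    if n_source == 0 or n_target == 0:
        return True  # an empty side cannot be aligned at all
    ratio = max(n_source, n_target) / min(n_source, n_target)
    return ratio > max_ratio
```

For the example in the text (10 amino acid words, 7 Gene Ontology words) the ratio is about 1.43, so the pair is kept.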

Translation model
There are many open source translation systems. Normally, the Moses system is treated as the baseline system 12. Moses is a statistical machine translation system that allows you to automatically train translation models for any language pair. All you need is a collection of translated texts (parallel corpus). Once you have a trained model, an efficient search algorithm quickly finds the highest probability translation among the exponential number of choices. Here we also use the Moses system to build the translation model from amino acid sequences to Gene Ontology terms. Moses offers two types of translation models: phrase-based and tree-based. We selected the phrase-based model.
We used 95% of the data to train the model, and the remaining 5% of the data for testing.
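A minimal sketch of the 95%/5% split, together with writing each part as two line-aligned text files (one sentence per line, with line i of one file corresponding to line i of the other), which is the usual layout for a parallel corpus. The text does not say whether the split was random, so the shuffle, the seed, the function names and the file extensions `aa` and `go` are all assumptions made for illustration.

```python
import random


def split_corpus(pairs, test_fraction=0.05, seed=0):
    """Shuffle the pairs and return (train, test) lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def write_parallel(pairs, prefix, source_ext="aa", target_ext="go"):
    """Write line-aligned source and target files, one pair per line.

    Each pair is (source_words, target_words), both lists of strings.
    """
    with open(f"{prefix}.{source_ext}", "w", encoding="utf-8") as source_file, \
         open(f"{prefix}.{target_ext}", "w", encoding="utf-8") as target_file:
        for source_words, target_words in pairs:
            source_file.write(" ".join(source_words) + "\n")
            target_file.write(" ".join(target_words) + "\n")
```

Fixing the seed keeps the split reproducible, so the translation model and any baseline can be evaluated on exactly the same test sequences.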
The input for Moses is the parallel corpus. All parameters are set to their default values (for more details, see the "data" and "code" sections of the Supplementary materials). The language model for the Gene Ontology sequence is a 2-gram model (in computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text; for more information, see Xianping Ge, Wanda Pratt, and Padhraic Smyth 10). With this setup we obtain an amino acid sequence to Gene Ontology translation model. The detailed process is provided in the 'code' directory of the Supplementary materials. Its main steps are described in Figure 1.
We used the Bilingual Evaluation Understudy (BLEU) score to judge the performance of the translation system 13. BLEU is an algorithm for evaluating the quality of text that has been machine-translated from one natural language to another. The main idea of the BLEU score can be described by formula 1:

$$\mathrm{BLEU} = \frac{m}{w_t} \qquad (1)$$

In this formula, $m$ is the number of words from the candidate translation that are found in the reference translation, and $w_t$ is the total number of words in the candidate translation.
For example, a candidate translation is "I air apple". The reference translation is "I love apple". There are 3 words in the candidate translation. Two words, "I" and "apple", appear in the reference translation. So the BLEU score of this translation is 2/3=0.67.
The improved version of BLEU normally considers the n-grams match. Since we don't care about the order of Gene ontology words, we only needed to use the unigram form of BLEU.
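The unigram score described above can be sketched in a few lines. This is an illustration, not the scoring script used in the paper: the function name is hypothetical, and the count clipping (a candidate word is credited at most as many times as it occurs in the reference) follows the standard definition of BLEU's modified precision, which the simplified formula in the text does not spell out. No brevity penalty is applied.

```python
from collections import Counter


def unigram_precision(candidate, reference):
    """Fraction of candidate words that also occur in the reference.

    Both arguments are whitespace-separated strings. Word order is ignored.
    """
    candidate_words = candidate.split()
    if not candidate_words:
        return 0.0
    reference_counts = Counter(reference.split())
    candidate_counts = Counter(candidate_words)
    # Clip each candidate word's count by its count in the reference.
    matched = sum(min(count, reference_counts[word])
                  for word, count in candidate_counts.items())
    return matched / len(candidate_words)
```

For the example in the text, `unigram_precision("I air apple", "I love apple")` gives 2/3, and reordering the Gene Ontology terms in a candidate does not change its score.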
Because different amino acid sequence segmentation methods will produce completely different "word" sequences, and the segmentation methods mainly rely on the selection of the maximal word length, we tried different maximal word lengths. The relation between maximal word length and BLEU score is shown in Figure 2. Figure 2 shows that a maximal word length of 7 or 8 letters is a good choice for amino acid translation. In theory, we need a corpus of $20^n$ letters to train a segmentation model with maximal word length n. For n=8, we need about 26 Gigabytes of amino acid sequence data. For n=9, we need 512 Gigabytes of data. At present there are only about 20 Gigabytes of annotated amino acid sequence data in the Gene Ontology databases (http://www.geneontology.org/GO.downloads.annotations.shtml). So a practical maximal word length is 7.
If we don't segment the amino acid sequence and instead treat every amino acid letter as a "word", the 20+ amino acid letters could only express 20+ functions. But from Figure 2, we deduce that we could still get a translation model given a sufficiently large corpus. This is mainly because the training system can combine consecutive letters into "phrases" during the alignment process, and such "phrases" can represent more things.
Figure 2 also shows that increasing the size of the corpus clearly improves translation performance. So far, we have tried a data set of 5,000,000 amino acid sequences; its BLEU score reaches 0.6. We also provide a method to build a corpus of 40,000,000 amino acid sequences in the 'get_gene_data' directory of the Supplementary materials.
For a usable translation system, the unigram BLEU score should be more than 0.5. Our amino acid sequence to Gene Ontology translation model has reached this level, so we can compare it with the conventional search-based method.
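The corpus size figures above follow from counting the possible words over a 20-letter alphabet. A small check, assuming one byte per amino acid letter and 1 Gigabyte = 10^9 bytes (both assumptions; the function names are hypothetical):

```python
def letters_needed(max_word_len, alphabet_size=20):
    """Number of distinct words of exactly `max_word_len` letters."""
    return alphabet_size ** max_word_len


def gigabytes_needed(max_word_len, alphabet_size=20):
    """Corpus size in Gigabytes, assuming one byte per letter."""
    return letters_needed(max_word_len, alphabet_size) / 1e9
```

This gives 1.28 Gigabytes for n=7, 25.6 Gigabytes for n=8 and 512 Gigabytes for n=9, matching the figures quoted in the text and showing why n=7 is the largest length that fits within about 20 Gigabytes of annotated data.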

Comparing with search-based method
The sections above show that the "amino acid sequence" to "protein function" translation is workable. Next, we compared it with current protein functional prediction methods.
We designed the comparison as follows:
(1) Dataset: we divided the parallel corpus into a 95% training corpus and a 5% test corpus.
(2) For the translation-based method: we trained an "amino acid sequence" to "Gene Ontology" translation model on the training corpus and then predicted the functions of each test amino acid sequence with this translation model.
(3) For the search-based method: we indexed the amino acid sequences in the training corpus with BLAST to build a database of training sequences. To predict the function of a test sequence, we searched for it in this database. The Gene Ontology description of the best-matching sequence was taken as the function of the test sequence.
(4) Comparing method: we compared the predicted functions of the two methods by the BLEU score. The results are shown in Table 1.
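Step (3) amounts to transferring annotations from the best hit. The sketch below illustrates that idea under stated assumptions and is not the authors' script: it assumes BLAST results in the default 12-column tabular format (query ID in column 1, subject ID in column 2, bit score in column 12), picks the hit with the highest bit score, and applies no e-value threshold because the text does not state one. The function names are hypothetical.

```python
def best_hits(blast_tabular_lines):
    """Map each query ID to the subject ID with the highest bit score."""
    best = {}
    for line in blast_tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        bit_score = float(fields[11])
        if query not in best or bit_score > best[query][1]:
            best[query] = (subject, bit_score)
    return {query: subject for query, (subject, _) in best.items()}


def transfer_annotations(hits, training_go_terms):
    """Give each query the GO terms of its best-matching training sequence."""
    return {query: training_go_terms[subject]
            for query, subject in hits.items()
            if subject in training_go_terms}
```

Queries without any hit are simply absent from the result, which corresponds to the test sequences for which the search-based method gives no prediction.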
From Table 1, we found that the functional prediction performance of the translation method was still not as good as BLAST, but there were about 1%-5% of sequences whose functions the BLAST method could not predict. On the other hand, the translation method could predict the functions of all amino acid sequences. Here, we mainly want to show the feasibility of the new method. More datasets and translation methods should be tested in the future.

Summary
Based on statistical machine translation technology, we present a novel method to predict the function of amino acid sequences. Although its performance remains to be improved, it shows that the application of machine translation in protein functional prediction is promising. For statistical translation methods, the larger the corpus, the better the results. Google's translation system has proven that vast amounts of corpus with a few rules, or even without any rules, can produce excellent translation results 14. Since the protein functional prediction problem is so close to 'language translation', translation based protein functional prediction is likely to be the most promising approach.

Author contributions
Wang Liang designed the experiments, wrote the code and trained the translation model. Zhao Kaiyong prepared the experimental data and provided biological knowledge support.

Competing interests
No competing interests were disclosed.

Grant information
The author(s) declared that no grants were involved in supporting this work.

Supplementary materials
There are 5 directories in the Supplementary materials.
Directory list:
1. "data": parallel corpus of Gene Ontology sequences and corresponding amino acid sequences.
2. "code": code and directions for building the translation model.
3. "get_gene_data": directions for downloading data from geneontology.org (Gene Ontology data) and uniprot.org (amino acid sequence data).
4. "pre_process_gene_data": converts the original Gene Ontology data and amino acid sequence data to the format required by the Moses system.
5. "build_dict": directions for building the amino acid dictionary.

Because we don't consider the order of GO terms, 47.4 is just its BLEU score (unigram). Here we didn't run the tuning process. You could divide the test corpus into a test part and a tuning part, and then run the tuning process. It could bring some improvement in the BLEU score, but it may take several days.

Directory "/get_gene_data", get gene data
If you want to build your own corpus from the original database sources (www.geneontology.org for Gene Ontology data, www.uniprot.org for amino acid sequence data), see this section:

Get gene data
Here we show how to get the corpus from the original data sources. You need to install gunzip. Using human data:

Directory "build_dict", build amino acid dictionary
Moreover, if you want to build your own "protein.dict" for a different maximal word length, we also give an example:
Step 1, install SRILM. A simple method to build the dictionary is to use a language model. Here we use SRILM, which you should install first (http://www.speech.sri.com/projects/srilm/). Then cp the executable file "ngram-count", normally found in your install dir, to this directory. There is already an "ngram-count" in this directory, but it only runs on a specific Linux version, so just overwrite it.
Step 2, you should make 2 files: (1) In ./get_gene_word: run make. (2) In ./get_gene_word_prob: set the SRILM install path and MACHINE_TYPE in the Makefile, then run make.

Juliana Bernardes
Programa de Engenharia de Sistemas e Computação, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
The paper presents a new method to predict protein function based on machine translation concepts. In this scenario, protein sequences and their gene ontology terms are interpreted as words in different languages. The idea is to find the best translation for a protein sequence, that is, its most likely gene ontology terms. The method has two phases: training and inference. During training, protein sequences are broken into several segments and then these segments are related to gene ontology terms. This mapping between protein segments and gene ontology terms constitutes the parallel corpus and it is learned during the training phase. Next, the inference is carried out by a statistical machine translation system that translates a new protein sequence into gene ontology terms by using the collection of translated pairs (parallel corpus) learned previously.
My main issue with this paper is the way sequences are broken into segments. It is well known that some segments, such as motifs and domains, are essential to predict protein functions. The authors must explore them to enrich their model.
A second point concerns the results: they are disappointing. The method did not outperform BLAST, which is one of the simpler existing methods for protein function prediction. The comparison was not properly done: what are the false positive rates for both methods? The authors must compare the methods using classical measures such as precision, recall and F-measure, and a ROC curve could be plotted. Moreover, the authors did not compare their approach with state-of-the-art methods such as SVMs and HMMs. The discussion of results is really poor; the authors did not defend their method.
The training and testing strategy is unusual: 95% of the data to train and just 5% to test. Have the authors performed a cross validation? If yes, how many folds? How many times was it repeated? It is not clear in the text.
The paper is not well written and there are several incomprehensible parts. Just to cite some of them:
○ "Then we use a phrase-based translation model to build the "amino acid sequence" to "protein function" translation model." (I think it is "to map" instead of "to build".)
○ "Here we build three groups of gene parallel corpora. Group 1 only contains human amino acid data." (The expression "human amino acid data" is so unusual in the bioinformatics community.)
○ The authors wrote in the conclusion "Since the protein functional prediction problem is so close to 'language translation', translation based protein functional prediction is likely to be the most promising approach." I do not agree that protein functional prediction is close to the language translation problem. There are a lot of additional points, for example evolution. In my opinion, we cannot interpret proteins as words. Anyway, the results have not shown that it is the most promising approach.
○ What about sequence words that should correspond to multiple GO terms?
○ Is the sequence of words biologically important? The "Best BLAST" method that the authors compare with (NB: I wouldn't call that approach "search-based" as that name is extremely underspecified; rather I would suggest "sequence similarity-based") uses local sequence similarity; some regions of the sequence will not be considered.
○ It certainly isn't obvious to me that linear order should matter at all on the output side; it appears that the GO sequences are listed in numeric order by ID and a "2-gram language model" over GO sequences should not carry any more information than a 1-gram model would, as the IDs themselves are arbitrary. This representation is not well-justified.

Segmentation:
The authors use an EM-based approach to drive segmentation, which essentially captures the statistical properties of particular subsequences. There should be some relationship to work on motif finding here; the authors should identify some relevant work on that topic (e.g., Tompa et al. 1, although there are likely many more reviews available; as this is not my area specifically I suggest the authors do their own search). The authors should also compare to a far simpler segmentation approach: just splitting the sequence into words of fixed length. This is a character-level n-gram representation. Perhaps all the work of running EM (and the associated large number of data points) is completely unnecessary. (The authors would of course need to try a few different lengths for this to be meaningful.) Essentially, the strategy needs to be justified, and the impact on the broader function prediction task must be quantified. What is the relationship between the maximal word length and the number of output terms included in the output GO sequence? Is there a length bias on the output side that is also reflected in Figure 2? Again, this hinges on the appropriateness of characterising this task in terms of the translation model.
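The simpler baseline suggested here, splitting a sequence into words of fixed length, can be sketched as follows. The function name is hypothetical, and the sketch assumes non-overlapping words with a shorter final word when the sequence length is not a multiple of the word length; an overlapping variant would be an equally valid reading.

```python
def split_fixed(sequence, word_len):
    """Split `sequence` into consecutive, non-overlapping words of `word_len` letters."""
    if word_len < 1:
        raise ValueError("word_len must be at least 1")
    return [sequence[i:i + word_len] for i in range(0, len(sequence), word_len)]
```

For example, `split_fixed("MTMDKSELVQKA", 5)` returns `["MTMDK", "SELVQ", "KA"]`. Trying several word lengths, as suggested above, only requires changing `word_len`.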

Data:
The authors do not clearly state whether they consider all GO annotations available for a protein, or if non-experimental evidence codes are excluded. If "Inferred by Electronic Annotation" codes are included, this means that "Best BLAST" has likely already been used to produce the annotations, which would explain at least to some extent why the performance of that method shows such strong results.

Experiment details:
The authors do not specify what e-value threshold they assume for BLAST.

Filtering:
The authors incorporate various filters along the way, and it is unclear how strongly this biases the training/test data that they utilise. The authors delete long sequences and proteins with many associated GO terms. The cleaning reduces the data set substantially, but the rationale for doing this cleaning appears to have more to do with the limitations of the approach that the authors have adopted than because of some biologically valid reason. The exclusion of such a large number of {sequence,GO annotations} pairs will certainly undermine the findings. This also applies to the deletion of the "obviously mis-aligned" sequences. How many instances does this filter? Why are these obvious mis-alignments? The number of GO terms associated with a sequence has more to do with (a) annotation bias [some sequences are far more heavily studied/annotated than others] and (b) overall progress on curating protein function information (see Baumgartner Jr. et al. 2 ).
Evaluation:
Recently, the CAFA challenge explored protein function prediction from sequence 3, and in association with that there has been detailed consideration of the evaluation of this task 4,5, the key point being that the structure of the GO should be taken into consideration in the evaluation. The authors present their results in terms of BLEU, but in their use of the unigram form, I think that BLEU simply collapses to Precision. This, then, ignores Recall completely. The evaluation presented should certainly measure Recall, and the authors should provide results that are comparable to the results of systems that participated in CAFA. They should then compare their results to the state-of-the-art methods.
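The difference between Precision and Recall on GO term sets can be illustrated with a short sketch. This is a flat, set-based illustration only: it treats GO terms as unrelated labels and therefore does not take the structure of the GO into account, which is the further step asked for above. The function name is hypothetical.

```python
def precision_recall_f1(predicted_terms, reference_terms):
    """Set-based precision, recall and F1 for one protein's GO terms."""
    predicted, reference = set(predicted_terms), set(reference_terms)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

A prediction consisting of a single correct term out of six reference terms has a precision of 1.0 but a recall of only 1/6, which is the kind of case that a precision-only score cannot distinguish from a complete prediction.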

Summary
In summary, the applicability of the method to this task should be more strongly justified and the experimental set-up (including the extensive filtering of data points and the evaluation metrics) should be more carefully considered. The authors' conclusions have not been adequately justified on the basis of their results.

a method inspired by automatic translation of human language. The idea is to break the sequence into segments, and each segment is assigned a GO term. During training the association between segments and GO terms is learned, and then this learned model is applied to new sequences. In a benchmark against BLAST performance it is much worse at identifying the GO terms associated with a protein.
The defence offered that the work shows the feasibility of the approach falls a bit flat for me. It is feasible in the sense that "it can be attempted". I do not find the results encouraging. I disagree with the claim that protein function prediction is "so close" to language translation, except in some rarefied technical sense. Indeed, using translation seems to be obscuring the actual task, which (in this context) is to find sequences that are similar and transfer GO annotations from one to the other. At the heart of it, that is all that is happening here.
I know I am not supposed to address novelty, but I have to point out that many feature extraction methods have been proposed for sequence analysis (for example, string kernels combined with SVMs 1 ). Many methods for sequence scoring are based on methods also used for analyzing human language (e.g. HMMs 2 ), but without the baggage of "translation". The authors do not make these connections in the introduction.
In addition: The method is not described in enough detail, especially for bioinformatics researchers who may not be familiar with the translation methods (including myself). The provision of source code, while laudable, is insufficient. For example, for the segmentation method, the reader is directed to a non-peer reviewed manuscript on DNA segmentation. Why does each segment get at most one GO term? If that is necessary, it seems a big problem, since GO does not work that way. BLAST also does not have the limitations that are exposed by the requirements for cleaning described here. What makes a protein sequence "corrupted" or "inaccurate"? I do not get what the authors mean by "obviously misaligned" in comparing GO terms to protein sequences; it makes little sense and the strained relationship of the task to translation is exposed. The authors also seem to be requiring exact matches; this is also not very biological. It may be that these are fundamental limitations of the approach; if not, they must be addressed. (I did not take the time to review the source code).
The results are not described in enough detail and the evaluation is weak. By optimizing the word length on the same corpus used for testing, the authors have probably inflated the performance scores, not that it helps much. The authors claim that an advantage of the method is that it always gives a prediction, even when BLAST does not. Is it really reasonable to trade specificity for sensitivity in this way? Can the authors show any cases where this yields a useful answer? Showing some cases where a correct functional annotation was found by this method, but not by BLAST, and providing some insight into why that happened, is necessary. The authors mention in the introduction that sequences that are similar may not have the same function; are they claiming that this new method addresses this? I do not think so, so it seems irrelevant.
Given all these issues, I nearly checked the "not approved" box. But on consideration I think the work could be fixed up, from a technical standpoint.
There are also quite a few English language errors.

sequence to several GO terms), we argue that one could use translation methods in protein functional prediction.

2:
The referee requests more evaluations. There still seems to be no standard data set or evaluation method for protein functional prediction. We will find a proper data set and try different evaluation metrics. More available function prediction methods will also be tried.