RNAmining: A machine learning stand-alone and web server tool for RNA coding potential prediction

Non-coding RNAs (ncRNAs) are important players in the cellular regulation of organisms from different kingdoms. One of the key steps in ncRNA research is the ability to distinguish coding from non-coding sequences. We applied seven machine learning algorithms (Naive Bayes, Support Vector Machine, K-Nearest Neighbors, Random Forest, Extreme Gradient Boosting, Neural Networks and Deep Learning) across model organisms from different evolutionary branches to create a stand-alone and web server tool (RNAmining) to distinguish coding and non-coding sequences. Firstly, we used coding/non-coding sequences downloaded from Ensembl (April 14th, 2020). Then, the coding/non-coding sequences were balanced, their trinucleotide counts were analysed (64 features) and we performed a normalization by the sequence length, resulting in a total of 180 models. The machine learning algorithm validations were performed using 10-fold cross-validation and we selected the algorithm with the best results (eXtreme Gradient Boosting) for implementation in RNAmining. Best F1-scores ranged from 97.56% to 99.57% depending on the organism. Moreover, we benchmarked RNAmining against other tools in the literature (CPAT, CPC2, RNAcon and TransDecoder) and our results outperformed them. Both stand-alone and web server versions of RNAmining are freely available at https://rnamining.integrativebioinformatics.me/.


Amendments from Version 1
Here, we present the revised manuscript. In brief, the minor changes are as follows. We updated the abstract. We updated the Introduction section according to the reviewers' suggestions: (1) we included the citations for BASiNET and CoDaN; (2) we added the sentence "Next, RNAmining was evaluated in another 9 phylogenetically related and unrelated organisms that were not used in our training, demonstrating the efficiency of the tool even when applied in species phylogenetically distant from those used in training." We restructured the second paragraph of the "Machine learning classifier algorithms selection" section and the first paragraph of the "Training and testing datasets, model building and quality measuring for coding potential evaluation" section.
We added a new key point to the Conclusions: "RNAmining was evaluated using other phylogenetically related and unrelated organisms that were not used in our training, demonstrating the efficiency of the tool even when applied in species phylogenetically distant from those used in training."

Introduction
Non-coding RNAs (ncRNAs) are key functional players in different biological processes in organisms from all domains of life 1,2 . Their investigation is already routine in almost every transcriptome or genome project. Dysregulation of these molecules may lead to different types of human disease, including cancers 3 , neurological disorders 4 and cardiovascular diseases 5 .
The genome of eukaryotic 6 organisms is, in general, mostly composed of non-coding sequences, with complex organisms estimated to transcribe more than 75% of their genomes 7 . Despite strong evidence associating these ncRNAs with key functions in the cell, most of them are not yet associated with a functional mechanism. An important step in the computational identification of ncRNAs in a transcriptome project is the evaluation of their potential to be translated into proteins using different bioinformatics approaches 8,9 . To computationally evaluate the coding potential of a set of transcripts, available tools and algorithms normally analyse specific characteristics of the primary sequences (e.g. nucleotide counts, the existence of a reliable open reading frame).
For instance, RNAcon implements a Support Vector Machine (SVM)-based model for the discrimination between coding and non-coding sequences 10 . The Coding Potential Assessment Tool (CPAT) 11 assesses the coding potential through an alignment-free method, which uses a logistic regression model built on different characteristics of the sequence open reading frame (ORF), including length, coverage and nucleotide compositional bias. TransDecoder identifies candidate coding transcripts based on other distinctive features of predicted ORFs (e.g. a minimum ORF length, a log-likelihood score, encapsulated ORFs) 12 . CPC2 13 trained an SVM model using the Fickett TESTCODE score, ORF length, ORF integrity and isoelectric point as features. The LIBSVM 14 package was employed to train an SVM model using the standard radial basis function kernel (RBF kernel) with a training dataset containing 17,984 high-confidence human protein-coding transcripts and 10,452 non-coding transcripts 11 . CoDaN uses Generalized Hidden Markov models to generate probabilistic models based on the GC content of nucleotide sequences in order to estimate the coding regions and both 5' and 3' untranslated regions of transcripts 15 . BASiNET performs feature selection to transform nucleotide sequences into complex networks, then generates topological measures to build a feature vector used to classify the sequences 16 .
Here, we applied and benchmarked seven different machine learning algorithms (Random Forest, eXtreme Gradient Boosting (XGBoost), Naive Bayes, K-Nearest Neighbors (K-NN), SVM, Artificial Neural Network (ANN) and Deep Learning (DL)) across 15 organisms from different evolutionary branches, in order to evaluate their performance in distinguishing coding and non-coding RNA sequences. We then developed a stand-alone and web server tool, called RNAmining (http://rnamining.integrativebioinformatics.me/), by selecting and implementing the algorithm with the best performance in all organisms (XGBoost). Next, RNAmining was evaluated in another 9 phylogenetically related and unrelated organisms that were not used in our training, demonstrating the efficiency of the tool even when applied to species phylogenetically distant from those used in training. In total, it was evaluated across 24 organisms from the eukaryotic tree of life and its results outperformed publicly available tools commonly used for this purpose.

Machine learning classifier algorithms selection
Classification algorithms can be divided according to their learning paradigm: (i) symbolic, which seeks to learn by constructing symbolic representations of a concept through the analysis of examples and counterexamples (e.g. decision trees and rule-based systems); (ii) statistical, which uses statistical methods and models to find a good approximation of the induced concept (e.g. Bayesian learning); (iii) based on examples (lazy systems), which aims to classify never-seen examples using similar known examples, assuming that the new example will belong to the same class as the similar one (e.g. K-Nearest Neighbors); (iv) based on optimization, which consists of maximizing (or minimizing) an objective function or finding an optimal hyperplane that best divides two classes (e.g. SVM and neural networks); (v) connectionist, which represents simplified mathematical constructions inspired by the biological model of the nervous system (e.g. neural networks). In this benchmarking, we evaluated the performance of selected algorithms from each paradigm type in the coding potential prediction of RNA sequences: Random Forest, XGBoost, Naive Bayes, K-NN, SVM and neural networks (ANN and Convolutional Neural Networks (CNN)).
All the machine learning methods were executed using scikit-learn (version 0.21.3) 17 , except for the Neural Network and DL models, which were implemented using the Keras API with TensorFlow as backend (version 2.3.0), and the XGBoost algorithm, which was executed using the XGBoost library (version 1.2.0) 18 in Python (version 3.8). XGBoost, K-NN and Naive Bayes models were trained with the default values. The Random Forest and SVM parameters were obtained through the grid search method: the best results using Random Forest came from a model generated with the default parameters, with the exception of the number of trees used (150 estimators) and the criterion parameter set to 'entropy' for information gain. For SVM, the resulting model was trained with the Radial Basis Function (RBF) kernel, with the regularization parameter (C) and kernel coefficient (gamma) set to 1000 and 0.8, respectively. ANN and DL were performed with different architectures according to grid search and empirical tests. The first ANN experiment was composed of three hidden layers consisting of 32-16-8 neurons, respectively; the second ANN experiment was performed with 64-32-16-8 neurons; and the third experiment was executed with 32-32-16-8 neurons. Next, we produced four experiments with DL using 2 CNN layers followed by 2 fully connected (dense) layers: the first experiment had 512(CNN)-512(CNN) filters and 28(Dense)-1(Dense) neurons; the second was created with 64(CNN)-64(CNN) filters and 128(Dense)-1(Dense) neurons; the third was performed with 32(CNN)-32(CNN)-128(Dense)-1(Dense) neurons; and the last was built with 128(CNN)-128(CNN)-128(Dense)-1(Dense) neurons. These layers received as input the total number of attributes (i.e. the combination of trinucleotide counts, described in the next topics).
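The grid search procedure described above can be sketched in plain Python. The Random Forest grid below mirrors the winning values reported in the text (150 estimators, criterion 'entropy'); `evaluate_model`, here a toy lookup table, is a hypothetical stand-in for a real cross-validated scoring function and not part of RNAmining.

```python
from itertools import product

# Parameter grid mirroring the Random Forest search; the winning
# combination reported in the text was 150 estimators with 'entropy'.
rf_grid = {"n_estimators": [50, 100, 150], "criterion": ["gini", "entropy"]}

def grid_search(param_grid, evaluate_model):
    """Exhaustively score every parameter combination and return the best."""
    names = sorted(param_grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate_model(params)  # e.g. mean 10-fold CV F1-score
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Toy scoring table standing in for cross-validated model training.
toy_scores = {("entropy", 150): 0.98, ("entropy", 100): 0.97,
              ("gini", 150): 0.96, ("gini", 100): 0.95,
              ("entropy", 50): 0.94, ("gini", 50): 0.93}
best, score = grid_search(
    rf_grid, lambda p: toy_scores[(p["criterion"], p["n_estimators"])])
```

The same exhaustive loop applies to the SVM search over C and gamma values.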
The hyperparameters used to execute the DL and ANN approaches are made available in Extended data: Supplementary File S1 19 .
Training and testing datasets, model building and quality measuring for coding potential evaluation
The cross-validation approach was applied in the grid search method, using the training dataset to validate the hyperparameters and obtain the best set of parameters. In addition, this partition method validates the hyperparameters' results through different validation sets, indicating that the model generalizes the problem well. Sequences were randomly divided into training and testing datasets, using 80% of the data for training and 20% for testing. The connectionist methods (i.e. Artificial Neural Networks and Convolutional Neural Networks) demand a validation dataset to adjust the model, because of the weight optimization stage and its hyperparameters. Thus, for experiments with ANN and CNN, 20% of the data was used for validation, 60% for training (compared with 80% for the other algorithms) and 20% for testing. The testing dataset was the same for all machine learning algorithms. The number of sequences used for each organism in the training and test sets can be observed in Table 1. Next, we generated 180 models (i.e. one per algorithm for each organism, with three experiments for ANN models and four experiments for CNN models), which were further evaluated in this work.
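The 80/20 and 60/20/20 partitioning described above can be sketched as follows; this is a simplified illustration (fixed seed, no class balancing), not RNAmining's exact implementation. Note that with a fixed shuffle the held-out 20% test set is identical in both modes, matching the statement that the testing dataset was the same for all algorithms.

```python
import random

def split_dataset(sequences, with_validation=False, seed=42):
    """Randomly partition sequences 80/20 (train/test), or 60/20/20
    (train/validation/test) for the connectionist models (ANN/CNN)."""
    data = list(sequences)
    random.Random(seed).shuffle(data)   # fixed seed: same shuffle every call
    n_test = len(data) // 5             # 20% held out for testing in all cases
    test = data[:n_test]
    if with_validation:
        valid = data[n_test:2 * n_test]  # a further 20% for validation
        train = data[2 * n_test:]        # remaining 60% for training
        return train, valid, test
    return data[n_test:], test           # 80% train / 20% test

seqs = [f"seq{i}" for i in range(100)]
train, test = split_dataset(seqs)
tr, va, te = split_dataset(seqs, with_validation=True)
```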
After selection of the best model, it was applied and evaluated in nine other organisms (Figure 1A), different from the ones used in the training process, including five related chordates and four phylogenetically distant species. Among the chordates, the models were tested in Carassius auratus (Actinopterygii, Teleostei), Gorilla gorilla gorilla (Placentalia), Pseudonaja textilis (Sauria, Squamata), Rattus norvegicus (Placentalia) and Terrapene carolina triunguis (Sauria, Testudines). Within non-chordate species, we evaluated the model in Arabidopsis thaliana (Plantae, Eudicots), Caenorhabditis elegans (Nematoda), Drosophila melanogaster (Insecta, Diptera) and Saccharomyces cerevisiae (Fungi, Ascomycota). Finally, the model was evaluated using artificial sequences containing the same nucleotide composition as the ncRNAs for each species of the testing dataset (Table 1). Ten sets of random sequences containing the same number of ncRNAs per species were generated using the MEME suite version 5.1.1 with default parameters 21 . All sequences in FASTA format with their respective Ensembl identifiers can be retrieved at the RNAmining website (https://rnamining.integrativebioinformatics.me/download).
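The artificial sequences in the paper were generated with the MEME suite; purely to illustrate the idea, a stdlib Python sketch that samples random sequences preserving a set's overall nucleotide composition and individual lengths could look like this (an approximation, not the MEME procedure).

```python
import random
from collections import Counter

def random_sequences_like(real_seqs, seed=0):
    """Generate one random sequence per input sequence, preserving the
    set's overall nucleotide composition and each sequence's length."""
    rng = random.Random(seed)
    counts = Counter("".join(real_seqs))          # pooled nucleotide counts
    alphabet = sorted(counts)
    weights = [counts[nt] for nt in alphabet]     # sample proportionally
    return ["".join(rng.choices(alphabet, weights=weights, k=len(s)))
            for s in real_seqs]

ncrnas = ["ACGUACGU", "GGGCCCAU"]
fake = random_sequences_like(ncrnas)
```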

Comparisons with publicly available tools
The performance of all algorithms in the coding potential evaluation was compared with publicly available tools commonly employed for this purpose (RNAcon 10 , CPAT 11 , TransDecoder 12 and CPC2 13 ), using default parameters. It is worth noting that CPAT only made available models for H. sapiens […] X. tropicalis (0.25). The whole workflow of RNAmining development can be visualized in Figure 1B.

Figure 1. A. Taxonomic tree according to the organisms used for model building (black) and validation (red). B. Pipeline used to perform the benchmarking and create the tool. Firstly, we downloaded the coding and non-coding sequences from Ensembl; next, we performed the trinucleotide counts and sequence length normalization. We then benchmarked the 7 machine learning algorithms and selected the one with the best performance for implementation in the RNAmining tool (XGBoost algorithm), which was again evaluated using sequences from 9 other species and sets of artificially generated ones. Finally, we benchmarked RNAmining against the publicly available tools for coding potential prediction.

RNAmining tool implementation and availability
The XGBoost method was implemented using the XGBoost library (version 1.2.0) in Python (version 3.8), and the models for each species were saved using Python's pickle library. The web server interface was developed using HTML and CSS. The connection between the front and back end was implemented through JavaScript. The control of files and the connection with the Python scripts was performed through PHP. The user-friendly RNAmining tool and its stand-alone version can be accessed at https://rnamining.integrativebioinformatics.me/. Instructions on how to use it and full documentation are made available. Its source code with a Docker platform can be freely obtained at https://gitlab.com/integrativebioinformatics/RNAmining.
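Persisting the per-species models with pickle, as described above, can be sketched as follows; the trained XGBoost classifier is stood in by a plain dict so the example stays self-contained (a fitted XGBoost object pickles the same way).

```python
import os
import pickle
import tempfile

# Stand-in for the fitted XGBoost classifier saved per species.
model = {"species": "H_sapiens", "n_features": 64}

path = os.path.join(tempfile.gettempdir(), "rnamining_model.pkl")
with open(path, "wb") as fh:
    pickle.dump(model, fh)          # serialise the trained model to disk

with open(path, "rb") as fh:
    restored = pickle.load(fh)      # load it back at prediction time
```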

Results
Using machine learning algorithms to improve the coding potential prediction of RNA sequences
It is known that algorithm performance in predictive analysis is influenced by particularities of the genome sequences of the organisms used in the training set 22 […]. Even without using any plant in the original training set, we applied the different models to predict the coding potential of […]. Finally, in order to show that the results obtained were not due to chance, we created 10 datasets of artificial sequences containing the same number, length and nucleotide composition as the coding and ncRNA sequences from the 15 species used in our testing, shown in Table 1. The F1-score mean, minimum and maximum values of the 10 datasets from each organism can be visualized in Table 5. The confusion matrix and all the other metrics (accuracy, specificity, sensitivity and precision) can be found in Extended data: Supplementary File S4 19 . The mean F1 measurement remained below 38.00 for all artificial sequences created for the tested organisms, with the exception of P. marinus (F1-score of 64.13), which was still below the values obtained with the other organisms tested for coding potential prediction (Table 4).
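For reference, the F1-score used throughout can be computed from the confusion-matrix counts; a minimal sketch (1 = coding, 0 = non-coding):

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: 3 coding and 3 non-coding sequences, two misclassified.
truth     = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 1]
score = f1_score(truth, predicted)   # precision = recall = 2/3
```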
Comparing RNAmining performance with publicly available tools
Next, we compared RNAmining performance with four other tools commonly used for nucleotide coding potential prediction: CPAT, CPC2, RNAcon and TransDecoder. We used as input all coding and ncRNA sequences from the testing datasets of the 15 species listed in Table 1. According to the F1-score metric, RNAmining outperformed all the tools in all organisms, with the exception of CPAT for L. chalumnae, for which both tools presented an equal F1-score of 99.57. The comparative performance of all tools can be observed in […].

RNAmining stand-alone and web server tool
The RNAmining tool was made available in both stand-alone and web server versions. The tool only requires, in FASTA format, the nucleotide sequences of the RNAs on which the user intends to perform coding potential prediction, together with the species name in a standardized format indicating the model to be used. Although our tool presented good results even when using phylogenetically distant organisms, we recommend always using the species most closely related to the one on which the user wants to perform predictions. Furthermore, the RNAmining documentation presents all the guidelines on how to generate a model for a particular set of sequences and organisms of interest. The web interface of the RNAmining tool was developed to allow users to quickly perform coding potential prediction without the need to install any specific program, using only a generic internet browser. The only requirements for running the tool are a FASTA file containing the nucleotide sequences and the organism model the user wants to use, which can be selected in a drop-down menu containing all 15 organisms used in the training step (Figure 2A). There is no limit on the number of sequences, but the web server supports files up to 20 Mb. For bigger files, we recommend using the stand-alone RNAmining tool.
RNAmining automatically classifies the FASTA sequences used as input, identifying which of them are coding or non-coding RNAs. Finally, as a result it offers a table with the sequences' IDs, their classification as coding or non-coding and the classification probabilities, which can also be downloaded in tabular format, together with two separate FASTA files containing the coding and non-coding sequences (Figure 2B).
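The input/output behavior described above can be sketched as follows: parse a FASTA file, label each record and partition the sequences. Here `classify` is a hypothetical stand-in for the trained XGBoost model, returning a (label, probability) pair.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (identifier, sequence) pairs."""
    records, seq_id, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                records.append((seq_id, "".join(chunks)))
            seq_id, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if seq_id is not None:
        records.append((seq_id, "".join(chunks)))
    return records

def split_by_coding(records, classify):
    """Build a results table and split records into coding/non-coding sets."""
    table, coding, noncoding = [], [], []
    for seq_id, seq in records:
        label, prob = classify(seq)
        table.append((seq_id, label, prob))
        (coding if label == "coding" else noncoding).append((seq_id, seq))
    return table, coding, noncoding

fasta = ">tx1 sample\nATGGCC\n>tx2\nAUAUAU\n"
records = parse_fasta(fasta)
# Hypothetical classifier: calls sequences starting with ATG "coding".
table, coding, noncoding = split_by_coding(
    records,
    lambda s: ("coding", 0.9) if s.startswith("ATG") else ("non-coding", 0.8))
```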

Discussion
The coding potential prediction of nucleotide sequences is a key step in defining the repertoire of non-coding RNAs in a genome or transcriptome project, especially when dealing with non-model organisms. Predictive tools for the computational characterization of RNA molecules, in analyses such as the prediction of specific RNA families 22 or the estimation of networks of RNA-RNA 23 or protein-RNA interactions 24 , often have their performance affected by the training organism, increasing the number of false positives when applied to evolutionarily distant species. In this work, we evaluated the performance of seven different supervised machine learning algorithms, using eukaryotic species from a variety of evolutionary clades, revealing their potential for the development of novel and improved computational tools for the coding potential prediction of RNA sequences. Artificial intelligence has been widely used in computational biology 25,26 , but its application to characterize ncRNAs has been limited.
In this benchmarking, we opted to analyze the trinucleotide counts as the main feature for coding potential prediction, followed by a normalization by the sequence length […]. They used the counts of mono-, di-, tri-, tetra- and penta-nucleotides and a combination of all counts using the SVM method, and showed that using trinucleotide counts is enough to predict the coding potential of ncRNAs with better accuracy. Our comparison of the machine learning algorithms revealed XGBoost as the algorithm with the best performance, presenting efficiency in predicting the coding potential of RNA sequences even when using the models of distantly related organisms. This shows the usefulness of this approach for performing coding predictions in non-model organisms.
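The trinucleotide feature extraction described above can be sketched in a few lines of Python. This is a minimal sketch assuming overlapping 3-mer counts divided by the sequence length; RNAmining's exact normalization may differ in detail.

```python
from itertools import product

# All 64 possible trinucleotides over the DNA alphabet, in a fixed order.
TRINUCLEOTIDES = ["".join(t) for t in product("ACGT", repeat=3)]

def trinucleotide_features(seq):
    """Count overlapping trinucleotides and normalise by sequence length,
    yielding a fixed 64-dimensional feature vector."""
    seq = seq.upper().replace("U", "T")   # accept RNA as well as DNA input
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        if tri in counts:                  # skip windows with ambiguous bases
            counts[tri] += 1
    n = len(seq)
    return [counts[t] / n for t in TRINUCLEOTIDES]

features = trinucleotide_features("ATGATG")   # windows: ATG, TGA, GAT, ATG
```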
We implemented XGBoost in RNAmining, a stand-alone and web server tool flexible enough to be used in genome or transcriptome projects focused on both model and non-model eukaryotic organisms. Our tool outperformed similar approaches, such as CPAT 11 , CPC2 13 , RNAcon 10 and TransDecoder 12 . Both versions of the software are easy to use, with the web version providing a simple report and FASTA files that can be used in downstream analyses. It provides 15 models generated from eukaryotes from different evolutionary clades. Other models can be generated by the user with the stand-alone version through simple command-line operations. These features facilitate its usage by experienced users and, especially, by those without any programming experience, who can easily perform large-scale predictions of the coding potential of nucleotide sequences in genome or transcriptome initiatives.

Conclusions
• We used pattern recognition approaches to investigate the coding potential prediction of RNAs, using 64 features (the counts of all trinucleotide combinations).
• We performed a benchmarking of seven machine learning algorithms (Naive Bayes, SVM, K-NN, Random Forest, XGBoost, ANN and DL) across 15 model organisms from different evolutionary branches and implemented the best one (XGBoost) in a novel tool (RNAmining).
• RNAmining is a user-friendly coding potential prediction web tool that applies the XGBoost algorithm to predict the coding potential of RNA sequences.
• RNAmining was evaluated using other phylogenetically related and unrelated organisms that were not used in our training, demonstrating the efficiency of the tool even when applied in species phylogenetically distant from those used in training.
• A comprehensive analysis using data from 15 organisms revealed that RNAmining outperformed other tools available in literature (CPAT, CPC2, RNAcon and TransDecoder).

Data availability
Underlying data
Ensembl is an open access genome browser for vertebrate genomes, available at the Ensembl website (https://www.ensembl.org/index.html).
RNAmining is a tool for coding potential prediction which is freely available at https://rnamining.integrativebioinformatics.me/download.
