Software Tool Article
Revised

RNAmining: A machine learning stand-alone and web server tool for RNA coding potential prediction

[version 2; peer review: 2 approved]
PUBLISHED 08 Jun 2021

This article is included in the Cell & Molecular Biology gateway.

This article is included in the Artificial Intelligence and Machine Learning gateway.

Abstract

Non-coding RNAs (ncRNAs) are important players in the cellular regulation of organisms from different kingdoms. One of the key steps in ncRNA research is distinguishing coding from non-coding sequences. We applied seven machine learning algorithms (Naive Bayes, Support Vector Machine, K-Nearest Neighbors, Random Forest, Extreme Gradient Boosting, Neural Networks and Deep Learning) to model organisms from different evolutionary branches to create a stand-alone and web server tool (RNAmining) that distinguishes coding and non-coding sequences. First, we downloaded coding and non-coding sequences from Ensembl (April 14th, 2020). Then, the coding and non-coding sets were balanced, their trinucleotide counts were computed (64 features) and the counts were normalized by sequence length, resulting in a total of 180 models. The machine learning algorithms were validated using 10-fold cross-validation and we selected the algorithm with the best results (eXtreme Gradient Boosting) to implement in RNAmining. Best F1-scores ranged from 97.56% to 99.57% depending on the organism. Moreover, we benchmarked RNAmining against other tools already in the literature (CPAT, CPC2, RNAcon and TransDecoder) and our results outperformed them. Both stand-alone and web server versions of RNAmining are freely available at https://rnamining.integrativebioinformatics.me/.

Keywords

Machine Learning, non-coding RNA, benchmarking, coding potential prediction

Revised Amendments from Version 1

Here, we present the revised manuscript. In brief, the minor changes are as follows:

We updated the abstract

We updated the Introduction section following the reviewer's suggestions: 1- We included the citations for BASiNET and CoDaN; 2- We added the sentence "Next, RNAmining was evaluated in another 9 phylogenetically related and unrelated organisms that were not used in our training, demonstrating the efficiency of the tool even when applied in species phylogenetically distant from those used in training."

We restructured the second paragraph of "Machine learning classifier algorithms selection" section and the first paragraph of "Training and testing datasets, model building and quality measuring for coding potential evaluation" section.

We added a new key point in conclusion "RNAmining was evaluated using other phylogenetically related and unrelated organisms that were not used in our training, demonstrating the efficiency of the tool even when applied in species phylogenetically distant from those used in training."

We updated Figure 2 and the source code of RNAmining (including the classification probabilities in the output) as suggested by the reviewers.

See the authors' detailed response to the review by Andre Yoshiaki Kashiwabara
See the authors' detailed response to the review by Gilderlanio Araújo

Introduction

Non-coding RNAs (ncRNAs) are key functional players in different biological processes in organisms from all domains of life1,2. Their investigation is already routine in almost every transcriptome or genome project. Dysregulation of these molecules may lead to different types of human disease, including cancers3, neurological disorders4 and cardiovascular diseases5.

In general, most of the genome of eukaryotic6 organisms gives rise to non-coding transcripts, with complex organisms estimated to transcribe more than 75% of their genomes7. Despite strong evidence associating these ncRNAs with key functions in the cell, most of them have not yet been associated with a functional mechanism. In a transcriptome project, an important step in the computational identification of ncRNAs is the evaluation of their potential to be translated into proteins using different bioinformatics approaches8,9. To computationally evaluate the coding potential of a set of transcripts, available tools or algorithms normally analyse specific characteristics of the primary sequences (e.g. nucleotide counts, the existence of a reliable open reading frame).

For instance, RNAcon implements a Support Vector Machine (SVM)-based model for the discrimination between coding and non-coding sequences10. Coding Potential Assessment Tool (CPAT)11 assesses the coding potential through an alignment-free method, which uses a logistic regression model built on different characteristics of the sequence open reading frame (ORF), including length, coverage and nucleotide compositional bias. TransDecoder identifies candidate coding transcripts based on other distinctive features from predicted ORFs (e.g. a minimum length ORF, a log-likelihood score, encapsulated ORF)12. CPC213 trained an SVM model using Fickett TESTCODE score, ORF length, ORF integrity and isoelectric point as features. The LIBSVM14 package was employed to train an SVM model using the standard radial basis function kernel (RBF kernel) with a training dataset containing 17,984 high-confidence human protein-coding transcripts and 10,452 non-coding transcripts11. CoDaN uses Generalized Hidden Markov Models to generate probabilistic models based on the GC content of nucleotide sequences in order to estimate the coding regions and both 5' and 3' untranslated regions of transcripts15. BASiNET performs feature selection to transform nucleotide sequences into complex networks, then generates topological measures to build a feature vector used to classify the sequences16.

Here, we applied and benchmarked seven different machine learning algorithms (Random Forest, eXtreme Gradient Boosting (XGBoost), Naive Bayes, K-Nearest Neighbors (K-NN), SVM, Artificial Neural Network (ANN) and Deep Learning (DL)) across 15 organisms from different evolutionary branches, in order to evaluate their performance in distinguishing coding and non-coding RNA sequences. Next, we developed a stand-alone and web server tool, called RNAmining (http://rnamining.integrativebioinformatics.me/), by selecting and implementing the algorithm with the best performance in all organisms (XGBoost). RNAmining was then evaluated in another 9 phylogenetically related and unrelated organisms that were not used in our training, demonstrating the efficiency of the tool even when applied in species phylogenetically distant from those used in training. In total, it was evaluated across 24 organisms from the eukaryotic tree of life and its results outperformed publicly available tools commonly used for that purpose.

Methods

Machine learning classifier algorithms selection

In the classification process there is a division related to the learning paradigm, with classification algorithms divided into: (i) Symbolic, which seeks to learn by constructing symbolic representations of a concept through the analysis of examples and counterexamples (e.g. Decision Trees and Rule-based Systems); (ii) Statistical, which uses statistical methods and models to find a good approximation of the induced concept (e.g. Bayesian learning); (iii) Based on Examples (lazy systems), which aims to classify previously unseen examples using similar known examples, assuming that the new example will belong to the same class as the similar example (e.g. K-Nearest Neighbors); (iv) Based on Optimization, which consists of maximizing (or minimizing) an objective function or finding an optimal hyperplane that best divides two classes (e.g. SVM and Neural Networks); (v) Connectionist Representation, which comprises simplified mathematical constructions inspired by the biological model of the nervous system (e.g. Neural Networks). In this benchmarking, we decided to evaluate the performance of selected algorithms from each paradigm type in the coding potential prediction of RNA sequences: Random Forest, XGBoost, Naive Bayes, K-NN, SVM and Neural Networks (ANN and Convolutional Neural Networks (CNN)).

All the machine learning methods were executed using scikit-learn (Version 0.21.3)17, except for the Neural Network and DL models, which were implemented using the Keras API with TensorFlow as backend (Version 2.3.0), and the XGBoost algorithm, which was executed using the XGBoost library (version 1.2.0)18, all in Python (Version 3.8). XGBoost, K-NN and Naive Bayes models were trained with the default values. The Random Forest and SVM parameters were obtained through grid search. The best results using Random Forest came from a model generated with the default parameters, with the exception of the number of trees (150 estimators) and the criterion parameter, which was set to 'entropy' for information gain. For SVM, the resulting model was trained with the Radial Basis Function (RBF) kernel, with the regularization parameter (C) and kernel coefficient (gamma) defined as 1000 and 0.8, respectively. ANN and DL models were built with different architectures according to grid search and empirical tests. The first ANN experiment was composed of three hidden layers consisting of 32-16-8 neurons, respectively; the second ANN experiment was performed with 64-32-16-8 neurons; and the third experiment was executed with 32-32-16-8 neurons. Next, we produced four experiments with DL using 2 CNN layers, followed by 2 fully connected (dense) layers: the first experiment had 512(CNN)-512(CNN) filters and 28(Dense)-1(Dense) neurons; the second was created with 64(CNN)-64(CNN) filters and 128(Dense)-1(Dense) neurons; the third was performed with 32(CNN)-32(CNN) filters and 128(Dense)-1(Dense) neurons; and the last was built with 128(CNN)-128(CNN) filters and 128(Dense)-1(Dense) neurons. These layers received as input the total number of attributes (i.e. the combination of trinucleotide counts, described in the next topics).
The hyperparameters used to execute the DL and ANN approaches are made available in Extended data: Supplementary File S119.
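As an illustration of the Random Forest and SVM settings reported above, the following minimal sketch instantiates both classifiers in scikit-learn. It is not taken from the RNAmining source code; the variable names are hypothetical, and every parameter not mentioned in the text is left at the default of whichever scikit-learn version is installed, which may differ from the defaults of version 0.21.3.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Random Forest: 150 trees and the entropy (information gain) criterion,
# as reported in the text; all other parameters keep library defaults.
random_forest = RandomForestClassifier(n_estimators=150, criterion="entropy")

# SVM: RBF kernel with C = 1000 and gamma = 0.8, as reported in the text.
svm = SVC(kernel="rbf", C=1000, gamma=0.8)
```

Both objects would then be fitted on the 64 normalized trinucleotide features with the usual fit and predict calls.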

Datasets selection and filtering criteria

We compared the algorithms' performance using different sets of coding and non-coding RNA sequences from the Ensembl (April 14th 2020)20 database, covering 15 organisms of distinct representative Chordata clades (Figure 1A): Anolis carolinensis (Sauria, Squamata), Chrysemys picta bellii (Sauria, Testudines), Crocodylus porosus (Archosauria, Pseudosuchia), Danio rerio (Actinopterygii, Teleostei), Eptatretus burgeri (Agnatha, Myxinidae), Gallus gallus (Archosauria, Theropoda), Homo sapiens (Placentalia), Latimeria chalumnae (Sarcopterygii, Coelacanth), Monodelphis domestica (Marsupialia), Mus musculus (Placentalia), Notechis scutatus (Sauria, Squamata), Ornithorhynchus anatinus (Monotremata), Petromyzon marinus (Agnatha, Petromyzontiformes), Sphenodon punctatus (Sauria, Rhynchocephalia), Xenopus tropicalis (Amphibia). All non-coding RNA sequences for each organism were downloaded from Ensembl transcripts. To obtain a balanced set of sequences (i.e. an equal number of coding and non-coding sequences), coding RNAs were randomly sampled to match the number of ncRNAs for each species. Moreover, before generating the models, the features were normalized by sequence length (i.e. each trinucleotide count was divided by the total size of the given sequence). All sequences in FASTA format with their respective Ensembl identifiers can be retrieved at the RNAmining website (https://rnamining.integrativebioinformatics.me/download).
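The length normalization described above can be sketched as follows. This is a minimal illustration rather than the RNAmining implementation: the function name is hypothetical, and two details are assumptions not stated in the text, namely that trinucleotides are counted in overlapping windows with a step of one nucleotide and that windows containing symbols other than A, C, G and T are skipped.

```python
from itertools import product

# All 64 trinucleotides over the DNA alphabet, in a fixed order.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]

def trinucleotide_features(sequence):
    """Return 64 trinucleotide counts, each divided by the sequence length."""
    seq = sequence.upper().replace("U", "T")
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    # Assumption: overlapping windows with a step of one nucleotide.
    for i in range(len(seq) - 2):
        kmer = seq[i:i + 3]
        if kmer in counts:  # skip windows with N or other ambiguous symbols
            counts[kmer] += 1
    length = len(seq)
    return [counts[k] / length if length else 0.0 for k in TRINUCLEOTIDES]

# Toy sequence: ATG and TGA each occur twice, GAT once, and the length is 7.
features = trinucleotide_features("ATGATGA")
```

Each sequence is thus represented by a fixed-length vector of 64 values regardless of its length, which is what allows transcripts of very different sizes to be compared by the same model.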


Figure 1.

A. Taxonomic tree of the organisms used for model building (black) and validation (red). B. Pipeline used to perform the benchmarking and create the tool. First, we downloaded the coding and non-coding sequences from Ensembl. Next, we computed the trinucleotide counts and normalized them by sequence length. After this, we benchmarked the 7 machine learning algorithms and selected the one with the best performance (XGBoost) to be implemented in the RNAmining tool, which was then evaluated using sequences from 9 other species and sets of artificially generated sequences. Finally, we benchmarked RNAmining against the publicly available tools for coding potential prediction.

Training and testing datasets, model building and quality measuring for coding potential evaluation

Cross-validation was applied within the grid search, using the training dataset to validate the hyperparameters and obtain the best set of parameters. Because this partitioning validates each hyperparameter setting on several different validation sets, it indicates that the model is working and generalizing rather than fitting a single partition. Sequences were randomly divided into training and testing datasets, using 80% of the data for training and 20% for testing. The connectionist methods (e.g. Artificial Neural Networks and Convolutional Neural Networks) demand a validation dataset to adjust the model during the weight optimization stage and hyperparameter tuning. Thus, for experiments with ANN and CNN, 60% of the data were used for training (instead of the 80% used for the other algorithms), 20% for validation and 20% for testing. The same testing dataset was used for all machine learning algorithms. The number of sequences used for each organism in the training and testing sets is shown in Table 1. Next, we generated 180 models (i.e. one per algorithm for each organism, with three experiments for ANN models and four experiments for CNN models), which were further evaluated in this work.
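The partitioning and the model count can be made concrete with a short sketch. It assumes, as one plausible reading of the text, that the 20% validation set for ANN and CNN is carved out of the 80% training portion so that the testing set stays identical for all algorithms; the function name, the seed and the dataset size of 1,000 are hypothetical.

```python
import random

def split_dataset(n_sequences, seed=0):
    """Shuffle sequence indices and derive the 80/20 and 60/20/20 partitions."""
    indices = list(range(n_sequences))
    random.Random(seed).shuffle(indices)
    n_test = round(0.2 * n_sequences)
    n_validation = round(0.2 * n_sequences)
    test = indices[:n_test]                # shared by all algorithms
    train_80 = indices[n_test:]            # training set for non-connectionist methods
    validation = train_80[:n_validation]   # used only by ANN and CNN
    train_60 = train_80[n_validation:]     # training set for ANN and CNN
    return {"test": test, "train_80": train_80,
            "validation": validation, "train_60": train_60}

parts = split_dataset(1000)

# 15 organisms, each with 5 single-model algorithms (Random Forest, XGBoost,
# Naive Bayes, K-NN, SVM) plus 3 ANN and 4 CNN experiments.
n_models = 15 * (5 + 3 + 4)
```

The count works out to 15 × 12 = 180, matching the number of models reported in the text.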

Table 1. Set of sequences used in the training and testing datasets.

List of organisms and the total number of sequences used for testing and training both coding and non-coding RNAs. The numbers are separated into training/testing values. All sequences can be retrieved at RNAmining website (https://rnamining.integrativebioinformatics.me/download).

Species | Total | Coding | ncRNAs
Models Generation (training / testing):
Anolis carolinensis | 12,542 / 3,136 | 6,243 / 1,596 | 6,299 / 1,540
Chrysemys picta bellii | 11,260 / 2,816 | 5,626 / 1,412 | 5,634 / 1,404
Crocodylus porosus | 7,388 / 1,848 | 3,700 / 918 | 3,688 / 930
Danio rerio | 12,984 / 3,246 | 6,527 / 1,588 | 6,457 / 1,658
Eptatretus burgeri | 1,742 / 436 | 867 / 222 | 875 / 214
Gallus gallus | 16,851 / 4,213 | 8,426 / 2,106 | 8,425 / 2,107
Homo sapiens | 92,844 / 23,212 | 46,575 / 11,453 | 46,269 / 11,759
Latimeria chalumnae | 4,668 / 1,168 | 2,344 / 574 | 2,324 / 594
Monodelphis domestica | 34,336 / 8,584 | 17,113 / 4,347 | 17,223 / 4,237
Mus musculus | 35,272 / 8,818 | 17,668 / 4,377 | 17,604 / 4,441
Notechis scutatus | 2,705 / 677 | 1,351 / 340 | 1,354 / 337
Ornithorhynchus anatinus | 12,604 / 3,152 | 6,280 / 1,598 | 6,324 / 1,554
Petromyzon marinus | 4,243 / 1,061 | 2,107 / 545 | 2,136 / 516
Sphenodon punctatus | 1,456 / 364 | 723 / 187 | 733 / 177
Xenopus tropicalis | 2,224 / 556 | 1,120 / 270 | 1,104 / 286
RNAmining Evaluation:
Arabidopsis thaliana | 11,308 | 5,654 | 5,654
Caenorhabditis elegans | 50,558 | 25,279 | 25,279
Carassius auratus | 15,004 | 7,502 | 7,502
Drosophila melanogaster | 31,808 | 15,904 | 15,904
Gorilla gorilla gorilla | 15,978 | 7,989 | 7,989
Pseudonaja textilis | 1,486 | 743 | 743
Rattus norvegicus | 18,662 | 9,331 | 9,331
Saccharomyces cerevisiae | 848 | 424 | 424
Terrapene carolina triunguis | 2,054 | 1,027 | 1,027

After selection of the best model, it was applied and evaluated in nine other organisms (Figure 1A), different from those used in the training process, including five related Chordata and four phylogenetically distant species. Among the chordates, the models were tested in Carassius auratus (Actinopterygii, Teleostei), Gorilla gorilla gorilla (Placentalia), Pseudonaja textilis (Sauria, Squamata), Rattus norvegicus (Placentalia) and Terrapene carolina triunguis (Sauria, Testudines). Among the non-chordate species, we evaluated the model in Arabidopsis thaliana (Plantae, Eudicots), Caenorhabditis elegans (Nematoda), Drosophila melanogaster (Insecta, Diptera) and Saccharomyces cerevisiae (Fungi, Ascomycota). Finally, it was evaluated using artificial sequences with the same nucleotide composition as the ncRNAs of each species in the testing dataset (Table 1). Ten sets of random sequences containing the same number of ncRNAs per species were generated using MEME suite Version 5.1.1 with default parameters21. All sequences in FASTA format with their respective Ensembl identifiers can be retrieved at the RNAmining website (https://rnamining.integrativebioinformatics.me/download).
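To illustrate what an artificial sequence with the same nucleotide composition means, the sketch below shuffles the letters of a sequence, which preserves both its length and its nucleotide counts. It is a stand-in for illustration only and not the MEME suite procedure used in the study; the function name, the seed and the example sequence are hypothetical.

```python
import random
from collections import Counter

def shuffle_sequence(sequence, rng):
    """Return a random permutation of the letters of the sequence."""
    letters = list(sequence)
    rng.shuffle(letters)
    return "".join(letters)

rng = random.Random(42)          # fixed seed so the sketch is reproducible
original = "AUGGCUAACGGAUUC"     # hypothetical ncRNA fragment

# Ten artificial versions of the same sequence, mirroring the ten random sets.
artificial = [shuffle_sequence(original, rng) for _ in range(10)]

# Every shuffled sequence keeps the original composition and length.
same_composition = all(Counter(s) == Counter(original) for s in artificial)
```

Because shuffling destroys codon structure while keeping composition, a model that relies on real coding signal should classify such sequences very differently from genuine transcripts.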

Comparisons with publicly available tools

The performance of all algorithms in the coding potential evaluation was compared with publicly available tools commonly employed for this purpose (RNAcon10, CPAT11, TransDecoder12 and CPC213), using default parameters. It is worth noting that CPAT only provides models for H. sapiens with a coding probability (CP) cutoff of 0.364 (i.e. CP >=0.364 indicates a coding sequence); M. musculus with a CP cutoff of 0.44; D. melanogaster with a CP cutoff of 0.39; and D. rerio with a CP cutoff of 0.38. Therefore, for the other organisms we built new models using our training sets and used the statistical method provided by the authors to calculate the probability cutoffs for coding prediction: A. carolinensis (0.4); C. picta bellii (0.57); C. porosus (0.38); E. burgeri (0.35); G. gallus (0.42); L. chalumnae (0.365); M. domestica (0.51); N. scutatus (0.15); O. anatinus (0.28); P. marinus (0.34); S. punctatus (0.18); X. tropicalis (0.25). The whole workflow of RNAmining development can be visualized in Figure 1B.
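The species-specific cutoffs listed above can be applied as a simple threshold rule. The sketch below is not part of CPAT; the dictionary and function names are hypothetical, and it assumes that the rule stated for H. sapiens (a coding probability at or above the cutoff indicates a coding sequence) applies equally to the re-trained models.

```python
# Coding probability cutoffs quoted in the text, keyed by species.
CPAT_CUTOFFS = {
    "Homo sapiens": 0.364, "Mus musculus": 0.44,
    "Drosophila melanogaster": 0.39, "Danio rerio": 0.38,
    "Anolis carolinensis": 0.4, "Chrysemys picta bellii": 0.57,
    "Crocodylus porosus": 0.38, "Eptatretus burgeri": 0.35,
    "Gallus gallus": 0.42, "Latimeria chalumnae": 0.365,
    "Monodelphis domestica": 0.51, "Notechis scutatus": 0.15,
    "Ornithorhynchus anatinus": 0.28, "Petromyzon marinus": 0.34,
    "Sphenodon punctatus": 0.18, "Xenopus tropicalis": 0.25,
}

def classify(species, coding_probability):
    """Label a transcript from its coding probability and the species cutoff."""
    cutoff = CPAT_CUTOFFS[species]
    return "coding" if coding_probability >= cutoff else "non-coding"

# The same probability can lead to different labels depending on the species.
label_snake = classify("Notechis scutatus", 0.30)
label_turtle = classify("Chrysemys picta bellii", 0.30)
```

The wide spread of cutoffs, from 0.15 to 0.57, shows why a threshold tuned on one species cannot simply be reused for another.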

RNAmining tool implementation and availability

The XGBoost method was implemented using the XGBoost library (version 1.2.0) in Python (Version 3.8) and the models for each species were saved using Python's pickle library. The web server interface was developed using HTML and CSS. The connection between the front end and back end was implemented in JavaScript. File handling and the connection with the Python scripts were implemented in PHP. The user-friendly RNAmining web tool and its stand-alone version can be accessed at https://rnamining.integrativebioinformatics.me/. Usage instructions and full documentation are available there. Its source code, together with a Docker image, can be freely obtained at https://gitlab.com/integrativebioinformatics/RNAmining.

Results

Using machine learning algorithms to improve the coding potential prediction of RNA sequences

It is known that algorithm performance in predictive analysis is influenced by particularities of the genome sequences of the organisms used in the training set22, and this should be taken into account when developing novel tools for coding potential prediction. Thus, it is necessary to test several methods to observe which ones give good predictions for specific species from different evolutionary branches. Similar to Panwar et al.10, we used trinucleotide counts to distinguish coding and non-coding sequences. We evaluated the performance of seven machine learning algorithms using representative organisms from different branches of the Chordata clade. For that, we used training and testing sets composed of sequences from the same species. The algorithm with the best performance across all evaluated organisms, according to the F1-score metric, was XGBoost, with the following values: A. carolinensis (98.79); C. picta bellii (98.00); C. porosus (98.15); D. rerio (97.98); E. burgeri (97.56); G. gallus (99.24); H. sapiens (98.50); L. chalumnae (99.57); M. domestica (98.84); M. musculus (97.73); N. scutatus (96.51); O. anatinus (97.61); P. marinus (99.42); S. punctatus (99.20); X. tropicalis (99.13) (Table 2). As observed, XGBoost presented F1-scores above 96.50 for every species, with the lowest value obtained for Notechis scutatus (96.51) and the highest for Latimeria chalumnae (99.57). All detailed performances, with sensitivity, specificity, precision, accuracy, F1-score and the confusion matrix for each algorithm, are listed in Supplementary File S219. Based on these results, XGBoost was selected to be implemented in a novel web server and stand-alone tool for RNA coding potential prediction called RNAmining.
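For reference, the metrics named above follow from the four cells of a binary confusion matrix. The sketch below uses the standard definitions and assumes that coding sequences are treated as the positive class, which the text does not state; the function name and the counts are hypothetical and are not results from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion counts."""
    sensitivity = tp / (tp + fn)              # recall on the positive class
    specificity = tn / (tn + fp)              # recall on the negative class
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "accuracy": accuracy, "f1": f1}

# Hypothetical counts for a test set of 100 coding and 100 non-coding sequences.
metrics = classification_metrics(tp=90, fp=5, tn=95, fn=10)
```

The tables in this article report F1-scores as percentages, i.e. the value computed here multiplied by 100.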

Table 2. Benchmarking machine learning methods for coding potential prediction based on trinucleotides count.

F1-score for each one of the 15 species in which the algorithms were tested. Other metrics (sensitivity, specificity, precision, accuracy and the confusion matrix) used for the comparison of the algorithm’s performance were made available at the Extended data: Supplementary File S219.

Species | ANN | CNN | K-NN | Naive Bayes | Random Forest | SVM | XGBoost
Anolis carolinensis | 98.47 | 98.31 | 93.55 | 95.50 | 98.30 | 98.03 | 98.79
Chrysemys picta bellii | 96.54 | 96.02 | 93.54 | 93.13 | 96.89 | 96.04 | 98.00
Crocodylus porosus | 96.74 | 96.48 | 93.67 | 93.93 | 97.26 | 96.35 | 98.15
Danio rerio | 97.54 | 97.77 | 95.44 | 94.55 | 97.56 | 97.27 | 97.98
Eptatretus burgeri | 94.88 | 95.69 | 92.24 | 94.57 | 97.35 | 95.82 | 97.56
Gallus gallus | 98.47 | 98.27 | 96.87 | 95.11 | 98.91 | 98.06 | 99.24
Homo sapiens | 98.01 | 97.66 | 96.63 | 86.00 | 98.30 | 96.83 | 98.50
Latimeria chalumnae | 99.05 | 98.72 | 91.61 | 98.23 | 99.56 | 99.24 | 99.57
Monodelphis domestica | 98.39 | 98.09 | 97.11 | 95.31 | 98.67 | 98.01 | 98.84
Mus musculus | 96.67 | 96.96 | 95.95 | 91.56 | 97.66 | 96.10 | 97.73
Notechis scutatus | 95.90 | 94.10 | 87.77 | 89.81 | 94.94 | 95.73 | 96.51
Ornithorhynchus anatinus | 97.23 | 96.59 | 93.59 | 91.45 | 96.99 | 96.38 | 97.61
Petromyzon marinus | 98.40 | 98.26 | 88.10 | 95.99 | 98.79 | 97.49 | 99.42
Sphenodon punctatus | 97.83 | 96.97 | 78.41 | 96.70 | 96.46 | 95.29 | 99.20
Xenopus tropicalis | 98.28 | 98.81 | 85.53 | 97.14 | 98.88 | 97.20 | 99.13

Using RNAmining in evolutionary related and unrelated organisms

To demonstrate the generalization of the models built into our tool, we evaluated their performance using the following nine Chordata and non-Chordata organisms that were not used in our training step: A. thaliana; C. elegans; C. auratus; D. melanogaster; G. gorilla gorilla; P. textilis; R. norvegicus; S. cerevisiae; Terrapene carolina triunguis. In the training set described in the previous topic, we used sequences from representative species of amphibians, birds, mammals, fishes and reptiles. In this new experiment we ran tests using other chordates, and also covered other evolutionary groups such as plants, fungi, insects and nematodes. The F1-scores obtained varied from 86.25 to 98.10. The worst performance occurred when we used the training set from L. chalumnae (Sarcopterygii, Coelacanth) to predict the coding potential of known coding genes and ncRNAs from D. melanogaster (Insecta, Diptera). The best performance was obtained when we applied the training set from C. picta bellii (Sauria, Testudines) to coding and ncRNA sequences from Terrapene carolina triunguis (Sauria, Testudines). The F1-score for each organism, together with the respective training set evaluated, can be found in Table 3, while the confusion matrix and the other metrics can be found in Extended data: Supplementary File S319.

Table 3. Evaluation (F1-score) of the models generated by XGBoost, the method implemented in RNAmining, according to evolutionary related and unrelated organisms.

Each row corresponds to the model trained on one species, while the columns represent the set of 9 evolutionarily related and unrelated organisms in which the method was evaluated. Other metrics (sensitivity, specificity, precision, accuracy and the confusion matrix) used for the comparisons were made available at the Extended data: Supplementary File S319.

Training \ Testing | Arabidopsis thaliana | Caenorhabditis elegans | Carassius auratus | Drosophila melanogaster | Gorilla gorilla | Pseudonaja textilis | Rattus norvegicus | Saccharomyces cerevisiae | Terrapene carolina triunguis
Anolis carolinensis | 95.35 | 89.97 | 94.77 | 97.16 | 95.17 | 96.56 | 96.74 | 93.07 | 95.83
Chrysemys picta bellii | 97.24 | 97.79 | 95.97 | 98.13 | 97.01 | 97.73 | 97.15 | 96.09 | 98.10
Crocodylus porosus | 96.19 | 96.76 | 95.73 | 97.87 | 97.01 | 96.90 | 97.25 | 95.07 | 97.56
Danio rerio | 96.64 | 90.50 | 95.29 | 97.96 | 97.24 | 96.89 | 96.42 | 93.96 | 96.62
Eptatretus burgeri | 94.90 | 95.57 | 94.80 | 96.73 | 95.34 | 95.43 | 95.76 | 91.49 | 95.51
Gallus gallus | 97.60 | 97.89 | 95.76 | 98.02 | 97.93 | 97.79 | 97.59 | 96.48 | 97.69
Homo sapiens | 95.71 | 81.25 | 92.19 | 96.44 | 97.73 | 96.24 | 94.60 | 93.57 | 95.65
Latimeria chalumnae | 93.71 | 96.78 | 91.63 | 86.25 | 96.30 | 93.39 | 94.37 | 95.47 | 95.63
Monodelphis domestica | 97.40 | 97.91 | 95.69 | 98.04 | 97.90 | 97.53 | 97.46 | 93.54 | 97.31
Mus musculus | 96.44 | 87.68 | 94.66 | 97.17 | 97.57 | 97.31 | 96.67 | 94.32 | 96.30
Notechis scutatus | 97.16 | 97.54 | 95.22 | 97.46 | 97.35 | 97.37 | 96.79 | 94.96 | 97.22
Ornithorhynchus anatinus | 97.39 | 97.48 | 95.39 | 87.74 | 97.32 | 97.86 | 97.29 | 94.67 | 97.53
Petromyzon marinus | 93.31 | 94.48 | 92.07 | 87.74 | 95.81 | 93.47 | 94.72 | 92.48 | 95.56
Sphenodon punctatus | 94.00 | 97.07 | 91.94 | 86.89 | 96.60 | 93.95 | 94.12 | 95.02 | 95.81
Xenopus tropicalis | 93.46 | 96.65 | 91.53 | 84.86 | 95.51 | 93.68 | 93.16 | 94.42 | 95.02

Although no plant was included in the original training set, we applied the different models to predict the coding potential of known coding and ncRNA sequences from A. thaliana (Plantae, Eudicots). The lowest F1-score that RNAmining obtained was 93.31 using a fish model (Petromyzon marinus, Agnatha, Petromyzontiformes). The best F1-score was obtained with a marsupial model (M. domestica, Marsupialia) that reached 97.40. Thus, this experiment demonstrated the efficiency of the method and the models created even when applied to organisms phylogenetically distant from those used in training.

Finally, to show that the results obtained were not due to chance, we created 10 datasets of artificial sequences with the same number, length and nucleotide composition as the coding and ncRNA sequences from the 15 species used in our testing (Table 1). The mean, minimum and maximum F1-scores of the 10 datasets for each organism are shown in Table 4. The confusion matrix and all the other metrics (accuracy, specificity, sensitivity and precision) can be found in Extended data: Supplementary File S419. The mean F1-score remained below 38.00 for the artificial sequences of all tested organisms, with the exception of P. marinus (mean F1-score of 64.13), which was still below the values obtained with the other organisms tested for coding potential prediction (Table 4).

Table 4. Evaluation of RNAmining performance according to different sets of artificial sequences from each trained model.

F1-score metrics for 10 datasets of artificial sequences randomly generated for each species. The mean, minimum and maximum values are displayed separated by organism. Other metrics (sensitivity, specificity, precision, accuracy and the confusion matrix) used for the comparisons were made available at the Extended data: Supplementary File S419.

Species | Mean | Minimum | Maximum
Anolis carolinensis | 1.66 | 0.86 | 2.44
Chrysemys picta bellii | 1.08 | 0.70 | 1.40
Crocodylus porosus | 0.95 | 0.43 | 1.72
Danio rerio | 1.25 | 0.12 | 2.21
Eptatretus burgeri | 2.31 | 0.90 | 3.51
Gallus gallus | 2.48 | 1.88 | 2.89
Homo sapiens | 11.15 | 10.53 | 11.52
Latimeria chalumnae | 24.86 | 21.95 | 27.03
Monodelphis domestica | 1.34 | 1.00 | 1.18
Mus musculus | 6.64 | 5.74 | 7.58
Notechis scutatus | 1.80 | 0.58 | 3.99
Ornithorhynchus anatinus | 3.62 | 2.67 | 5.04
Petromyzon marinus | 64.13 | 62.99 | 65.76
Sphenodon punctatus | 37.43 | 31.72 | 41.84
Xenopus tropicalis | 23.26 | 17.65 | 28.21

Table 5. Benchmarking results from RNAmining and the other tools already described in the literature according to organisms from different evolutionary branches.

F1-score metric for CPAT, CPC2, RNAcon, TransDecoder and RNAmining, based on the predictions using models provided by each tool or generated according to their instructions. The bold numbers are the best values regarding F1-score metric. The results for other metrics were made available at the Extended data: Supplementary File S219.

Species | CPAT | CPC2 | RNAcon | TransDecoder | RNAmining
Anolis carolinensis | 94.55 | 86.87 | 83.03 | 88.26 | 98.79
Chrysemys picta bellii | 92.56 | 89.01 | 82.36 | 84.80 | 98.00
Crocodylus porosus | 94.07 | 92.48 | 84.32 | 87.63 | 98.15
Danio rerio | 94.64 | 87.17 | 80.97 | 87.74 | 97.98
Eptatretus burgeri | 95.59 | 78.82 | 75.84 | 76.26 | 97.56
Gallus gallus | 96.95 | 90.69 | 75.81 | 83.50 | 99.24
Homo sapiens | 95.20 | 75.85 | 71.73 | 76.02 | 98.50
Latimeria chalumnae | 99.57 | 91.60 | 97.45 | 98.86 | 99.57
Monodelphis domestica | 96.24 | 91.44 | 80.90 | 85.22 | 98.84
Mus musculus | 95.48 | 81.40 | 76.78 | 80.80 | 97.73
Notechis scutatus | 85.19 | 86.29 | 84.83 | 83.44 | 96.51
Ornithorhynchus anatinus | 87.47 | 72.04 | 84.73 | 84.63 | 97.61
Petromyzon marinus | 96.59 | 75.14 | 95.11 | 96.68 | 99.42
Sphenodon punctatus | 97.61 | 91.91 | 97.86 | 95.24 | 99.20
Xenopus tropicalis | 99.07 | 97.92 | 98.70 | 97.77 | 99.13

Comparing RNAmining performance with publicly available tools

Next, we compared RNAmining performance with four other tools commonly used for coding potential prediction: CPAT, CPC2, RNAcon and TransDecoder. We used as input all coding and ncRNA sequences from the testing dataset of the 15 species listed in Table 1. According to the F1-score metric, RNAmining outperformed all the tools in all organisms with the exception of CPAT for L. chalumnae, in which both tools presented an equal F1-score of 99.57. The comparative performance of all tools can be observed in Table 5. The detailed results regarding their accuracy, sensitivity, specificity, precision, F1-score and the confusion matrix can be found in Supplementary File S219. Finally, we used Student's t-test to compare the results from RNAmining and the other tools, revealing that our software presented significantly better results in performing coding potential predictions based on known coding genes and ncRNAs. The p-values obtained in these comparisons were: 0.0026 (vs CPAT); 1.57e-05 (vs CPC2); 2.69e-05 (vs RNAcon); and 2.89e-05 (vs TransDecoder).

RNAmining stand-alone and web server tool

The RNAmining tool was made available in both stand-alone and web server versions. Both versions only require the nucleotide sequences of the RNAs for which the user intends to perform the coding potential prediction, in FASTA format, together with the species name, in a standardized format, of the model to be used. Although our tool produced good results even with phylogenetically distant organisms, we recommend always using the model of the species most closely related to the one for which the user wants to perform the predictions. Furthermore, the RNAmining documentation provides guidelines on how to generate a model for a particular set of sequences and organisms of interest. The web interface of RNAmining was developed to allow users to quickly perform the coding potential prediction without installing any specific program, using only a generic internet browser. The only requirements for running the tool are a FASTA file containing the nucleotide sequences and the organism model that the user wants to use, which can be selected in a drop-down menu containing all 15 organisms used in the training step (Figure 2A). There is no limit on the number of sequences, but the web server supports files up to 20 MB. For larger files, we recommend using the stand-alone RNAmining tool. RNAmining will automatically classify the FASTA sequences used as input and identify which of them are coding or non-coding RNAs. Finally, as a result it offers a table with the sequence IDs, their classification as coding or non-coding and the classification probabilities, which can also be downloaded in tabular format, together with two FASTA files containing the coding and non-coding sequences separately (Figure 2B).


Figure 2. RNAmining web server overview.

A. Job launcher screen (Run tab). The user only needs to upload the nucleotide sequences in FASTA format and select the model of the most closely related species. B. Results web page screen. General report containing the list of coding and non-coding sequences in a dynamic table, in which the user can search for a particular sequence or show only the coding or non-coding RNAs by using a free text form that filters the results in the table dynamically. The user can download the complete table in tabular format and two FASTA files containing the set of coding and non-coding RNAs separately.

Discussion

The coding potential prediction of nucleotide sequences is a key step in defining the repertoire of non-coding RNAs in a genome or transcriptome project, especially when dealing with non-model organisms. The performance of predictive tools for the computational characterization of RNA molecules, in analyses such as the prediction of specific RNA families22 or the estimation of networks of RNA-RNA23 or protein-RNA interactions24, sometimes depends on the training organism, with the number of false positives increasing when the tools are applied to evolutionarily distant species. In this work, we evaluated the performance of seven different supervised machine learning algorithms, using eukaryotic species from a variety of evolutionary clades, revealing their potential to be used in the development of novel and improved computational tools for the coding potential prediction of RNA sequences. Artificial intelligence has been widely used in computational biology25,26, but its application to characterize ncRNAs has been limited.

In this benchmark, we chose the trinucleotide counts as the main feature for the coding potential prediction, normalized by sequence length (i.e. each trinucleotide count was divided by the total length of the given sequence). Panwar et al.10 used nucleotide counts successfully for this purpose. They considered 40,905 non-coding RNAs from the Rfam release 10.0 database and 62,473 coding RNA sequences from the Human RefSeq database, divided into 50% for training and 50% for testing (i.e. the training and test sets were composed of 20,453 non-coding and 31,237 coding sequences). They used the counts of mono-, di-, tri-, tetra- and penta-nucleotides, and a combination of all counts, with the SVM method, and showed that trinucleotide counts alone are enough to predict the coding potential of ncRNAs with better accuracy. Our comparison of the machine learning algorithms revealed XGBoost as the algorithm with the best performance, predicting the coding potential of RNA sequences efficiently even when using the models of distantly related organisms. This result shows the usefulness of the approach for performing coding predictions in non-model organisms.
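The feature extraction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the RNAmining source code: the function name, the use of overlapping windows that advance one base at a time, the conversion of U to T and the skipping of windows containing ambiguous bases are all choices made for this example.

```python
from itertools import product

ALPHABET = "ACGT"
# All 4^3 = 64 trinucleotides in a fixed order, used as feature names.
TRINUCLEOTIDES = ["".join(p) for p in product(ALPHABET, repeat=3)]


def trinucleotide_features(sequence):
    """Return 64 trinucleotide counts, each divided by the sequence length."""
    seq = sequence.upper().replace("U", "T")  # treat RNA and DNA input alike
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    # Slide a window of width 3 one base at a time (overlapping windows are
    # an assumption; the text does not state how the windows are placed).
    for i in range(len(seq) - 2):
        kmer = seq[i:i + 3]
        if kmer in counts:  # skip windows with ambiguous bases such as N
            counts[kmer] += 1
    length = len(seq)
    if length == 0:
        return [0.0] * len(TRINUCLEOTIDES)
    # Normalization: each count divided by the total length of the sequence.
    return [counts[k] / length for k in TRINUCLEOTIDES]
```

Each sequence is thus represented by a fixed-length vector of 64 values regardless of its length, which is the form a classifier such as XGBoost expects as input.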

We implemented XGBoost in RNAmining, a stand-alone and web server tool flexible enough to be used in genome or transcriptome projects focused on both model and non-model eukaryotic organisms. Our tool outperformed similar approaches, such as CPAT11, CPC213, RNAcon10 and TransDecoder12. Both versions of the software are easy to use, with the web version providing a simple report and FASTA files that can be used in downstream analysis. It provides 15 models generated from eukaryotes from different evolutionary clades. Other models can be generated by the user with the stand-alone version, which is run with simple command line operations. These features facilitate its usage by experienced users and, especially, by those without any programming experience, who can easily perform large-scale predictions of the coding potential of nucleotide sequences in both genome and transcriptome initiatives.

Conclusions

  • We used pattern recognition approaches to investigate the coding potential prediction of RNAs, using 64 features (the counts of all possible trinucleotides).

  • We benchmarked seven machine learning algorithms (Naive Bayes, SVM, K-NN, Random Forest, XGBoost, ANN and DL) across 15 model organisms from different evolutionary branches and implemented the best one (XGBoost) in a novel tool (RNAmining).

  • RNAmining is a user-friendly web tool that uses the XGBoost algorithm to predict the coding potential of RNA sequences.

  • RNAmining was evaluated using other phylogenetically related and unrelated organisms that were not used in our training, demonstrating the efficiency of the tool even when applied to species phylogenetically distant from those used in training.

  • A comprehensive analysis using data from 15 organisms revealed that RNAmining outperformed other tools available in the literature (CPAT, CPC2, RNAcon and TransDecoder).

Data availability

Underlying data

Ensembl is an open-access genome browser for vertebrate genomes, available at https://www.ensembl.org/index.html.

RNAmining is a tool for coding potential prediction, freely available at https://rnamining.integrativebioinformatics.me/download.

Extended data

Zenodo: RNAmining Software Supplementary Material, http://doi.org/10.5281/zenodo.4699571 (ref. 19)

This project contains the following extended data:

  • Supplementary File S1: ANN and DL parameters.

  • Supplementary File S2: All metrics used for the comparison of the algorithms' performance across the 15 model organisms.

  • Supplementary File S3: All metrics used for the XGBoost algorithm's performance on the 9 evolutionarily related and unrelated organisms in which the method was evaluated.

  • Supplementary File S4: All metrics used for the XGBoost algorithm's performance on the artificial sequences created for the tested organisms.

Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

Software availability

RNAmining is available from: https://rnamining.integrativebioinformatics.me/

Source code available from: https://gitlab.com/integrativebioinformatics/RNAmining/-/tree/master/volumes/rnamining-front/assets/scripts/ and https://github.com/thaisratis/RNAmining

Archived source code as at time of publication: https://doi.org/10.5281/zenodo.4891914 (ref. 27)

License: MIT

Comments on this article (1)

Version 2 (published 08 Jun 2021, revised)
  • Reader Comment 26 Jul 2021
    PLM ROGINSKI, UVIC, France
    Dear authors,

    Thank you for your work.
    I reproduced your results thanks to the provided data and noticed that your model is working significantly better on phase 0 than ...
how to cite this article
Ramos TAR, Galindo NRO, Arias-Carrasco R et al. RNAmining: A machine learning stand-alone and web server tool for RNA coding potential prediction [version 2; peer review: 2 approved]. F1000Research 2021, 10:323 (https://doi.org/10.12688/f1000research.52350.2)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Key to Reviewer Statuses
Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.
Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.
Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.
Version 2 (published 08 Jun 2021, revised)
Reviewer Report 11 Jun 2021
Gilderlanio Araújo, Laboratory of Human and Medical Genetics, Postgraduate Program in Genetics and Molecular Biology, Universidade Federal do Pará, Belém, Brazil 
Approved
The authors made the corrections. The authors clarified ...
HOW TO CITE THIS REPORT
Araújo G. Reviewer Report For: RNAmining: A machine learning stand-alone and web server tool for RNA coding potential prediction [version 2; peer review: 2 approved]. F1000Research 2021, 10:323 (https://doi.org/10.5256/f1000research.57466.r87006)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Reviewer Report 09 Jun 2021
Andre Yoshiaki Kashiwabara, Department of Computer Science, Bioinformatics Graduate Program, Federal University of Technology - Parana, Curitiba, Brazil 
Approved
The authors have ...
HOW TO CITE THIS REPORT
Kashiwabara AY. Reviewer Report For: RNAmining: A machine learning stand-alone and web server tool for RNA coding potential prediction [version 2; peer review: 2 approved]. F1000Research 2021, 10:323 (https://doi.org/10.5256/f1000research.57466.r87007)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Version 1 (published 26 Apr 2021)
Reviewer Report 25 May 2021
Andre Yoshiaki Kashiwabara, Department of Computer Science, Bioinformatics Graduate Program, Federal University of Technology - Parana, Curitiba, Brazil 
Approved with Reservations
This paper presents RNAmining to predict the protein-coding potential of transcripts. The authors have compared many algorithms using cross-validation and selected XGBoost. The tool has the potential to be very useful. It is available online, and it is easy to ...
HOW TO CITE THIS REPORT
Kashiwabara AY. Reviewer Report For: RNAmining: A machine learning stand-alone and web server tool for RNA coding potential prediction [version 2; peer review: 2 approved]. F1000Research 2021, 10:323 (https://doi.org/10.5256/f1000research.55616.r84936)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 08 Jun 2021
    Thaís A. R. Ramos, Programa de Pós-Graduação em Bioinformática, Bioinformatics Multidisciplinary Environment (BioME), Instituto Metrópole Digital, Universidade Federal do Rio Grande do Norte, Natal, Brazil
    1- RNAmining was trained using coding genes and ncRNAs from the Ensembl database. It evaluates the patterns in the tri-nucleotide counts in any RNA sequence (which could be an ORF or ...
Reviewer Report 10 May 2021
Gilderlanio Araújo, Laboratory of Human and Medical Genetics, Postgraduate Program in Genetics and Molecular Biology, Universidade Federal do Pará, Belém, Brazil 
Approved with Reservations
  1. Are there two different datasets of model organisms?
    On the abstract "...15 organisms from different evolutionary branches..." 
    On the main text "RNAmining was evaluated through 24 organisms from the eukaryotic tree of life
    ...
HOW TO CITE THIS REPORT
Araújo G. Reviewer Report For: RNAmining: A machine learning stand-alone and web server tool for RNA coding potential prediction [version 2; peer review: 2 approved]. F1000Research 2021, 10:323 (https://doi.org/10.5256/f1000research.55616.r83973)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 08 Jun 2021
    Thaís A. R. Ramos, Programa de Pós-Graduação em Bioinformática, Bioinformatics Multidisciplinary Environment (BioME), Instituto Metrópole Digital, Universidade Federal do Rio Grande do Norte, Natal, Brazil
    1- The connectionist methods (e.g. Artificial Neural Networks and Convolutional Neural Networks) demand a validation dataset to adjust the model, because of the weights optimization stage and its hyperparameters. Thus, for ...
