shiny-pred: a server for the prediction of protein disordered regions

Intrinsically disordered proteins (IDPs) or intrinsically disordered regions (IDRs) are proteins, or segments within a protein chain, that lack a stable three-dimensional structure under normal physiological conditions. Accurate prediction of IDRs is challenging due to their genome-wide occurrence and the low ratio of disordered residues, which make them a difficult target for traditional classification techniques. Existing computational methods mostly rely on sequence profiles to improve accuracy, which is time-consuming and computationally expensive. The shiny-pred application is an ab initio, sequence-only disorder predictor implemented in R/Shiny. To make predictions, it uses convolutional neural network models trained on PDB sequence data. It can be installed and run locally on any operating system that supports R. A public version of the web application can be accessed at https://gmu-binf.shinyapps.io/shiny-pred

This article is included in the RPackage gateway.

Introduction
Experimental structure resolution of intrinsically disordered proteins and intrinsically disordered regions (IDPs/IDRs) is complex, lengthy and expensive, which has led to the development of a variety of computational approaches (He et al., 2009). Over 60 computational protein disorder prediction servers are currently available, although not all of them publicly. Methods can be classified into one of the following categories (Atkins et al., 2015): (i) ab initio or sequence-based, (ii) clustering, (iii) template-based, and (iv) meta or consensus.
shiny-pred is an ab initio predictor, which means it relies exclusively on amino acid sequence information to make disorder predictions. It uses prediction models based on convolutional neural networks and reduced protein alphabets. Currently there are three available models, each built using the same training protein data from PDB (Berman et al., 2000) but differing in convolutional neural network architecture. Since it does not rely on sequence profiles to make predictions, it is fast enough to be used in proteome-wide disorder prediction scenarios. It performs at the same level as, or outperforms, other state-of-the-art sequence-only methods, achieving an accuracy of 0.76 and an AUC of 0.85 on the publicly available CASP10 dataset (Monastyrskyy et al., 2014), at faster speeds.
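The idea of a reduced protein alphabet can be illustrated with a short sketch. The article does not state which alphabet shiny-pred uses, so the five-group mapping below (REDUCED_GROUPS) is a hypothetical grouping by broad physicochemical property, chosen only for illustration. The example is written in Python rather than R and is not taken from the tool's source code.

```python
# Hypothetical grouping of the 20 standard amino acids into 5 classes.
# This is an illustrative assumption, NOT the alphabet used by shiny-pred.
REDUCED_GROUPS = {
    "h": "AVLIMC",  # hydrophobic
    "a": "FWYH",    # aromatic
    "p": "STNQ",    # polar, uncharged
    "c": "KRDE",    # charged
    "s": "GP",      # special backbone conformation
}

# Invert the table: residue letter -> group label.
RESIDUE_TO_GROUP = {
    aa: group for group, members in REDUCED_GROUPS.items() for aa in members
}


def reduce_sequence(seq, unknown="x"):
    """Map each residue to its group label; unrecognised letters become `unknown`."""
    return "".join(RESIDUE_TO_GROUP.get(aa, unknown) for aa in seq.upper())


def one_hot(reduced, alphabet="hapcs"):
    """One-hot encode a reduced sequence; unknown symbols give an all-zero row."""
    return [[1 if sym == a else 0 for a in alphabet] for sym in reduced]


if __name__ == "__main__":
    reduced = reduce_sequence("MKGF")
    print(reduced)
    print(one_hot(reduced))
```

A smaller alphabet shrinks the number of input channels per residue (here from 20 to 5), which is one plausible reason such encodings keep sequence-only models compact.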

Methods
Implementation
shiny-pred is written in the R programming language (R Core Team, 2017), and the web application is implemented using the Shiny R package v1.1.0 (Chang, 2018).
Currently, three convolutional neural network models are made available by our application: (i) cnn-64-ker-local is a one-layer convolutional network (step size of 1 and window size of 32) with 64 kernels and local max pooling; (ii) cnn-128-ker-local implements one convolutional layer (step size of 1 and window size of 32) with 128 kernels and local max pooling; and (iii) cnn-2-conv-local implements two convolutional layers (64 and 32 kernels) with local max pooling.
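To make the terms "window size", "step size" and "local max pooling" concrete, the following Python sketch applies a single 1D convolution kernel to an encoded sequence and then pools the result over small local windows. The article gives only the window size (32), step size (1) and kernel counts; the absence of padding, the ReLU activation and the pool size of 2 used here are assumptions, and the code is not the tool's R implementation.

```python
def conv1d(x, kernel, bias=0.0, stride=1):
    """Apply one kernel as an unpadded ("valid") 1D convolution.

    x      : list of positions, each a list of channel values
    kernel : list of `window` rows, each a list of per-channel weights
    Returns one activation per window position (ReLU is an assumption).
    """
    window = len(kernel)
    out = []
    for start in range(0, len(x) - window + 1, stride):
        total = bias
        for k in range(window):
            for c in range(len(kernel[k])):
                total += x[start + k][c] * kernel[k][c]
        out.append(max(0.0, total))  # ReLU
    return out


def local_max_pool(values, pool_size=2):
    """Keep the maximum of each non-overlapping local window of `pool_size`."""
    return [
        max(values[i:i + pool_size])
        for i in range(0, len(values) - pool_size + 1, pool_size)
    ]


if __name__ == "__main__":
    seq_len, channels, window = 40, 1, 32
    x = [[1.0] * channels for _ in range(seq_len)]
    kernel = [[1.0] * channels for _ in range(window)]
    feature_map = conv1d(x, kernel)          # 40 - 32 + 1 = 9 positions
    print(len(feature_map), local_max_pool(feature_map))
```

A layer with 64 or 128 kernels would repeat conv1d once per kernel, each with its own learned weights, producing that many feature maps.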

Operation
Our tool has two operation modes: predicting disordered residues in protein sequences (prediction) and benchmarking the predictor's performance against sequences with known disorder information (benchmark). The mode is selected automatically based on the format of the input sequences. Users can upload a sequence file, type or paste a sequence into the text area, or select pre-loaded examples from a list.
In prediction mode, the amino acid sequences are expected to be in FASTA format (Figure 1). In benchmark mode, input sequences in FASTA format are expected to have an additional line containing the disorder information (D=disordered, O=ordered). Multiple sequences can be submitted at once; several pre-loaded examples covering both types of submission (prediction and benchmark modes) are available. In both modes, the application shows a result panel in which, for each input sequence, the per-residue probability of disorder is plotted (Figure 2).
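One way the automatic mode selection could work is sketched below in Python. This is an illustration under stated assumptions, not the tool's R code: it assumes each record consists of a header line, a single sequence line and, in benchmark mode, a single annotation line of the same length containing only D and O. The function names parse_records and detect_mode are hypothetical.

```python
DISORDER_CHARS = set("DO")  # D = disordered, O = ordered


def parse_records(text):
    """Split FASTA-like text into (header, [lines]) records."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append((line[1:], []))
        elif records:
            records[-1][1].append(line)
    return records


def detect_mode(records):
    """Return 'benchmark' only if every record carries a matching D/O line."""
    if not records:
        return "prediction"
    for _, lines in records:
        if len(lines) != 2:
            return "prediction"
        seq, labels = lines
        same_length = len(seq) == len(labels)
        only_do = set(labels.upper()) <= DISORDER_CHARS
        if not (same_length and only_do):
            return "prediction"
    return "benchmark"


if __name__ == "__main__":
    prediction_input = ">seq1\nMKVLAAGIT\n"
    benchmark_input = ">seq1\nMKVLAAGIT\nDDDOOOOOD\n"
    print(detect_mode(parse_records(prediction_input)))
    print(detect_mode(parse_records(benchmark_input)))
```

Because D and O are also valid one-letter residue codes, a length check alone would not be enough to tell an annotation line from a wrapped sequence line, which is why this sketch requires single-line sequences.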
(1) Prediction mode
The workflow for protein disorder prediction is: (i) input the target sequences (in FASTA format) in the text area; (ii) select the model to use for the prediction (the default is cnn-128-ker-local) and submit the sequences for prediction; (iii) visualize and download the results.
(2) Benchmark mode
In benchmark mode, input sequences are expected to have an extra line with the actual disorder information to be used as the benchmark. Result tables include two extra columns (actual class and match) showing the actual disorder information and whether the prediction was correct for each residue. An extra panel (Benchmark) shows the ROC curve along with other common binary classification metrics (sensitivity, specificity, balanced accuracy and Matthews correlation coefficient).
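The binary metrics listed for the Benchmark panel follow standard definitions, which the Python sketch below computes from per-residue actual and predicted classes. It is an illustration rather than the tool's R code; treating D (disordered) as the positive class and returning 0.0 when a denominator is zero are assumptions made here.

```python
import math


def binary_metrics(actual, predicted, positive="D"):
    """Compute sensitivity, specificity, balanced accuracy and MCC.

    actual, predicted : equal-length strings of per-residue classes (D or O).
    """
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    balanced_accuracy = (sensitivity + specificity) / 2

    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    actual = "DDDDOOOOOO"
    predicted = "DDDOOOOOOD"
    for name, value in binary_metrics(actual, predicted).items():
        print(f"{name}: {value:.3f}")
```

Balanced accuracy and MCC are useful here because disordered residues are a small minority, so plain accuracy can look high even for a predictor that labels almost everything as ordered.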

Use cases
We use shiny-pred to predict disordered regions within the publicly available CASP10 benchmark dataset. The dataset contains 94 target sequences, each annotated with disorder/order information at the residue level. The annotated dataset is provided as an example ('CASP_all') and can be selected from the example selection list on the 'Sequence Input' tab. Figure 3 shows the input panel after the dataset is selected and loaded. Predictions per sequence can be viewed and downloaded from the 'Results' tab, while the 'Benchmark' tab provides a summary of the performance using binary and statistical metrics. Figure 4 shows the server performance for the input dataset, achieving an AUC value of 0.85 and a balanced accuracy of 0.75.
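For readers who want to reproduce an AUC value from downloaded per-residue probabilities, the Python sketch below computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen disordered residue receives a higher score than a randomly chosen ordered one, with ties counted as one half. It is a generic illustration with made-up toy scores, not the tool's R code, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores, positive="D"):
    """Area under the ROC curve via pairwise comparison of scores.

    labels : per-residue actual classes (D or O)
    scores : per-residue predicted probability of disorder
    """
    pos = [s for l, s in zip(labels, scores) if l == positive]
    neg = [s for l, s in zip(labels, scores) if l != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one residue of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    labels = "DDOO"                 # toy annotation, not CASP10 data
    scores = [0.9, 0.4, 0.6, 0.1]   # toy probabilities
    print(roc_auc(labels, scores))
```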

Summary
This article presents shiny-pred, a sequence-only ab initio web application for predicting protein disorder. It is based on reduced amino acid alphabets and convolutional neural networks. Being fast and accurate, it is suitable for large proteome-wide experiments.

Grant information
The author(s) declared that no grants were involved in supporting this work.

Open Peer Review
The authors presented yet another neural network-based disorder prediction tool, written in R, trained on PDB data and benchmarked on the CASP10 dataset, and they claim that the tool outperforms other existing tools in terms of both calculation speed and performance.
We tried using the tool to predict known disordered sequences and found that the predictions are accurate and similar to those of other tools, such as the IUPRED and DISOPRED3 servers, for well-known disordered sequences such as p53 and Histatin5.
In terms of concerns, I have the following comments. In general, I find that the paper does not describe the motivation, methods and results in a self-sufficient manner; these could be elaborated further.

1.
As the authors state in the paper, over 60 tools already exist for disorder prediction. The justification for requiring another tool is not clearly stated.

2.
The authors mention that they used PDB data for training the neural network. Did they use all currently available PDB data for training? Does any overlap exist between the training and benchmark datasets? I ask because the CASP10 dataset that the authors used for benchmarking was released in 2012, and would therefore be a subset of the training data if they used all PDB data published to date.

3.
The authors claim that their method is faster than the existing methods. It would be helpful to provide evidence for this claim, such as benchmarking data.

4.
AUC and balance accuracy are the two metrics used for evaluating the performance of the

5.