Keywords
Disordered proteins, machine learning, convolutional neural networks, R, Shiny
This article is included in the RPackage gateway.
This article is included in the Artificial Intelligence and Machine Learning gateway.
Experimental structure resolution of intrinsically disordered proteins and intrinsically disordered regions (IDPs/IDRs) is complex, lengthy and expensive, which has led to the development of a variety of computational approaches (He et al., 2009). Over 60 computational protein disorder prediction servers are currently available, although not all of them are publicly accessible. Methods can be classified into one of the following categories (Atkins et al., 2015): (i) ab initio or sequence-based, (ii) clustering, (iii) template-based, and (iv) meta or consensus.
shiny-pred is an ab initio predictor, which means it relies exclusively on amino acid sequence information to make disorder predictions. It uses prediction models based on convolutional neural networks and reduced protein alphabets. Three models are currently available, each built using the same training protein data from PDB (Berman et al., 2000) but differing in the convolutional neural network architecture. Because it does not rely on sequence profiles to make predictions, it is fast enough to be used in proteome-wide disorder prediction scenarios. It performs at the same level as, or outperforms, other state-of-the-art sequence-only methods, achieving an accuracy of 0.76 and an AUC of 0.85 on the publicly available CASP10 dataset (Monastyrskyy et al., 2014), at faster speeds.
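To illustrate what a reduced protein alphabet does, the sketch below collapses the 20 standard amino acids into six physicochemical groups before any model sees the sequence. The grouping and the names `REDUCED_GROUPS` and `reduce_sequence` are illustrative assumptions; the article does not state which reduced alphabet shiny-pred uses.

```python
# Hypothetical six-letter reduced alphabet. The grouping is illustrative only
# and is NOT taken from shiny-pred.
REDUCED_GROUPS = {
    "A": "AVLIMC",  # aliphatic / hydrophobic
    "F": "FWYH",    # aromatic
    "S": "STNQ",    # polar, uncharged
    "K": "KR",      # positively charged
    "D": "DE",      # negatively charged
    "G": "GP",      # conformationally special
}

# Invert the grouping into a residue -> representative letter lookup.
LOOKUP = {aa: rep for rep, members in REDUCED_GROUPS.items() for aa in members}


def reduce_sequence(seq, unknown="X"):
    """Map each residue to its group letter; unrecognised symbols become `unknown`."""
    return "".join(LOOKUP.get(aa, unknown) for aa in seq.upper())


print(reduce_sequence("MKVLDE"))
```

A smaller alphabet shrinks the input encoding, which is one plausible reason such models can be fast; the trade-off is that residues within a group become indistinguishable.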
shiny-pred is written in the R programming language (R Core Team, 2017), and the web application is implemented using the Shiny R package v1.1.0 (Chang, 2018).
Currently, three convolutional neural network models are made available by our application:
(i) cnn-64-ker-local, a one-layer convolutional network (step size 1, window size 32) with 64 kernels and local max pooling; (ii) cnn-128-ker-local, a one-layer convolutional network (step size 1, window size 32) with 128 kernels and local max pooling; and (iii) cnn-2-conv-local, a network with two convolutional layers (64 and 32 kernels) and local max pooling.
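The building blocks named above (a window of 32 moved with step size 1, followed by local max pooling) can be sketched in plain Python for a single kernel on a single input channel. This is a simplified illustration, not the models' Keras implementation: the absence of padding, the pool size of 2, the toy input and the function names are all assumptions, and the real models apply 64 or 128 kernels to an encoded sequence.

```python
def conv1d_valid(signal, kernel, stride=1):
    """Slide `kernel` over `signal` without padding and return the dot products."""
    width = len(kernel)
    return [
        sum(s * k for s, k in zip(signal[i:i + width], kernel))
        for i in range(0, len(signal) - width + 1, stride)
    ]


def local_max_pool(values, pool_size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [
        max(values[i:i + pool_size])
        for i in range(0, len(values) - pool_size + 1, pool_size)
    ]


# Toy input: 100 positions, one channel (assumed numeric encoding).
signal = [float(i % 5) for i in range(100)]
kernel = [1.0 / 32] * 32            # window size 32, as in the models above

feature_map = conv1d_valid(signal, kernel, stride=1)
pooled = local_max_pool(feature_map, pool_size=2)   # pool size is an assumption
print(len(feature_map), len(pooled))
```

Under these assumptions a sequence of length n yields n - 31 window positions per kernel, so sequences shorter than the window would need padding, which the sketch does not handle.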
The models were created, trained and accessed using the keras R package v2.1.6 (Allaire & Chollet, 2018).
Our tool has two operation modes: predicting disordered residues in protein sequences (prediction) and benchmarking the predictor's performance against sequences with known disorder information (benchmark). The mode is selected automatically based on the format of the input sequences. Users can upload a sequence file, type or paste a sequence into the text area, or select pre-loaded examples from a list.
When in prediction mode, the amino acid sequences are expected to be in FASTA format (Figure 1). In benchmark mode, input sequences in FASTA format are expected to have an additional line containing the disorder information (D = disordered, O = ordered). Multiple sequences can be submitted at once; several pre-loaded examples covering both submission types (prediction and benchmark modes) are provided. In both modes, the application shows a result panel in which, for each input sequence, a graph of the per-residue probability of disorder is plotted (Figure 2).
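One way the automatic mode selection could work is sketched below. The detection rule (a final line made up only of D and O characters with the same length as the sequence) and the function names are assumptions for illustration; the article does not describe how shiny-pred itself decides.

```python
def parse_records(text):
    """Split FASTA-like text into (header, lines) records, skipping blank lines."""
    records, header, lines = [], None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, lines))
            header, lines = line[1:], []
        elif header is not None:
            lines.append(line.upper())
    if header is not None:
        records.append((header, lines))
    return records


def split_annotation(lines):
    """Return (sequence, annotation) for one record; annotation is None if absent."""
    if len(lines) >= 2 and set(lines[-1]) <= {"D", "O"}:
        sequence = "".join(lines[:-1])
        if len(sequence) == len(lines[-1]):
            return sequence, lines[-1]
    return "".join(lines), None


def detect_mode(text):
    """Assumed rule: benchmark only if every record carries a D/O annotation line."""
    parsed = [split_annotation(lines) for _, lines in parse_records(text)]
    annotated = bool(parsed) and all(ann is not None for _, ann in parsed)
    return "benchmark" if annotated else "prediction"


print(detect_mode(">p1\nMKVLDE\n"))
print(detect_mode(">p1\nMKVLDE\nDDOOOO\n"))
```

Because D is also the one-letter code for aspartate, a rule like this can misread a wrapped sequence whose last line happens to contain only D residues; a production parser would need a stricter convention.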
(1) Prediction mode
The workflow for protein disorder prediction is:
(i) Input the target sequences (in FASTA format) in the text area;
(ii) Select the model to use for the prediction (default is cnn-128-ker-local) and submit the sequence for prediction;
(iii) Visualize and download results.
(2) Benchmark mode
In benchmark mode, input sequences are expected to have an extra line with the actual disorder information to be used as the benchmark. Result tables gain two extra columns (actual class and match), showing the actual disorder information and whether the prediction was correct for each residue. An extra panel (Benchmark) shows the ROC curve along with other common binary classification metrics (sensitivity, specificity, balanced accuracy and Matthews correlation coefficient).
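The binary metrics listed above follow from the per-residue confusion counts. The sketch below uses their standard definitions, treating D (disordered) as the positive class; the function name and the toy strings are illustrative, and this is not the application's own code.

```python
import math


def binary_metrics(actual, predicted, positive="D"):
    """Compute sensitivity, specificity, balanced accuracy and MCC per residue."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)

    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "mcc": mcc,
    }


# Toy example: 4 disordered and 6 ordered residues, two of them misclassified.
metrics = binary_metrics("DDDDOOOOOO", "DDDOOOOOOD")
print(metrics)
```

Balanced accuracy averages sensitivity and specificity, so it is less affected than plain accuracy by the class imbalance typical of disorder data, where ordered residues usually dominate.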
We use shiny-pred to predict disordered regions within the publicly available CASP10 benchmark dataset. The dataset contains 94 target sequences, each annotated with disorder/order information at the residue level. The annotated dataset is provided as an example (‘CASP_all’) and can be selected from the example selection list on the ‘Sequence Input’ tab. Figure 3 shows the input panel after the dataset is selected and loaded. Predictions per sequence can be viewed and downloaded from the ‘Results’ tab, while the ‘Benchmark’ tab provides a summary of the performance using binary and statistical metrics. Figure 4 shows the server performance for the input dataset, achieving an AUC value of 0.85 and a balanced accuracy of 0.75.
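For readers unfamiliar with the AUC figure reported here, the sketch below computes it from per-residue disorder probabilities using its rank interpretation: the probability that a randomly chosen disordered residue receives a higher score than a randomly chosen ordered one, with ties counted as one half. The function name and toy values are illustrative and are not taken from the CASP10 results.

```python
def roc_auc(labels, scores, positive="D"):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for lab, s in zip(labels, scores) if lab == positive]
    neg = [s for lab, s in zip(labels, scores) if lab != positive]
    if not pos or not neg:
        raise ValueError("need at least one residue of each class")
    # Each (positive, negative) pair scores 1 if ranked correctly, 0.5 on a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: two disordered and two ordered residues.
auc = roc_auc("DDOO", [0.9, 0.4, 0.5, 0.1])
print(auc)
```

The pairwise loop is quadratic in the number of residues, which is acceptable for a small illustration; a sort-based rank computation would be preferable for a full benchmark set.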
This article presents shiny-pred, a sequence-only ab initio web application for predicting protein disorder. It is based on reduced amino acid alphabets and convolutional neural networks. Being fast and accurate, it is suitable for large proteome-wide experiments.
Software available from: https://gmu-binf.shinyapps.io/shiny-pred
Source code available from: https://github.com/mauricioob/shiny-pred
Archived source code as at time of publication: https://doi.org/10.5281/zenodo.2567259 (Mauricio, 2019).
License: GNU General Public License v3 (GPL-3)
The authors are grateful for the computational facilities provided by Novartis Institutes of Biomedical Research.
Is the rationale for developing the new software tool clearly explained?
No
Is the description of the software tool technically sound?
Partly
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
No
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
No
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
No
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Membrane Biophysics, Protein Structures and Folding, Mechanotransduction, Statistical mechanics of Biological Systems, Integrative Modeling, Multiscale Biomolecular Simulations
Is the rationale for developing the new software tool clearly explained?
Partly
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Computational biology, machine learning, optimization.
Version 1: 28 Feb 19