Keywords
Generative Adversarial Network, De novo Protein Design, Machine Learning, Neural Network, Deep Learning, Helical Proteins, Protein backbone, Long Short-Term Memory
In this new version we took the advice of the reviewers and added more results that show the pros and cons of the neural network setup, as well as explaining the similarities and differences between our manuscript and other similar published papers.
See the authors' detailed response to the review by Matias Valdenegro-Toro
See the authors' detailed response to the review by Jan Ludwiczak and Stanislaw Dunin-Horkawicz
The ability of amino acid sequences to fold into globular protein molecules gives proteins their large functional diversity, mediating all the functional aspects of living organisms, and has thus held the attention of biochemists for decades. The fusion of machine learning and deep learning with computational biology is accelerating research in both fields and bringing humanity closer to a setup in which most biological research is performed quickly, cheaply, and safely in silico, with only the most crucial aspects translated to the wet lab. Access to a large database of protein crystal structures has made it possible to use machine learning to design proteins computationally.
De novo protein design (i.e. design from the beginning) is very well explained in this review1. Proteins fold into a specific shape depending on the sequence of their amino acids, and shape dictates function. The driving forces that allow proteins to fold are the hydrogen bond interactions within the backbone and between the side chains, the Van der Waals forces, and principally the interaction of hydrophobic side chains within the core. The space of all possible sequences for all protein sizes is extremely large (as an example, there are 20^200 possibilities for a 200-residue protein). Thus, it is not surprising that natural proteins exist in clusters close to each other: proteins evolve away from a central functional protein, folding correctly while acquiring new folds and functions, rather than going through the tedious ordeal of finding a totally new protein structure within the space of all possibilities. Hence, even though the Protein Data Bank adds about 10,000 new structures to its repository every year, most of these new structures are not unique folds.
The relationship between the sequence of a protein and its specific structure is understood, but we still lack a unified, absolute solution for calculating one from the other. This is why some research groups have generated man-made protein designs by altering already existing natural proteins2, since randomly finding a functionally folded protein in the space of all possible protein sequences is more or less statistically impossible. Other researchers have attempted de novo protein design by assembling a topology from short peptide fragments taken from natural protein crystal structures3,4; these fragments are chosen statistically depending on the secondary structures in which they are found. Sometimes this fragment system is combined with first physical principles to model the loops between secondary structures and achieve a desired three-dimensional topology5. Others have used parametric equations to study and specify the desired protein geometry6–11. These solutions employ an energy function, such as REF2015, that uses fundamental physical theories, statistical mechanical models, and observations of protein structures to approximate the potential energy of a protein12. Knowing the protein's potential energy allows us to guide our search for the structure of a protein given its sequence (the structure resides at the global energy minimum of that protein sequence), thus attempting to connect the sequence of a protein with its structure. The critical tool is the energy function: the higher its accuracy, the higher our confidence that the computed structure is the real natural structure. The same energy function used to perform structure prediction (going from a known sequence to find the unknown three-dimensional structure) can also be used to perform fixed-backbone design (going from a known three-dimensional structure to find the sequence that folds into it). This is where this paper comes in: whereas in de novo design neither backbone nor sequence is known, knowing one allows finding the other using the same energy function1, and a good starting point is to design the backbone.
Other researchers have used machine learning for protein sequence design, employing constraints (Cα–Cα distances) as the input features for the network and using a sliding window to read a sequence of residues, taking their types and constraints and then predicting the next residue, giving an amino acid sequence as the output prediction13. This architecture reported an accuracy of 38.3% and performs what is called sequence design: designing a sequence for a backbone, so that when the protein is synthesised it folds to that backbone. In fact, the protocol in 5 first generated an all-valine backbone and then sequence designed that backbone. In this paper, we want to computationally generate a backbone so it can be sequence designed using other protocols, such as RosettaDesign4,14 or the protocol from 13.
The protein’s backbone can be folded using the ϕ and ψ torsion angles: ϕ is the rotation around the N–Cα bond and ψ the rotation around the Cα–C bond. These rotations primarily move the amino acid's backbone (not the side chains), and the angles can therefore be used as features to control the topology of a backbone. They were even among the features used to fold proteins and predict their structures in AlphaFold15.
But the question is: how do we decide the ideal angles for a helix, the length of each helix, the number of helices, and the lengths and angles of the loops between the helices, such that the result is a compact folded protein backbone? These numerous values can be solved statistically using neural networks, especially since we want to use the structures in the PDB to forward design (rather than discover) new protein folds that are not a result of evolution.
The deep neural network architecture that we chose was a long short-term memory (LSTM) network16. The LSTM is usually used in natural language and data sequence processing, but in our model it was used to generate a sequence of ϕ and ψ angles. The model was constructed as a stack of LSTM layers, followed by fully connected layers, followed by a mixture density network (MDN)17, and worked by using random noise (a latent input) to build the values of the ϕ and ψ angles18.
Our effort in this paper was to use deep learning to learn the general fold of natural proteins, and to use this generalising statistical concept to design novel protein backbone topologies: obtaining only the three-dimensional backbone structure so that it can be sequence designed using other protocols. Our research at this stage was a proof of concept, concerned only with obtaining a new and unique folded ideal helical protein backbone rather than a protein with a specific sequence, function, or predetermined structure. Our system produced random yet compact helical backbone topologies only.
Previous attempts at backbone design using deep learning were successful. It was possible to generate protein backbones using a GAN in which a distance map served as the network's features; after a new distance map was generated, the Alternating Direction Method of Multipliers was used to convert it to cartesian coordinates, and Rosetta Remodel was then used to trace the backbone into a folded protein structure19,20. Another paper detailed a variational auto-encoder (VAE) that reconstructed the backbone of the variable region of an immunoglobulin21, using a distance map as features and evaluating the output using a distance map and torsion angles. Our method is similar to 19, yet instead of using a GAN to generate structures in chunks of 16, 64, or 128 residues, or generating a specific structure type as in 21 and 20, we used an LSTM with the torsion angles as features to generate variable-length structures between 80 and 150 amino acids using a simpler architecture, though producing only helical structures.
The following sections describe the steps used to generate the augmented training dataset, detail the neural network architecture, and explain how the output was optimised and evaluated.
The entire PDB database was downloaded on 28 June 2018 (~150,000 structures), and each entry was divided into its constituent chains, resulting in individual separate structures (i.e. each PDB file had only a single chain). Each structure was analysed and chosen only if it met the following criteria: it contained only polypeptides; it had a size between 80 and 150 amino acids without any breaks in the chain (a continuous polypeptide); the number of residues that made up helices and sheets was larger than the number of residues that made up loops; and the final structure had an Rg (radius of gyration) value of less than 15 Å. The chosen structures were then further filtered by a human to ensure only the desired structure concepts were selected, removing the structures that slipped through the initial computational filter. Furthermore, a diversity of structure folds was ensured rather than numerous repeats of the same fold (the haemoglobin fold was quite abundant). In previous attempts, a mixture of different structure classes was used, where some structures were only helices, some were only sheets, and the remaining were a mix of the two. However, that proved challenging when optimising the network, and as such a dataset made up of only helical protein structures was chosen for this initial proof of concept. The final dataset had 607 ideal helical structures. These structures were then cleaned (non-amino acid atoms were removed) in preparation for pushing them through the Rosetta version 3 modelling software, which only takes in polypeptide molecules.
These 607 structures were augmented using the Rosetta FastRelax protocol22. This protocol performs multiple cycles of packing and minimisation; in other words, it performs slight random angle moves on the backbone and side chains in an attempt to find the lowest-scoring variant, and the random backbone angle moves were what we were after. Its originally intended function is to move a structure slightly to find the conformation of the backbone and side chains that corresponds to the lowest energy state as per the REF2015 energy function. Since the protocol performs random moves, a structure relaxed on two separate occasions will result in two molecules that look very similar, with similar minimum energy scores, but technically have different ϕ and ψ angle values. This is the concept we used to augment our structures: each structure was relaxed 500 times to give a final dataset size of 303,500 structures.
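As an illustration, a minimal PyRosetta sketch of this augmentation step could look like the following (the file names and loop structure are ours, not the exact pipeline code):

```python
# Sketch of the FastRelax augmentation: each structure is relaxed 500
# times, and every relaxed copy is kept as a new training example with
# slightly different phi/psi angles. Assumes PyRosetta 4 is installed.
from pyrosetta import init, pose_from_pdb, get_fa_scorefxn
from pyrosetta.rosetta.protocols.relax import FastRelax

init()
scorefxn = get_fa_scorefxn()        # REF2015 by default in PyRosetta 4
relax = FastRelax()
relax.set_scorefxn(scorefxn)

for i in range(500):
    pose = pose_from_pdb('structure.pdb')       # hypothetical input file
    relax.apply(pose)                           # random packing/minimisation moves
    pose.dump_pdb('structure_relaxed_{}.pdb'.format(i))
```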
Using only the ϕ and ψ angle values from a crystal structure, it was possible to re-fold the structure back to its correct native fold; thus these angles were the only features required to correctly fold a structure. Figure 1A details the range of angles in the un-augmented data. Each amino acid's ϕ and ψ angle values were extracted and tabulated as in Table 1. This was the dataset used to train the neural network.
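A minimal sketch of this feature extraction using PyRosetta (the file name is illustrative):

```python
# Read the phi and psi torsion angles of every residue in a structure
# and tabulate them as (phi, psi) pairs, one row per residue.
from pyrosetta import init, pose_from_pdb

init()
pose = pose_from_pdb('structure.pdb')   # hypothetical input file
angles = [(pose.phi(i), pose.psi(i))
          for i in range(1, pose.total_residue() + 1)]
# In the dataset itself the angles were shifted into the 0-360 degree range.
```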
1A: Ramachandran plot of the dataset showing the ϕ and ψ angles of each amino acid for each structure. This is the unaugmented data of structures that are made only of helices. Green represents the angles of amino acids in loops, while red represents the angles of amino acids in helices. Some orange can be seen where the DSSP algorithm classified amino acids as sheets (though there were none). One point to note: the angles here are represented in the range −180° to 180°, as is conventional, while in the actual dataset the range was 0° to 360°. 1B: The network's output ϕ and ψ angles for 25 structures after the relaxation step. The green dots represent the angles of amino acids within loops, and the red dots those within helices, clustering around the same location as in Figure 1A within the fourth quadrant, as is desired for an α-helix, which has ideal angles around (−60°, −45°). These angles correspond to the 25 structures in Figure 4, and the helices had an angle range of (−127.4° < ϕ < −44.7°, −71.3° < ψ < 30.6°), not including the outliers. The purple dots represent the helices in the control structures, and the black dots the loops.
The value 360.0° was used for the first missing angle while 0.0° was used to represent no residues.
The model in Figure 2 was built using the SenseGen model as a template18 and consisted of a network constructed from an LSTM layer with 64 nodes, followed by two fully connected dense layers with 32 nodes for the first layer and 12 nodes for the second; both employed a sigmoid activation function:
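$$\sigma(x) = \frac{1}{1 + e^{-x}}$$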
The full protocol showing the structure of the neural network model and its output. The model employs an LSTM network. The network's output is the generated ϕ and ψ angle sequences, which were applied to a primary structure (a fixed 150-valine straight structure generated by PyRosetta), resulting in the development of the secondary structures but not a final compact structure, due to suboptimal loop structures resulting from their variability in the dataset. To overcome this, the structure was relaxed to bring the secondary structure helices together. This did result in more compact structures but was not always ideal, thus a filter was used to discard non-ideal structures and keep an ideal structure when one was generated.
This was followed by an MDN layer employing an MDN activation function:
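$$p(y \mid x) = \sum_{c} \alpha_c(x)\, \mathcal{D}\!\left(y;\ \lambda_{1,c}(x),\ \lambda_{2,c}(x)\right)$$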
c: the index of the corresponding mixture component. α: the mixing parameter. 𝓓: the corresponding distribution to be mixed. λ: the parameters of the distribution 𝓓; as we take 𝓓 to be a Gaussian distribution, λ1 corresponds to the conditional mean and λ2 to the conditional standard deviation. The training was done using the Adam optimiser23; for each parameter ωj:
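$$\omega_{j,t+1} = \omega_{j,t} - \frac{\eta}{\sqrt{v_t + \epsilon}}\, g_t$$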
η: the initial learning rate. vt: the exponential average of squares of gradients. gt: the gradient at time t along ωj. The loss was defined as the root mean squared difference between the sequence of inputs and the sequence of predictions:
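$$L = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left(x_{t+1} - y_t\right)^2}$$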
yt: the output at step t. xt: the input sample, where each prediction is fed back as the next input, i.e. xt+1 = yt.
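Putting the pieces together, the architecture can be sketched as follows. This is a tf.keras sketch under stated assumptions, not the authors' exact code: the number of mixture components K and the input shape are assumptions, and the layer sizes follow the text above.

```python
# LSTM(64) -> Dense(32, sigmoid) -> Dense(12, sigmoid) -> MDN head
import tensorflow as tf

K = 3  # number_of_mixtures (assumed value)

inputs = tf.keras.Input(shape=(None, 1))             # a sequence of angle values
x = tf.keras.layers.LSTM(64, return_sequences=True)(inputs)
x = tf.keras.layers.Dense(32, activation='sigmoid')(x)
x = tf.keras.layers.Dense(12, activation='sigmoid')(x)

# MDN head: per step, K mixing coefficients (pi), K means (mu), and
# K standard deviations (sigma; exponentiated to keep them positive).
pi    = tf.keras.layers.Dense(K, activation='softmax')(x)
mu    = tf.keras.layers.Dense(K)(x)
sigma = tf.keras.layers.Dense(K, activation=tf.exp)(x)

model = tf.keras.Model(inputs, [pi, mu, sigma])
# Training used the Adam optimiser (learning rate 0.0003, batch size 4)
# with the RMSE loss defined above.
```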
The network used random noise as a starting seed, and generation proceeded as follows (a sketch of this loop is shown below):
1. A single uniformly distributed random number in [0, 1) was taken as the first predicted value; this value was then reshaped to the shape of the last item of the predicted values, resulting in a final shape of (batch_size, step_number, 1).
2. The network predicted the parameters of the new value (µ, σ, π) several times (according to the number_of_mixtures value) and selected a single mixture randomly, weighted according to the π values.
3. It then predicted the next value from the normal distribution using the µ and σ values of the chosen mixture.
4. It added the final value to the prediction chain and then returned to step 2 until the predefined sequence length was obtained.
5. The initial random number was stripped from the returned sequence.
Once the networks were constructed, the dataset was normalised and the training was done using the following parameters: a learning rate of 0.0003 with a batch size of 4 over 18,000 epochs.
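A sketch of the sampling loop described above, assuming a `model` that returns per-step (π, µ, σ) as in the architecture sketch; the function name and shapes are illustrative:

```python
import numpy as np

def generate_sequence(model, seq_len=150, batch_size=1):
    # Step 1: a single uniform random number in [0, 1) seeds the chain.
    seq = [np.random.uniform(0.0, 1.0)]
    while len(seq) <= seq_len:
        x = np.array(seq, dtype=np.float32).reshape(batch_size, len(seq), 1)
        # Take the (pi, mu, sigma) parameters predicted for the last step.
        pi, mu, sigma = (np.asarray(p)[0, -1] for p in model.predict(x))
        # Step 2: pick one mixture component at random, weighted by pi.
        c = np.random.choice(len(pi), p=pi / pi.sum())
        # Step 3: draw the next value from the chosen Gaussian component.
        seq.append(np.random.normal(mu[c], sigma[c]))
    # Step 5: strip the initial random seed from the returned sequence.
    return np.array(seq[1:])
```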
The output of the neural network was always a sequence of 150 ϕ and ψ angle combinations for a structure with 150 amino acids. A straight chain of 150 valines was computationally constructed using PyRosetta and used as a primary structure. Each amino acid in the primary structure had its angles changed according to the generated ϕ and ψ angle values, which folded the primary structure, resulting in secondary-structure helices with loops between them. The COOH end was trimmed, if it was a loop, until it reached the first amino acid that comprised a helix; thus structures of variable sizes were generated. The structure ended up with helices and loops, yet still in an open conformation. The generated structure was therefore relaxed using the PyRosetta version 4 FastRelax protocol to idealise the helices and compact the structure. Furthermore, not every prediction from the neural network resulted in an ideal structure even after the relax step, therefore we employed a filter to discard structures we deemed not ideal. The filter discards structures that were less than 80 amino acids long, had more residues in loops than in helices, had less than 20% of residues making up the core, or had a maximum distance between Cα1 and any other Cα greater than 88 Å (the largest value in the dataset). PyRosetta was used24 since it was easier to integrate the code with the neural network's python script and combine the Rosetta engine with Biopython25 and DSSP26,27.
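The filter logic can be sketched as follows; the helper inputs (secondary-structure counts from DSSP, the core fraction, the Cα coordinates) are assumed to be computed elsewhere in the pipeline, and the thresholds follow the text:

```python
import numpy as np

def passes_filter(n_residues, n_helix, n_loop, core_fraction, ca_coords):
    if n_residues < 80:                    # too short
        return False
    if n_loop >= n_helix:                  # more loop than helix residues
        return False
    if core_fraction <= 0.20:              # fewer than 20% core residues
        return False
    # Maximum distance between Ca1 and any other Ca must not exceed 88 A
    # (the largest value observed in the dataset).
    coords = np.asarray(ca_coords)
    if np.linalg.norm(coords - coords[0], axis=1).max() > 88.0:
        return False
    return True
```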
As a control we used the de novo protein design protocol from the Rosetta Script in the supplementary material of 5. The protocol was modified slightly to accommodate new updates in the Rosetta software suite, but maintained the talaris2014 energy function as in the original paper, and we used the protocol to design helical proteins. These proteins had to pass through several filters, including a talaris2014 score of less than −150, a packing threshold of more than 0.50, and a secondary structure threshold of more than 0.90. We attempted to design proteins with 3, 4, and 5 helices to compare the backbone quality of our neural network's outputs to the output of 5.
This setup used the following packages: Python 3.6.9, PyRosetta 4, Rosetta 3, TensorFlow 1.13.1, BioPython 1.76, and DSSP 3.01, and was run on GNU/Linux Ubuntu 19.10. Further information on running the setup can be found in this GitHub repository, which includes an extensive README file. To train the neural network, a 3GB GPU and 12GB of RAM are recommended, while to run the trained neural network and generate a backbone structure, an Intel i7-2620M 2.70GHz CPU and 3GB of RAM are recommended.
As detailed in Figure 2, executing the trained network generates a set of random numbers that are pushed through the network, which (using the trained weights) modifies the values to conform with the ϕ and ψ angle topologies observed in the training dataset. A simple straight chain of 150 valines is then computationally constructed and the ϕ and ψ angles are applied to each amino acid in order, resulting in the appearance of helical secondary structures. Any trailing loop at the COOH end is cut off since it interferes with the next step. The structure is then relaxed using the FastRelax protocol, which moves the ϕ and ψ angles randomly in its attempt to find the lowest-scoring configuration, compacting the structure in the process. A filter is applied to determine whether the final structure is ideal or not; its parameters are detailed in the Methods section. If the structure passes the filter the script exits, otherwise it repeats the whole process.
The dataset was named PS_Helix_500 because the features used were the ϕ and ψ angles, only strictly helical protein structures were used, and each structure was augmented 500 times.
Minimal hyperparameter search was needed and was performed manually. The neural network was trained on the dataset for 18,000 epochs (further training collapsed the network, i.e. all outputs were exactly the same structure) with a generally downward-sloping mean loss, as shown in Figure 3, indicating that the network got better at generating novel data compared to the original training data. The network was used to generate a sequence of ϕ and ψ angles for 25 structures. In other words, random numbers were generated and pushed through the network, where they were modified (using the network's trained weights) to become the ϕ and ψ angles for 150 residues: the angle profile. Using PyRosetta, a simple straight 150-valine chain was constructed and used as a primary structure. The generated ϕ and ψ angles were applied to this primary structure (each amino acid's angles were changed according to the angle profile), resulting in a folded structure clearly showing helical secondary structures with loops between them. The last trailing loop at the COOH end was truncated, resulting in structures of variable sizes.

The loop angles were within the area of the angles found in the dataset (Figure 1A), but because they were generated independently of the thermodynamic stability of the structure, the structure did not come together into a compact topology. To push the structure into a thermodynamic energy minimum it was relaxed using the PyRosetta FastRelax function, which compacted the backbone topology while still having valines as a temporary placeholder sequence. This was repeated 25 times to produce the 25 structures in Figure 4. For comparison, five control structures were generated using the de novo design protocol of the previous paper5. Figure 1B shows the Ramachandran plot of the 25 generated structures, where the red dots are the amino acids within helices, with angles clustering around the same location as in Figure 1A (in the fourth quadrant), as is desired for an α-helix, which has ideal angles around (−60°, −45°). Our results had an angle range for the helices of (−127.4° < ϕ < −44.7°, −71.3° < ψ < 30.6°), not including the outliers; the five control structures show angles within the same region (purple for helices and black for loops).

The structure generation setup was not perfect at achieving an ideal structure every time, so a filter was deployed to remove suboptimal structures by choosing a structure that had more residues within helices than within loops, was not smaller than 80 residues, and had more than 20% of residues comprising its core. Due to the random nature of the structure generation, the efficiency of the network is variable: while generating the 25 structures in Figure 4, the fastest structure was generated after just 4 failed attempts, while the slowest structure was generated after 3834 failed attempts, giving a success rate for the network between 25.0% at its best and 0.025% at its worst. It took between ~1 minute and ~6 hours to generate a single structure with the desired characteristics utilising just 1 core on a 2011 MacBook Pro with a 4-core Intel i7-2620M 2.70GHz CPU, 3GB RAM, and a 120GB SSD. For comparison, the protocol from 5 took ~60 minutes for each of the control structures on the same machine. The protocol is summarised in Figure 2, and the results are compiled in Figure 4, showing all 25 structures and the 5 controls.
The mean loss of the whole network per epoch, over 18,000 epochs, showing a general downward trend. This indicates that in subsequent epochs the network got better at generating structures.
This figure shows all 25 structures that were generated using the neural network, displayed using PyMOL33. It can be seen that all structures have compact helical folds of variable topologies. The 5 control structures at the bottom were generated using the protocol from 5, showing better helices but similar compactness.
The RMSD values of the generated structures were measured against all structures in the dataset and the lowest value for each generated structure isolated; the RMSD values were measured using the CE algorithm from PyMOL. The lowest RMSD was 2.09, between structure 20 and PDB ID 1K05 chain A. When these structures were analysed, their secondary structures were similar but the final structures were of different sizes; though the structures align at the centre, it can still be inferred that structure 20 is very similar to structure 1K05 chain A from the dataset (Figure 6A). On the other hand, structure 13 and 4I0X chain L have an RMSD value of 2.43 because almost all of 4I0X chain L aligns with structure 13; however, 4I0X chain L has only 2 helices, while structure 13 has 2 extra helices with 63 extra residues, indicating that structure 13 is not found within the dataset. The highest RMSD was 4.37, seen between structure 18 and 2J6Y chain E; when visualised they were very different, and while some helices aligned at the centre, the rest of the structures were widely different, hence the generated structure 18 is not found within the dataset (Figure 6B).
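A sketch of how such a measurement can be made with PyMOL's Python API and the CE algorithm (the file and object names are illustrative):

```python
# Align a generated structure against a dataset structure with the CE
# (combinatorial extension) algorithm and report the resulting RMSD.
from pymol import cmd

cmd.load('generated_20.pdb', 'generated')   # hypothetical generated structure
cmd.load('1k05_A.pdb', 'native')            # dataset structure to compare
result = cmd.cealign('native', 'generated') # CE structural alignment
print(result['RMSD'])                       # RMSD over the aligned residues
```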
This figure shows generated structures (red) aligned with their lowest-RMSD structure from the dataset (cyan). 6A: generated structure number 20 (red) aligned with 1K05 chain A (cyan), showing great structural similarity, resulting in an RMSD value of 2.09 and indicating it is most likely a structure copied from the dataset. 6B: generated structure 18 (red) aligned with 2J6Y chain E (cyan), showing great structural differences, resulting in an RMSD value of 4.37 and indicating this structure does not exist within the dataset.
As an additional evaluation step, five new structures were generated and sequence designed using RosettaDesign28–30 (script available here: https://github.com/sarisabban/RosettaDesign), then simulated using the AbinitioRelax folding simulation31; their secondary structures were also predicted from their sequences using PSIPRED32. The results are in Figure 7. It can be seen that the folding simulations did not result in low-scoring structures with low RMSD values (a funnel-shaped plot), as would be indicative of a successful folding simulation (the closest result to a funnel-shaped plot was structure 5). We presume that this was mainly due to the large variability of the loops within the dataset, along with high-RMSD fragments being generated at the loop positions (these fragments are used to drive the folding simulation). Nonetheless, when the low-score, low-RMSD simulated structures were structurally aligned with the designed structures, it was clear that they did align, with most of the helices aligning in the expected locations, with the exception of structure 4, which totally failed.
This figure summarises the results of five structure predictions after the structures were sequence designed using RosettaDesign. It can be seen that PSIPRED predicted helical secondary structures from their FASTA sequences. The AbinitioRelax folding simulation showed almost funnel-shaped plots, even though each structure was simulated more than 1 million times (red dots). The red dots are compared to the green dots, which are the score/RMSD positions of the designed structures. Structures 1, 2, 3, and 5 show that their lowest-score, lowest-RMSD simulated structures have their helices aligned with the designed structure at several positions, even though their folding simulations did not show a perfect funnel-shaped plot, while structure 4 totally failed the simulation.
These 25 structures had on average 84.7% of their amino acids comprising their helices, along with an average of 29.9% of their amino acids comprising their cores. Table 2 shows the Rosetta packing statistic for all 25 structures, along with the five controls, the top7 protein, and the five structures from 5, all showing similar packing values.
The controls were the structures generated using the modified protocol from 5. For additional comparison, the packing statistic of the top7 protein (PDB ID: 1QYS)4 was measured, along with those of the five structures from the protocol paper 5.
The LSTM network was not the only network that was tested; an MLP-GAN34 was also tested on the same dataset. Hyperparameter search was performed through a grid search and then a random search to finally reach the following hyperparameters for the generator network: 4 layers with 256, 512, and 1024 nodes, batch normalisation with 0.15 momentum between the layers, and an output layer with the shape of the dataset (150, 2). This network employed the ReLU activation for the hidden layers and the TanH activation function for the output layer, with the binary crossentropy loss and the Adam optimiser with a 0.0004 learning rate. The discriminator network had the following hyperparameters: 3 layers with 512, 256, and 1 node (the output), with a 25% dropout rate between the layers. This network employed the ReLU activation for the hidden layers and the sigmoid activation function for the output layer, with the binary crossentropy loss and the Adam optimiser with a 0.002 learning rate. The architecture was trained for 100,000 epochs with a mini-batch size of 512, and the results were not as good as those of the LSTM (Figure 5), generating structures that did have helices and loops but were much more sparse and kinked. Thus the LSTM is the preferred architecture for this particular dataset.
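For illustration, the described MLP-GAN can be sketched in tf.keras as follows. The hyperparameters follow the text; the latent size is an assumption, and the `lr` argument follows the TensorFlow 1.x-era Keras API used in this paper:

```python
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

latent_dim = 100  # assumed latent size

# Generator: 256 -> 512 -> 1024 with batch normalisation (momentum 0.15)
# between the layers, and a TanH output reshaped to the dataset shape (150, 2).
generator = models.Sequential([
    layers.Dense(256, activation='relu', input_dim=latent_dim),
    layers.BatchNormalization(momentum=0.15),
    layers.Dense(512, activation='relu'),
    layers.BatchNormalization(momentum=0.15),
    layers.Dense(1024, activation='relu'),
    layers.BatchNormalization(momentum=0.15),
    layers.Dense(150 * 2, activation='tanh'),
    layers.Reshape((150, 2)),
])

# Discriminator: 512 -> 256 -> 1 with 25% dropout between the layers.
discriminator = models.Sequential([
    layers.Flatten(input_shape=(150, 2)),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.25),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.25),
    layers.Dense(1, activation='sigmoid'),
])
discriminator.compile(optimizer=optimizers.Adam(lr=0.002),
                      loss='binary_crossentropy')

# Combined model for training the generator against a frozen discriminator.
discriminator.trainable = False
gan = models.Sequential([generator, discriminator])
gan.compile(optimizer=optimizers.Adam(lr=0.0004),
            loss='binary_crossentropy')
```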
In this paper we outlined how a neural network architecture can design a compact helical protein backbone. We attempted to use our network to generate structures that included sheets, but that failed, mainly due to the wide variation in the loop angles, which did not compact a structure enough to bring the strands together; sheets rely on this compaction more than helices do in order to develop.
We demonstrated that the ϕ and ψ angles were adequate features to design a protein backbone topology alone, without a sequence. Although we understand the distribution of angles in the Ramachandran plot (the distribution of helix and sheet angles), the neural network's power comes from combining several dimensions of the data to make a better decision. Given its observation of natural proteins, it can calculate the ideal combination of angles, thereby deciding the number and lengths of the helices and the loops between them, while still resulting in a compactly folded protein backbone.
The reason we concentrated our efforts on generating only a backbone is that once a backbone is developed it can be sequence designed using other protocols, such as RosettaDesign4,14 or the protocol from 13.
Though our network had a wide variation in success rates, from high to low, that was due to the random nature of the setup, which was our target to begin with (to randomly generate backbone topologies rather than directly design a specific pre-determined topology). Generating multiple structures and automatically filtering out the suboptimal ones provided an adequate setup; this achieved our goal of de novo helical protein backbone design within a reasonable time (1–6 hours) on readily available machines.
As a control we used a slightly modified de novo design protocol from 5,19–21, which also performs backbone design (resulting in an all-valine backbone topology) followed by sequence design of that backbone. It has numerous advantages, such as generating better helices, but the user must still pre-determine the topology to be generated (decide the number of helices, their lengths and locations, and the lengths of the loops between them), while our neural network automatically takes that decision and randomly generates different backbones, which can be very useful for database generation (see below).
The neural network is available at this GitHub repository, which includes an extensive README file, a video that explains how to use the script, the dataset used in this paper, and the weights files generated from training the neural network.
For future work we are currently working on an improved model that uses further dataset dimensions and features that will allow the design of sheets.
There are many benefits to generating numerous random protein structures computationally. One benefit is in computational vaccine development, where a large diversity of protein structures is required as scaffolds for a successful graft of a desired motif35–37. Using this setup, combined with sequence design, a database of protein structures can be generated that provides a greater variety of scaffolds than what is available in the Protein Data Bank, especially since this neural network is designed to produce compact backbones between 80 and 150 amino acids, which is within the range of effective computational forward folding simulations such as AbinitioRelax structure prediction38.
Source code available from: https://sarisabban.github.io/RamaNet/.
Archived Source code at time of publication: 10.5281/zenodo.3755343
License: MIT License.
The corresponding author would like to thank the High Performance Computing Center at King Abdulaziz University for making available the Aziz high performance computer where the corresponding author was able to augment the dataset and perform the AbinitioRelax folding simulations.
1. As per the reviewers' recommendation we added the linux_requirements.txt and python_requirements.txt files to the repository.
2. We added a paragraph in the introduction section about the related works the reviewers recommended.
3. We measured the RMSD of each of the generated structures compared to the structures in the dataset and added the data to the results section. This is to see whether or not the network generates structures from within the dataset (overfitting). New figures were added as well.
4. As the reviewer suggested, we added results showing 5 additional structures that were sequence designed and had their folding simulated. These results were previously posted in the first iteration of the BioRxiv manuscript, but we removed them for the following reason: we were unsure whether the sequence design / folding simulation was an adequate evaluation criterion for our particular results, because the AbinitioRelax folding simulation does not always successfully fold natural structures (even natural crystal structures). Take the example of 1YN3 and 4WYH (which we tried to use as controls): Abinitio successfully folds 1YN3 but not 4WYH, thus 1YN3 can be sequence designed and folded but 4WYH cannot, even though it is a natural helical protein structure (results can be provided). Given this, when we sequence designed and folded our structures we could not determine whether they failed to fold because their backbones are not logical, because Abinitio fails at folding those particular backbones, because good fragments could not be generated for those particular structures, or because the sequence design step failed, especially since the network produces random structures. We did, however, get successful secondary structure predictions from PSIPRED after sequence design; further explanation is in the results section of the new manuscript version. Nonetheless, we re-added just the Abinitio folding simulation results and we hope this will give further and more accurate insight into our setup.