PREMONITION - Preprocessing motifs in protein structures for search acceleration [version 1; peer review: 2 approved with reservations, 1 not approved]

The remarkable diversity in biological systems is rooted in the ability of the twenty naturally occurring amino acids to perform multifarious catalytic functions by creating unique structural scaffolds known as the active site. Finding such structural motifs within the protein structure is a key aspect of many computational methods. The algorithm for obtaining combinations of motifs of a certain length, although polynomial in complexity, runs in non-trivial computer time. Also, the search space expands considerably if stereochemically equivalent residues are allowed to replace an amino acid in the motif. In the present work, we propose a method to precompile all possible motifs comprising a set (n=4 in this case) of predefined amino acid residues from a protein structure that occur within a specified distance (R) of each other (PREMONITION). PREMONITION rolls a sphere of radius R along the protein fold centered at the Cα atom of each residue, and all possible motifs are extracted within this sphere. The number of residues that can occur within a sphere centered around a residue is bounded by physical constraints, thus setting an upper limit on the processing times. After such a pre-compilation


Introduction
The rapid development of crystallization techniques has resulted in a deluge of proteins with known structures 1. Most of the proteins are annotated using sequence alignment methods by a 'guilt by association' logic based on the sequence-to-structure-to-function paradigm 2. However, sequence alignment methods are not applicable in cases where similar functional groups are identically positioned in the active site of proteins with no sequence homology. The classic example of this phenomenon, known as convergent evolution 3,4, is the major families of serine proteases (chymotrypsin and subtilisin), where the active site is structurally and functionally identical, though there is no global sequence or structural homology 5. According to some studies, about 42% of entries annotated as 'unknown functions' are true examples of proteins of unknown function 6.
Structure-based methods have evolved to detect such convergently evolved proteins 7,8. The conservation of structural properties is the primary driving logic behind many of these identification methods, reviewed in detail previously 9. There are essentially two categories of programs that find binding sites in proteins (binding sites are typically closely related to protein function). The first one requires a predefined set of amino acids (motifs) of a known enzymatic function to search for the same within the protein under investigation 8,10-14. The second category automatically detects similarity in the side chain patterns to classify protein functionality 7,15-18. We have demonstrated through several detailed examples, using a method (CLASP 19) which falls in the first category, that such structural conservation necessitates the conservation of electrostatic properties in proteins with the same functionality 19-22.
A challenge emerging in these methods relates to the large fold space of known proteins, although the growth of this space is gradually saturating 23. Efficient parallelization has allowed the ProBiS algorithm to compare a protein query against the PDB in minutes 15. To date, the identification of motifs is a task executed on the fly and applied sequentially 13,14. Thus, running multiple queries involves several invocations of the same program. Our aim is to amortize the processing times by a one-time precompilation of all possible motifs, pruned using rational distance constraints, which can be leveraged for future queries.
A simplistic approach to obtain motifs is to enumerate all possible combinations from the sequence. Motifs that span distances rarely seen in active sites can be pruned out using structural information. In the present work, we propose a method to precompile all possible motifs comprising a set (n=4 in this case) of predefined amino acid residues from a protein structure that occur within a specified distance (R) of each other (PREMONITION: Preprocessing motifs in protein structures for search acceleration). We have estimated R from the known active site residues of ~500 proteins annotated in the CSA database (http://www.ebi.ac.uk/thorntonsrv/databases/CSA/) 24. PREMONITION rolls a sphere of radius R along the protein fold centered at the Cα atom of each residue, and all motifs are extracted within this sphere. The maximum number of residues that occur within R Å of any residue in the protein dataset is also computed. This sets an upper bound for the polynomial complexity of the PREMONITION algorithm, and run times for the precompilation are reasonable. After such a precompilation step, the computational time required for querying a protein structure with multiple motifs is reduced considerably.
Previously, we had proposed a computational method (PROMISE) to estimate the promiscuity of proteins with known active site residues and 3D structure using the CSA database 25. It took less than a minute to query one protein with all 500 motifs using PREMONITION, a process that took almost a day when done sequentially. Such a speed-up enables querying a much larger set of proteins using a comprehensive set of ligands, as is required in drug screening procedures.

Materials and methods
Algorithm 1 details the steps in creating the PREMONITION database for a given protein. We enumerate the steps using a concrete example for trypsin (PDBid:1A0J). Given a radius of interaction SOAS, we first compute, for each residue Residue_i, the set of residues within SOAS of it. For example, let us assume we are processing residue D102. Equation 1 gives the set of residues that have at least one atom within SOAS=10 Å of the Cα of D102.
We now take all combinations of n=4 residues from this set; a combination does not necessarily include D102 (Equation 2).
Each combination is stored under a key formed from the sorted one-letter codes of its residues. For the combination (102,90,227,237), we add it to the global tableofmatches under the key 'DMVW'. Thus, as we process every residue in the protein, we merge all occurrences of 'DMVW' (Equation 4). Extracting all motifs of 'DMVW' now consists of the trivial task of reading this set from the file on disk.
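The precompilation steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the input format (a list of residue tuples), the function names `neighbours` and `precompile`, and the toy coordinates are all hypothetical. Only the residue numbers, the key 'DMVW' and the 10 Å radius are taken from the text.

```python
from collections import defaultdict
from itertools import combinations
import math

# Assumed input format (hypothetical): one tuple per residue,
# (residue number, one-letter code, C-alpha coordinate, list of atom coordinates).

def neighbours(center_ca, residues, soas):
    """Residues having at least one atom within `soas` of `center_ca`."""
    return [
        (num, code)
        for num, code, _ca, atoms in residues
        if any(math.dist(center_ca, atom) <= soas for atom in atoms)
    ]

def precompile(residues, n=4, soas=10.0):
    """Map each sorted n-letter key to the set of residue-number tuples found."""
    table = defaultdict(set)
    for _num, _code, ca, _atoms in residues:
        near = neighbours(ca, residues, soas)
        for combo in combinations(near, n):
            # Order by one-letter code (then number) so the key is a sorted string
            # and the indices line up with its letters.
            ordered = sorted(combo, key=lambda r: (r[1], r[0]))
            key = "".join(code for _, code in ordered)
            # A set merges the same motif found from different sphere centers.
            table[key].add(tuple(num for num, _ in ordered))
    return table

# Toy data: residue numbers from the text, coordinates INVENTED for illustration
# (one atom per residue, placed on a line; not real trypsin geometry).
toy = [
    (102, "D", (0.0, 0.0, 0.0), [(0.0, 0.0, 0.0)]),
    (90, "M", (4.0, 0.0, 0.0), [(4.0, 0.0, 0.0)]),
    (227, "V", (8.0, 0.0, 0.0), [(8.0, 0.0, 0.0)]),
    (237, "W", (12.0, 0.0, 0.0), [(12.0, 0.0, 0.0)]),
    (56, "A", (30.0, 0.0, 0.0), [(30.0, 0.0, 0.0)]),
]
table = precompile(toy, n=4, soas=10.0)
```

In this toy case the same four residues fall inside the spheres of two different centers, and the set-valued table keeps a single 'DMVW' entry, which mirrors the merging step described in the text.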
Adaptive Poisson-Boltzmann Solver (APBS) and PDB2PQR packages were used to calculate the potential difference between the reactive atoms of the corresponding proteins 26,27. The APBS parameters and electrostatic potential units were set as described previously 19. Protein structures were rendered by PyMOL (http://www.pymol.org/). The proteins were superimposed based on the matching motifs using DECAAF 28.

Results and discussion
Estimating the maximum radius for computing interacting residues
First, we estimated the minimal radius of a sphere that encompasses active sites found in proteins. CSA provides catalytic residue annotation for enzymes in the PDB and is available online 24. The database consists of an original hand-annotated set extracted from the primary literature and a homologous set inferred by PSI-BLAST 2. We chose ~500 proteins from the CSA database that are annotated from the literature (SI list.doc). We computed the size of the active site (SOAS) for these known active sites by finding the minimum radius, centered around one residue, that encompasses the other residues. Table 1 shows the pairwise distances for the four residues that comprise the active site in trypsin (PDBid:1A0J): Asp102, Ser195, His57 and Ala56. It can be seen that a sphere of radius 6.9 Å centered around His57 (c) would include all other residues. The radii required to encompass all other residues of the motif when centered at a different residue are larger than this value (Asp102 = 7.8 Å, Ser195 = 9 Å and Ala56 = 9 Å). Thus, the SOAS for this protein is 6.9 Å. Figure 1a shows the frequency distribution of the SOAS for the set of 500 proteins chosen (mean = 7.3 Å, standard deviation = 1.8 Å, min = 3.5 Å and max = 13 Å). Of these proteins, 90% have an SOAS below 10 Å.
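The SOAS definition used above (the smallest radius, over all choices of center residue, that reaches every other residue of the motif) can be written as a short Python sketch. The function name `soas` and the example points are assumptions made for illustration; the points are invented coordinates, not the trypsin atoms of Table 1.

```python
import math

def soas(points):
    """Return (radius, index of center) for a set of motif atoms.

    For each candidate center, the required radius is the largest distance
    to any other point; the SOAS is the smallest such radius.
    """
    best = None
    for i, center in enumerate(points):
        radius = max(math.dist(center, p) for j, p in enumerate(points) if j != i)
        if best is None or radius < best[0]:
            best = (radius, i)
    return best

# Invented coordinates (in Å) for four motif atoms a, b, c, d.
points = [(0.0, 0.0, 0.0), (6.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
radius, center_index = soas(points)
```

For these points the required radii are 6, 6, 4 and 5 Å for centers a, b, c and d, so the sketch selects c with a radius of 4 Å, in the same way that His57 (c) is selected for trypsin in the text.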
Number of residues within a sphere of radius 10 Å
Next, we estimated the number of residues that fall within the SOAS radius in different proteins. PREMONITION takes combinations of 4 residues from each such set (of size N), and is thus polynomial in complexity, O(N^4). An upper bound on N needs to be known to ensure that runtimes are reasonable.
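The per-sphere workload is the binomial coefficient C(N, n), which for fixed n=4 grows as N^4. A small Python check makes the growth concrete; the helper name `motifs_per_sphere` is hypothetical, and N=58 is the maximum neighbourhood size reported in this section for a SOAS of 10 Å.

```python
from math import comb

def motifs_per_sphere(n_neighbours, motif_length=4):
    """Number of candidate motifs drawn from one sphere: C(N, n)."""
    return comb(n_neighbours, motif_length)

four_residue = motifs_per_sphere(58)       # motif length 4
five_residue = motifs_per_sphere(58, 5)    # motif length 5
ratio = five_residue / four_residue        # growth factor from n=4 to n=5
```

The ratio between the two counts is (58 - 4) / 5 = 10.8, which is why adding a single residue to the motif length raises the work per sphere by roughly an order of magnitude.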
Figure 1b shows the probability distribution of the number of residues that lie within a SOAS (of 10 or 11 Å) for all residues in the proteins in our dataset. A residue R_j is considered to be within the SOAS of another residue R_i if any atom of R_j falls within a sphere of radius SOAS centered around the Cα of R_i. Out of a total of 174780 residues in these proteins, the maximum number of residues found within the SOAS of 10 Å is 58. Thus, the upper bound on the number of possible combinations for one residue is C(58,4) = 424270, which is quite tractable (as can be seen from the runtimes below). Note that although the algorithm is polynomial in complexity, this number increases rapidly with increasing motif length, as well as with the SOAS. For example, for a motif length of 5, the number of possible combinations in a set of 58 residues is 4582116 (about ten times the number for a motif length of 4). However, a 4-residue motif is sufficient to represent most active site conformations, and for a preliminary search on extensive datasets. Similarly, for a SOAS of

Table 1. Pairwise distance between the active site residues in trypsin (PDBid:1A0J). The motif consists of Asp102/OD1 (a), Ser195/OG (b), His57/ND1 (c) and Ala56/N (d). A sphere of radius 6.9 Å centered around His57 (c) would include all other active site residues. This is the minimal distance: a sphere centered at any other residue needs a larger radius to include all residues.

Running CLASP using the PREMONITION modified algorithm
We queried trypsin (PDBid:1A0J) with all 500 motifs using the modified CLASP algorithm and the PREMONITION database 19. The best matches are shown in Table 2. As expected, the best matches are those with known serine catalytic triads. As an illustrative example, we chose a protein (thioesterase, PDBid:1THT) with no known relationship with trypsin. This protein has the active site motif H241 S114 V136 W213. This corresponds to the string query 'HSVW' (note that the string is sorted). We extract all entries for this string (Equation 5), which are all possible occurrences of the structural motif 'HSVW' in trypsin. All entries of 'HSVW' are compared using CLASP. Table 3 shows the electrostatic potential difference and spatial difference for each of the motifs in Equation 5. The best scoring motif is (H57ND1,S195OG,V213N,W215NE1). However, even the best motif has a relatively large RMSD (Figure 2). Thus, this is not a significant match.
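The lookup step for a query motif can be illustrated with a short Python sketch. The names `motif_key` and `lookup` and the in-memory dictionary are assumptions made for illustration (the text describes the table as a file on disk), and the table below holds only the single 'HSVW' entry that the text names explicitly, not the full set referred to in Equation 5.

```python
def motif_key(residues):
    """Sorted one-letter codes of a motif, e.g. ['H241', 'S114', ...] -> 'HSVW'."""
    return "".join(sorted(r[0] for r in residues))

def lookup(table, query_residues):
    """All precompiled occurrences stored under the query's sorted key."""
    return sorted(table.get(motif_key(query_residues), ()))

# Hypothetical, partial table for trypsin: only the entry named in the text.
trypsin_table = {"HSVW": {(57, 195, 213, 215)}}

# Active site motif of thioesterase (PDBid:1THT) as given in the text.
query = ["H241", "S114", "V136", "W213"]
candidates = lookup(trypsin_table, query)
```

Because the key is sorted, the order in which the query residues are listed does not matter, and a key that is absent from the table simply yields no candidates. Each candidate returned in this way would then be scored, which in the text is done by CLASP.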

Runtimes and disk space
Previously, we had proposed a computational method (PROMISE) to estimate the promiscuity of proteins with known active site residues and 3D structure using the CSA database 25. PROMISE used each of the 500 proteins with known active site residues extracted from the CSA database to query every other protein in that set. This procedure required ~500 × 500 = 250000 program calls of CLASP 19, each of which took one minute on average. Thus, the total time taken was a month on a parallel system 25. Using PREMONITION, it took less than a minute to query one protein with all 500 motifs. Thus, it took less than a day to replicate the PROMISE results. The precompilation step of extracting all motifs takes approximately 15 minutes on average for a single protein. For the protein with the largest SOAS (PDBid:1GPJ, 13 Å), the precompilation took one hour. These are acceptable values for a one-time processing step. The

Juliana Bernardes
Programa de Engenharia de Sistemas e Computação, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
The work proposes a simple procedure for accelerating the search for structural motifs. It precompiles all motifs of size n within a radius R from protein structures and uses this motif table to detect matches between a query sequence and proteins with known structure faster.
Major points:
This procedure is a trivial step since motif searches are not performed sequentially. A motif search algorithm must pre-compile possible motifs in order to make the search feasible. In my opinion, it is a detail of the CLASP implementation, and the method is insufficient to justify a full Method Article; maybe a short paper.

Minor:
The paper is confusing and English must be improved.

Competing Interests:
No competing interests were disclosed.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.

Major points:
○ The choice of considering combinations of n=4 from the set of residues within the SOAS distance is not sufficiently explained and should be discussed. The authors write: "However, a 4 residue motif is sufficient to represent most active site conformations, and for a preliminary search on extensive datasets." This claim should be justified and compared with similar choices in other algorithms performing similar tasks.
○ The part discussing the illustrative example on thioesterase should be rewritten. First, it is not clear what the authors intend to show. Secondly, the properties reported in Table 3 (potential and spatial congruence) are not introduced anywhere in the paper and no discussion is provided about them. Finally, the criteria for the selection of the best scoring

Figure 1. Estimating the radius of the sphere enclosing the active site in proteins, and the number of residues in the sphere. Data is extracted from ~500 proteins from the CSA database which are annotated from literature. (a) The size of the active site (SOAS) computed using the minimum radius centered around one residue that encompasses the other residues. (b) Number of residues enclosed by the SOAS (red = 10 Å, blue = 11 Å). A residue R_j is considered to be within the SOAS of another residue R_i if any atom of R_j falls within a sphere of radius SOAS centered around the Cα of R_i.

Figure 2. Superimposing thioesterase (PDBid:1THT, in blue) and trypsin (PDBid:1A0J, in green). The proteins are superimposed based on the matching active site residues using DECAAF 28. (a) The global superimposition does not show any significant homology. (b) The detailed residue configuration. Residues from trypsin are in red, and those in the thioesterase are in blue. His57 and His241 completely overlap and are in black.

Reviewer Report 06 October 2014
https://doi.org/10.5256/f1000research.5510.r6329
© 2014 Ciurli S et al. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Stefano Ciurli
Laboratory of Bioinorganic Chemistry, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
Francesco Musiani
Laboratory of Bioinorganic Chemistry, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
The manuscript from Chakraborty et al. reports an algorithm aimed at accelerating searches in protein active site databases. The algorithm precompiles all the possible motifs comprising a set of n=4 amino acids.
Input: P_1: Reference protein
Input: n: number of residues in the motif
Input: SOAS: Radius of sphere centered around each residue
begin
/* Table mapping string of amino acids of length n to list of indices */

Table 2. Best matches when trypsin (PDBid:1A0J) is queried using 500 motifs from the CSA database. As expected, the best matches are those with known catalytic triads.