Method Article

PREMONITION - Preprocessing motifs in protein structures for search acceleration

[version 1; peer review: 2 approved with reservations, 1 not approved]
PUBLISHED 10 Sep 2014

Abstract

The remarkable diversity in biological systems is rooted in the ability of the twenty naturally occurring amino acids to perform multifarious catalytic functions by creating unique structural scaffolds known as the active site. Finding such structural motifs within the protein structure is a key aspect of many computational methods. The algorithm for obtaining combinations of motifs of a certain length, although polynomial in complexity, runs in non-trivial computer time. Also, the search space expands considerably if stereochemically equivalent residues are allowed to replace an amino acid in the motif. In the present work, we propose a method (PREMONITION) to precompile all possible motifs comprising a set (n=4 in this case) of predefined amino acid residues from a protein structure that occur within a specified distance (R) of each other. PREMONITION rolls a sphere of radius R along the protein fold centered at the Cα atom of each residue, and all possible motifs are extracted within this sphere. The number of residues that can occur within a sphere centered around a residue is bounded by physical constraints, thus setting an upper limit on the processing times. After such a precompilation step, the computational time required for querying a protein structure with multiple motifs is considerably reduced. Previously, we had proposed a computational method to estimate the promiscuity of proteins with known active site residues and 3D structure using a database of known active sites in proteins (CSA) by querying each protein with the active site motif of every other protein. The runtimes for such a comparison are reduced from days to hours using the PREMONITION methodology.

Introduction

The rapid development of crystallization techniques has resulted in a deluge of proteins with known structures1. Most of the proteins are annotated using sequence alignment methods by a ‘guilt by association’ logic based on the sequence-to-structure-to-function paradigm2. However, sequence alignment methods are not applicable in cases where similar functional groups are identically positioned in the active site of proteins with no sequence homology. The classic example of this phenomenon, known as convergent evolution3,4, is the major families of serine proteases (chymotrypsin and subtilisin), where the active site is structurally and functionally identical, though there is no global sequence or structural homology5. According to some studies, about 42% of entries annotated as ‘unknown functions’ are true examples of proteins of unknown function6.

Structure-based methods have evolved to detect such convergently evolved proteins7,8. The conservation of structural properties is the primary driving logic behind many of these identification methods, reviewed in detail previously9. There are essentially two categories of programs that find binding sites in proteins (binding sites are typically closely related to protein function). The first one requires a predefined set of amino acids (motifs) of a known enzymatic function to search for the same within the protein under investigation8,10–14. The second category automatically detects similarity in the side chain patterns to classify protein functionality7,15–18. We have demonstrated through several detailed examples, using a method (CLASP19) which falls in the first category, that such structural conservation necessitates the conservation of electrostatic properties in proteins with the same functionality19–22.

A challenge emerging in these methods relates to the large fold space of known proteins, although the growth of this space is gradually saturating23. Efficient parallelization has allowed the ProBiS algorithm to compare a protein query against the PDB in minutes15. To date, the identification of motifs is a task executed on the fly and applied sequentially13,14. Thus, running multiple queries involves several invocations of the same program. Our aim is to amortize the processing times by a one-time precompilation of all possible motifs, pruned using rational distance constraints, which can be leveraged for future queries.

A simplistic approach to obtain motifs is to enumerate all possible combinations from the sequence. Motifs that span distances rarely seen in active sites can be pruned out using structural information. In the present work, we propose a method to precompile all possible motifs comprising a set (n=4 in this case) of predefined amino acid residues from a protein structure that occur within a specified distance (R) of each other (PREMONITION - Preprocessing motifs in protein structures for search acceleration). We have estimated R from the known active site residues of ~500 proteins annotated in the CSA database (http://www.ebi.ac.uk/thornton-srv/databases/CSA/)24. PREMONITION rolls a sphere of radius R along the protein fold centered at the Cα atom of each residue, and all motifs are extracted within this sphere. The maximum number of residues that occur within R Å of any residue in the protein dataset is also computed. This sets an upper bound for the polynomial complexity of the PREMONITION algorithm, and run times for the precompilation are reasonable. After such a precompilation step, the computational time required for querying a protein structure with multiple motifs is reduced considerably.

Previously, we had proposed a computational method (PROMISE) to estimate the promiscuity of proteins with known active site residues and 3D structure using the CSA database25. It took less than a minute to query one protein with all 500 motifs using PREMONITION, a process that took almost a day when done sequentially. Such a speed-up enables querying a much larger set of proteins using a comprehensive set of ligands, as is required in drug screening procedures.

Materials and methods

Algorithm 1 details the steps in creating the PREMONITION database for a given protein. We illustrate the steps using a concrete example for trypsin (PDBid:1A0J). Given a radius of interaction SOAS (the size of the active site, estimated in the Results section), we first compute the set of residues within SOAS of each residue Residue_i. For example, let us assume we are processing residue D102. Equation 1 gives the set of residues that have at least one atom within SOAS=10 Å from the Cα of D102.

      ϕNearestResiduesD102 = [D102, W237, M90, V227, M180, G38, R62, S37, D194...]         (1)

We now take all combinations of n=4 residues from this set; a given combination does not necessarily include D102 itself (Equation 2).

      ϕCombinationsD102 = [(W237, M90, D102, V227), (W237, M90, D194, M180), (G38, R62, S37, D102)...]         (2)

After sorting each combination based on the single letter amino acid code, we obtain a set of n-tuples, each keyed by its sorted string (Equation 3).

      ϕSortedStringsD102 = [DMVW = (102, 90, 227, 237), DMMW = (194, 90, 180, 237), DGRS = (102, 38, 62, 37)...]       (3)

Now, we add (102,90,227,237) to the global tableofmatches for the key ‘DMVW’. Thus, as we process every residue in the protein, we merge all occurrences of ‘DMVW’ (Equation 4).

      ϕAllMotifsDMVW = [(194, 104, 53, 141), (102, 90, 227, 237), (194, 104, 52, 51), (194, 104, 53, 51)...]       (4)

Extracting all motifs of ‘DMVW’ now consists of the trivial task of reading this set from the file on disk.
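The keying step in Equations 2 to 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the published implementation: the names `motif_key` and `insert_in_table`, the `(letter, index)` tuple representation, and the choice to break ties between identical letters by residue index are all hypothetical.

```python
from collections import defaultdict

def motif_key(combination):
    """Sort (one-letter code, residue index) pairs and return
    (key string, tuple of indices). Ties between identical letters
    are broken by residue index (an assumption of this sketch)."""
    ordered = sorted(combination)
    key = "".join(letter for letter, _ in ordered)
    indices = tuple(index for _, index in ordered)
    return key, indices

def insert_in_table(table, combination):
    """Merge one combination into the global table. A set is used
    because the same combination is seen from several sphere centers."""
    key, indices = motif_key(combination)
    table[key].add(indices)

table_of_matches = defaultdict(set)
combinations_d102 = [
    [("W", 237), ("M", 90), ("D", 102), ("V", 227)],
    [("W", 237), ("M", 90), ("D", 194), ("M", 180)],
    [("G", 38), ("R", 62), ("S", 37), ("D", 102)],
]
for combo in combinations_d102:
    insert_in_table(table_of_matches, combo)
```

With this layout, retrieving every occurrence of a motif string is a single dictionary lookup, for example `table_of_matches["DMVW"]`.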

Adaptive Poisson-Boltzmann Solver (APBS) and PDB2PQR packages were used to calculate the potential difference between the reactive atoms of the corresponding proteins26,27. The APBS parameters and electrostatic potential units were set as described previously19. Protein structures were rendered by PyMol (http://www.pymol.org/). The proteins were superimposed based on the matching motifs using DECAAF28.

Algorithm 1. Premonition()

   Input: P1 : Reference protein

   Input: n: number of residues in the motif

   Input: SOAS: Radius of sphere centered around each residue

   begin

        /* Table mapping string of amino acids of length n to list of indices */

        tableofmatches = ∅ ;

        ϕca = Cα atoms of all residues ;

        foreach CAi in ϕca do

             ϕNearestResidues = FindResiduesWithinDist(CAi, SOAS);

             ϕCombinations = GetCombinationsof_n(ϕNearestResidues, n);

             ϕSortedStrings = SortBasedOnAminoAcidName(ϕCombinations);

             InsertInTable(ϕSortedStrings,tableofmatches);

        end

        /* Output in file */

        foreach string in tableofmatches do

             ϕMotifsstring = GetMotifsforEachString(tableofmatches, string);

             PrintListofMatches(ϕMotifsstring);

        end

   end
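Algorithm 1 can be sketched end to end in Python. This is an illustrative sketch only: the function name `build_premonition_table`, the dictionary layout of each residue, and the brute-force neighbour search are assumptions made for this example, and a real implementation would likely use a spatial index and write the table to disk.

```python
from collections import defaultdict
from itertools import combinations
from math import dist

def build_premonition_table(residues, n=4, soas=10.0):
    """residues: list of dicts with keys 'code' (one-letter amino acid),
    'index' (residue number), 'ca' (C-alpha xyz) and 'atoms' (list of xyz).
    Returns a mapping from sorted amino acid string to a set of index tuples."""
    table = defaultdict(set)
    for center in residues:
        # Residues with at least one atom within soas of this C-alpha.
        nearest = [
            (r["code"], r["index"])
            for r in residues
            if any(dist(center["ca"], atom) <= soas for atom in r["atoms"])
        ]
        # Sorting first means every n-combination is already in key order.
        for combo in combinations(sorted(nearest), n):
            key = "".join(code for code, _ in combo)
            table[key].add(tuple(index for _, index in combo))
    return table

# Toy input (made-up coordinates): four residues close together, one far away.
toy = [
    {"code": "D", "index": 1, "ca": (0.0, 0.0, 0.0), "atoms": [(0.0, 0.0, 0.0)]},
    {"code": "H", "index": 2, "ca": (3.0, 0.0, 0.0), "atoms": [(3.0, 0.0, 0.0)]},
    {"code": "S", "index": 3, "ca": (6.0, 0.0, 0.0), "atoms": [(6.0, 0.0, 0.0)]},
    {"code": "A", "index": 4, "ca": (9.0, 0.0, 0.0), "atoms": [(9.0, 0.0, 0.0)]},
    {"code": "G", "index": 5, "ca": (40.0, 0.0, 0.0), "atoms": [(40.0, 0.0, 0.0)]},
]
table = build_premonition_table(toy, n=4, soas=10.0)
```

In the toy input the distant glycine never shares a sphere with the other residues, so it contributes no motif; this is the distance-based pruning that keeps the table small.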

Results and discussion

Estimating the maximum radius for computing interacting residues

First, we estimated the minimal radius of a sphere that encompasses active sites found in proteins. CSA provides catalytic residue annotation for enzymes in the PDB and is available online24. The database consists of an original hand-annotated set extracted from the primary literature and a homologous set inferred by PSI-BLAST2. We chose ~500 proteins from the CSA database that are annotated from the literature (SI list.doc). We computed the size of the active site (SOAS) for these known active sites by finding the minimum radius centered around one residue that encompassed the other residues. Table 1 shows the pairwise distances for the four residues that comprise the active site in trypsin (PDBid:1A0J) - Asp102, Ser195, His57 and Ala56. It can be seen that a sphere of radius 6.9 Å centered around His57 (c) would include all other residues. The radii required to encompass all other residues of the motif when centered at a different residue are larger than this value (Asp102 = 7.8 Å, Ser195 = 9 Å and Ala56 = 9 Å). Thus, the SOAS for this protein is 6.9 Å. Figure 1a shows the frequency distribution of the SOAS for the set of 500 proteins chosen (mean = 7.3 Å, standard deviation = 1.8 Å, min = 3.5 Å and max = 13 Å). 90% of proteins have an SOAS below 10 Å.

Table 1. Pairwise distance between the active site residues in trypsin (PDBid:1A0J).

The motif consists of Asp102/OD1 (a), Ser195/OG (b), His57/ND1 (c) and Ala56/N (d). A sphere of radius 6.9 Å centered around His57 (c) would include all other active site residues. This is the minimal radius: a sphere centered on any other residue of the motif needs a larger radius to include all residues.

Atom1   Atom2   Distance (Å)
a       b       7.8
a       c       5.6
a       d       2.9
b       c       3.3
b       d       9.0
c       d       6.9
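The SOAS calculation for trypsin can be reproduced from the distances in Table 1. The sketch below is illustrative only: the function names `enclosing_radius` and `size_of_active_site` are hypothetical, and it works on the tabulated atom-to-atom distances rather than on coordinates.

```python
# Pairwise distances (in Å) from Table 1:
# a = Asp102/OD1, b = Ser195/OG, c = His57/ND1, d = Ala56/N
pairwise = {
    ("a", "b"): 7.8, ("a", "c"): 5.6, ("a", "d"): 2.9,
    ("b", "c"): 3.3, ("b", "d"): 9.0, ("c", "d"): 6.9,
}

def enclosing_radius(center, labels, distances):
    """Radius of the smallest sphere centered on `center` that
    contains every other atom of the motif."""
    return max(
        distances.get((center, other), distances.get((other, center)))
        for other in labels
        if other != center
    )

def size_of_active_site(labels, distances):
    """Return (best center, SOAS, radius per center): the SOAS is the
    smallest enclosing radius over all choices of center."""
    radii = {label: enclosing_radius(label, labels, distances) for label in labels}
    best = min(radii, key=radii.get)
    return best, radii[best], radii

center, soas, radii = size_of_active_site("abcd", pairwise)
```

Here `radii` holds 7.8, 9.0, 6.9 and 9.0 Å for centers a, b, c and d, so the minimum is reached at c (His57), in agreement with the SOAS of 6.9 Å quoted in the text.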

Figure 1. Estimating the radius of the sphere enclosing the active site in proteins, and the number of residues in the sphere.

Data is extracted from ~500 proteins from the CSA database which are annotated from literature. (a) The size of the active site (SOAS) computed using the minimum radius centered around one residue that encompasses the other residues. (b) Number of residues enclosed by the SOAS (Red=10 Å, blue=11 Å). A residue Rj is considered to be within the SOAS of another residue Ri if any atom of Rj falls within a sphere of radius SOAS centered around the Cα of Ri.

Number of residues within a sphere of radius 10 Å

Next, we estimated the number of residues that fall within the SOAS radius in different proteins. PREMONITION takes combinations of 4 residues from each such set (of size N), and is thus polynomial in complexity, O(N^4). An upper bound on N needs to be known to ensure that runtimes are reasonable.

Figure 1b shows the probability distribution of the number of residues that lie within a SOAS of 10 or 11 Å for all residues in proteins in our dataset. A residue Rj is considered to be within the SOAS of another residue Ri if any atom of Rj falls within a sphere of radius SOAS centered around the Cα of Ri. Out of a total of 174780 residues in these proteins, the maximum number of residues found within the SOAS of 10 Å is 58. Thus, the upper bound on the number of possible combinations for one residue is C(58, 4) = 424270, which is quite tractable (as can be seen from the runtimes below). Note that although the algorithm is polynomial in complexity, this number increases rapidly with increasing motif length, as well as with the SOAS. For example, for a motif length of 5, the number of possible combinations in a set of 58 residues is C(58, 5) = 4582116 (roughly ten times the number for a motif length of 4). However, a 4-residue motif is sufficient to represent most active site conformations and to run a preliminary search on extensive datasets. Similarly, for a SOAS of 11 Å we obtain the maximum number of residues as 70, which results in C(70, 4) = 916895 possible combinations (roughly twice the number for a SOAS of 10 Å).
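The combination counts quoted above can be checked with the binomial coefficient from the Python standard library. The variable names are illustrative; the neighbour counts (58 and 70) are the maxima reported in the text.

```python
from math import comb

# Worst-case number of motifs generated for a single sphere center.
soas10_len4 = comb(58, 4)   # 58 neighbours at 10 Å, motif length 4
soas10_len5 = comb(58, 5)   # same sphere, motif length 5
soas11_len4 = comb(70, 4)   # 70 neighbours at 11 Å, motif length 4

# Growth factors relative to the baseline (10 Å, length 4).
length_factor = soas10_len5 / soas10_len4   # about 10.8
radius_factor = soas11_len4 / soas10_len4   # about 2.2
```

Lengthening the motif by one residue multiplies the worst case by (58 - 4) / 5 = 10.8, whereas widening the sphere by 1 Å roughly doubles it, which is why motif length is the more expensive parameter to increase.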

Running CLASP using the PREMONITION modified algorithm

We queried trypsin (PDBid:1A0J) with all 500 motifs using the modified CLASP algorithm19, which reads candidate motifs from the PREMONITION database. The best matches are shown in Table 2. As expected, the best matches are those with known serine catalytic triads. As an illustrative example, we chose a protein (thioesterase - PDBid:1THT) with no known relationship with trypsin. This protein has the active site motif H241 S114 V136 W213, which corresponds to the string query ‘HSVW’ (note that the string is sorted). We extract all entries for this string (Equation 5), which are all possible occurrences of the structural motif ‘HSVW’ in the protein.

Table 2. Best matches when trypsin (PDBid:1A0J) is queried using 500 motifs from the CSA database.

As expected, the best matches are those with known catalytic triads.

PDB    Length   Description                             CLASP Score
1A0J   223      Trypsin                                 0
1SSX   198      Alpha-lytic protease                    0.2
1AZW   313      Proline iminopeptidase                  0.7
1MEK   120      Protein disulfide isomerase             1
2LPR   198      Alpha-lytic protease                    1.2
1SCA   274      Subtilisin carlsberg                    1.2
1C4X   285      2-Hydroxy-6-oxo-6-phenylhexa-2,4-die    1.3
1A7U   277      Chloroperoxidase T                      1.3
1LJL   131      Arsenate reductase                      1.3
1QJ4   257      Hydroxynitrile lyase                    1.3
1RGQ   200      NS3 Protease                            1.3
1JKM   361      Esterase                                1.4
1EH5   279      Palmitoyl protein thioesterase 1        1.4
2AAT   396      Aspartate aminotransferase              1.4
1THT   305      Thioesterase                            1.5

      ϕMotifsHSVW = [(71, 26, 121, 141), (71, 26, 138, 141), (57, 214, 212, 215)...]        (5)
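The lookup behind Equation 5 can be sketched as follows. This is a hedged illustration: `query_key` and `candidates` are hypothetical helper names, and the `precompiled` dictionary is a toy stand-in that holds only the three entries listed in Equation 5 (the source list continues beyond them).

```python
def query_key(motif):
    """Turn residue labels such as 'H241' into the sorted lookup string."""
    return "".join(sorted(label[0] for label in motif))

def candidates(table, motif):
    """Return every precompiled occurrence of the motif's amino acid
    composition, or an empty list when the string was never seen."""
    return table.get(query_key(motif), [])

# Toy stand-in for the precompiled table of trypsin (PDBid:1A0J).
precompiled = {
    "HSVW": [(71, 26, 121, 141), (71, 26, 138, 141), (57, 214, 212, 215)],
}

thioesterase_motif = ["H241", "S114", "V136", "W213"]
hits = candidates(precompiled, thioesterase_motif)
```

Each tuple in `hits` is then a candidate to be scored, so the expensive enumeration is replaced by one dictionary access per query motif.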

All entries of ‘HSVW’ are compared using CLASP. Table 3 shows the electrostatic potential difference and spatial difference in each of the motifs in Equation 5. The best scoring motif is (H57ND1,S195OG,V213N,W215NE1). However, even the best motif has a relatively large RMSD (Figure 2). Thus, this is not a significant match.

Table 3. Potential and spatial congruence of the motif (H241 S114 V136 W213) from a thioesterase (PDB:1THT) in a trypsin protein (PDB:1A0J).

This motif corresponds to the key “HSVW”. D = Pairwise distance in Å. PD = Pairwise potential difference. APBS writes out the electrostatic potential in dimensionless units of kT/e where k is Boltzmann’s constant, T is the temperature in K and e is the charge of an electron.

PDB    Active site atoms (a,b,c,d)                    ab       ac       ad       bc       bd       cd
1THT   H241ND1, S114OG, VAL136N, W213NE1      D       4.7      6.1      6.3      7.2      9.0     12.3
                                              PD     17.0    -97.0    -26.5   -114.0    -43.5     70.5
1A0J   H71ND1, S26OG, VAL121N, W141NE1        D       7.1     16.3      6.4     13.4     11.9     17.5
                                              PD   -288.7   -334.9   -227.3    -46.2     61.4    107.6
       H71ND1, S26OG, VAL138N, W141NE1        D       7.1     13.4      6.4     10.1     11.9     13.5
                                              PD   -288.7   -346.2   -227.3    -57.5     61.4    118.9
       H57ND1, S195OG, V212N, W215NE1         D       4.8      7.9      7.9      6.7      9.6      9.7
                                              PD    -20.7   -130.7    -52.3   -110.0    -31.6     78.4

Figure 2. Superimposing thioesterase (PDBid:1THT, in blue) and trypsin (PDBid:1A0J, in green).

The proteins are superimposed based on the matching active site residues using DECAAF28. (a) The global superimposition does not show any significant homology. (b) The detailed residue configuration. Residues from trypsin are in red, and those in the thioesterase are in blue. His57 and His241 completely overlap and are in black.

Runtimes and disk space

Previously, we had proposed a computational method (PROMISE) to estimate the promiscuity of proteins with known active site residues and 3D structure using the CSA database25. PROMISE used each of the 500 proteins with known active site residues extracted from the CSA database to query every other protein in that set. This procedure required ~500*500=250000 program calls of CLASP19, each of which took one minute on average. Thus, the total time taken was a month on a parallel system25. Using PREMONITION, it took less than a minute to query one protein with all 500 motifs. Thus, it took less than a day to replicate the PROMISE results. The precompilation step of extracting all motifs takes approximately 15 minutes on average for a single protein. For the protein with the largest SOAS (PDBid:1GPJ - 13 Å), the precompilation took one hour. These are acceptable values for a one-time processing step. The disk space for one protein PREMONITION file (zipped) is 14MB on average (7GB for 500 proteins).
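The runtime and storage figures above follow from simple arithmetic, sketched below. The per-call and per-protein timings are the values quoted in the text; the assumption that work is done sequentially on a single core is ours, and the total precompilation time is derived here rather than reported in the source.

```python
MINUTES_PER_DAY = 60 * 24
n_proteins = 500

# Original PROMISE run: every protein queried with every motif, one CLASP call each.
clasp_calls = n_proteins * n_proteins
sequential_days = clasp_calls * 1 / MINUTES_PER_DAY     # 1 minute per call

# PREMONITION: under one minute per protein for all 500 motifs (upper bound).
query_hours_upper = n_proteins * 1 / 60

# One-time costs: about 15 minutes and 14 MB (zipped) per protein on average.
precompile_hours = n_proteins * 15 / 60
disk_gb = n_proteins * 14 / 1000
```

Under these assumptions the sequential baseline is roughly 174 days of compute, which is consistent with about a month on a parallel system, while the query phase stays below 8.4 hours after a one-time precompilation of about 125 hours and 7 GB of disk.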

how to cite this article
Chakraborty S, Rao BJ, Asgeirsson B et al. PREMONITION - Preprocessing motifs in protein structures for search acceleration [version 1; peer review: 2 approved with reservations, 1 not approved]. F1000Research 2014, 3:217 (https://doi.org/10.12688/f1000research.5166.1)

Open Peer Review

Reviewer Report 24 Mar 2015
Xavier Barril, Departament de Fisicoquímica and Institut de Biomedicina (IBUB), Facultat de Farmàcia, Universitat de Barcelona, Barcelona, Spain
Approved with Reservations
I concur with all comments made by the first referee. Additionally, it would be necessary to demonstrate that the method provides sound results using a benchmark set, comparing the results obtained with the original CLASP implementation. As it is, it ...
Barril X. Reviewer Report For: PREMONITION - Preprocessing motifs in protein structures for search acceleration [version 1; peer review: 2 approved with reservations, 1 not approved]. F1000Research 2014, 3:217 (https://doi.org/10.5256/f1000research.5510.r7706)
Reviewer Report 24 Mar 2015
Juliana Bernardes, Programa de Engenharia de Sistemas e Computação, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil 
Not Approved
The work proposes a simple procedure for accelerating the search for structural motifs. It pre-compiles all motifs of size n within a radius R from protein structures and use this motif table to detect faster matches between a query sequence ...
Bernardes J. Reviewer Report For: PREMONITION - Preprocessing motifs in protein structures for search acceleration [version 1; peer review: 2 approved with reservations, 1 not approved]. F1000Research 2014, 3:217 (https://doi.org/10.5256/f1000research.5510.r7709)
Reviewer Report 06 Oct 2014
Stefano Ciurli, Laboratory of Bioinorganic Chemistry, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy 
Francesco Musiani, Laboratory of Bioinorganic Chemistry, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy 
Approved with Reservations
The manuscript from Chakraborty et al. reports an algorithm aimed to accelerate the search in protein’s active site data bases. The algorithm precompiles all the possible motifs comprising a set of n=4 amino acids.

Major points:
  • The choice of considering combination of ...
Ciurli S and Musiani F. Reviewer Report For: PREMONITION - Preprocessing motifs in protein structures for search acceleration [version 1; peer review: 2 approved with reservations, 1 not approved]. F1000Research 2014, 3:217 (https://doi.org/10.5256/f1000research.5510.r6329)