Protein structure quality assessment based on the distance profiles of consecutive backbone Cα atoms

Predicting the three-dimensional native state structure of a protein from its primary sequence is an unsolved grand challenge in molecular biology. Two main computational approaches have evolved to obtain the structure from the protein sequence, ab initio/de novo methods and template-based modeling, both of which typically generate multiple possible native state structures. Model quality assessment programs (MQAP) validate these predicted structures in order to identify the correct native state structure. Here, we propose an MQAP for assessing the quality of protein structures based on the distances of consecutive Cα atoms. We hypothesize that the root-mean-square deviation of the distance of consecutive Cα atoms (RDCC) from the ideal value of 3.8 Å, derived from a statistical analysis of high quality protein structures (top100H database), is minimized in native structures. Based on tests with the top100H set, we propose an RDCC cutoff value of 0.012 Å, above which a structure can be filtered out as a non-native structure. We applied the RDCC discriminator on decoy sets from the Decoys 'R' Us database to show that the native structures in all decoy sets tested have RDCC below the 0.012 Å cutoff. While most decoy sets were either indistinguishable using this discriminator or had very few violations, all the decoy structures in the fisa decoy set were discriminated by applying the RDCC criterion. This highlights the physical non-viability of the fisa decoy set, and possible issues in benchmarking other methods using this set. The source code and manual are available at https://github.com/sanchak/mqap and permanently archived at doi:10.5281/zenodo.7134.

The structure of a protein is a veritable source of information about its physiological relevance in the cellular context 1. In spite of rapid technical advances in crystallization techniques, the number of known protein sequences far exceeds the number of known structures. There are essentially two different computational approaches to predict the structure of a protein from its primary sequence: 1) template based methods (TBM), which are based on features obtained from the database of known protein structures 2-4, and 2) ab initio or de novo methods, which are based on the intrinsic laws governing atomic interactions and are applicable in the absence of a template structure with significant sequence homology 5,6. While at present TBM methods fare much better than the de novo approaches, the requirement of a known template protein can sometimes be a constraining factor. Both these methods typically generate multiple possibilities for the native structure of a given sequence. Selecting the best candidate from the set of putative structures is an essential step that is performed by model quality assessment programs (MQAP).
MQAPs can be classified as energy based, consensus based or knowledge based. The refinement of structures based on modeling of atomic interactions in energy based methods, such as molecular dynamics simulations, is subject to limited sampling of possible conformations due to long run times, and to force field inaccuracies due to the approximations involved in describing the dynamics of large multi-atomic systems 7-10. Consensus methods are based on the principle that sub-structures of the native structure are likely to feature frequently in a set of near-native structures 11-14. These methods are currently the best performing amongst MQAPs 13, but are prone to be computationally intensive due to the structure-to-structure comparison of all models 14, and are of limited use when the number of possible structures is small 15. Knowledge based methods rely on the assignment of an empirical potential (also known as a statistical potential) from the frequency of residue contacts in the known structures of native proteins 16,17. In statistical physics, for a system in thermodynamic equilibrium, the accessible states are populated with a frequency which depends on the free energy of the state and is given by the Boltzmann distribution. The Boltzmann hypothesis states that if the database of known native protein structures is assumed to be a statistical system in thermodynamic equilibrium, specific structural features would be populated based on the free energy of the protein conformational state. Sippl argued using a converse logic that the frequencies of occurrence of structural features such as interatomic distances in the database of known protein structures could determine a free energy (potential of mean force) for a given protein conformation, and thus be used to discriminate the native structure 18,19. A crucial aspect in applying statistical potentials is the proper characterization of the reference state 20.
The application of such empirical energy functions to predict and assess protein structures, while quite popular, is vigorously debated 21,22, and several approaches for using statistical potentials for protein structure prediction have been described to date 20,23-26.
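For reference, the converse (inverse Boltzmann) argument described above is commonly written in the following generic form. This is the textbook relation and not a formula taken from this paper; the symbols f_obs (observed frequency of a structural feature, such as an interatomic distance r, in the database), f_ref (its frequency in the chosen reference state), k_B and T are notation assumed here for illustration.

```latex
% Generic inverse-Boltzmann form of a potential of mean force (assumed notation)
\[
  E(r) \;=\; -\,k_{B} T \,\ln\!\left( \frac{f_{\mathrm{obs}}(r)}{f_{\mathrm{ref}}(r)} \right)
\]
```

Features that occur more often in known structures than in the reference state receive a negative (favorable) energy, which is why the choice of reference state matters so much.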
Here, we propose a new statistical potential based MQAP for assessing the quality of protein structures based on the distances of consecutive Cα atoms: Protein structure quality assessing based on Distance profile of backbone atoms (PROQUAD). We first propose a statistical potential based on the distances between consecutive Cα atoms. In a set of high quality protein structures (top100H 27), we demonstrate that the distances between consecutive Cα atoms are distributed normally with a mean of 3.8 Å and a standard deviation of 0.04 Å. Based on this observation, the reference state for our statistical potential calculations is defined as one where all consecutive Cα atoms are 3.8 Å apart. We propose a scoring function which measures the deviation of consecutive Cα atoms from 3.8 Å, and hypothesize that this score is minimized in native structures. Based on the top100H database, we chose a cutoff of 0.012 Å for this scoring function to identify non-native states. We show that all the decoy structures from the fisa decoy set taken from the Decoys 'R' Us database 28 are distinguished using this discriminator. It has been previously proposed that native structures have constrained interatomic distances 29. Interatomic distances, and other metrics, have been combined in several such methods: MolProbity (http://molprobity.biochem.duke.edu/), PROSA (https://prosa.services.came.sbg.ac.at/prosa.php) and the WHATIF server (http://swift.cmbi.ru.nl/whatif) 30-32. These identify possible anomalies in a given protein structure. While MolProbity and WHATIF identified steric clashes in the decoy structures in fisa, distance checks between consecutive Cα atoms are not part of the checks in these methods, and they failed to detect the consecutive Cα atom anomaly in the fisa decoy set.
Thus, we propose a simple and fast discriminator for protein structure quality based on the distance profiles of consecutive backbone Cα atoms that identifies decoy structures that are physically nonviable.

Results and discussion
The frequency distribution of the distance of consecutive Cα atoms in ~100 proteins in the top100H database (a database consisting of high quality structures) 27 shows that the distances between consecutive Cα atoms are distributed normally with a mean of 3.8 Å and a standard deviation of 0.04 Å (Figure 1a). Out of 16,162 pairs of consecutive Cα atoms, 14,281 (88%) were spaced 3.8 Å apart, 1,297 (8%) were spaced 3.9 Å apart and 553 (3%) were spaced 3.7 Å apart. Only 31 (0.1%) pairs had distances different from these (the highest being 4 Å and the lowest being 2.9 Å). It would be interesting to correlate these distance-deviant residue pairs to structural or functional aspects of the protein. It is well worth examining every outlier and either correcting it if possible, giving up gracefully if it really can't be improved (more often true at low resolution), or celebrating the significance of why it is being held in an unfavorable conformation 33.

Amendments from Version 2
We have implemented the following changes to add data in accordance with comments from Dr Rafael Najmanovich:
1). We obtained a set of PDB structures from the PISCES database (http://dunbrack.fccc.edu/PISCES.php), which provides precompiled sets of structures below a certain resolution and with a certain homology cut-off.
2). We have binned the structures based on resolution into different sets.
3). We have plotted the frequency distribution of the RDCC for each of these sets and display them in a new figure (Figure 1e).
Incidentally, and as expected, we could not detect any correlation between RDCC and the resolution of the protein structure.
The cis conformations of peptide bonds are mostly responsible for these deviations. For example, in the protein concanavalin B (PDBid:1CNV), there are four violations of the 3.8 Å constraint: Ile33/Ser34 (4 Å), Ser34/Phe35 (3 Å), Pro56/Ser57 (4 Å) and Trp265/Asn266 (3.4 Å). All these deviations are noted in the PDB file as footnotes, mentioning that 'peptide bond deviates significantly from trans conformation' 34. Another example is the Glu223/Asp224 violation in PDBid:1ADS, which is between two cis prolines (as noted in the PDB file) 35. However, these conformations are rare and not expected to occur frequently in a protein structure.
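The kind of check described above, listing the consecutive Cα pairs whose spacing departs from 3.8 Å, can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names, the toy coordinates and the 0.15 Å tolerance are assumptions introduced here, and a real run would first read the Cα coordinates from a PDB file.

```python
import math

IDEAL_CA_DISTANCE = 3.8  # Å, ideal consecutive Cα spacing reported in the text


def consecutive_ca_distances(ca_coords):
    """Return the distance (Å) between each pair of consecutive Cα atoms."""
    return [math.dist(a, b) for a, b in zip(ca_coords, ca_coords[1:])]


def deviant_pairs(ca_coords, tolerance=0.15):
    """Return (pair_index, distance) for pairs deviating from 3.8 Å.

    The tolerance is a hypothetical value chosen for this sketch so that
    the common 3.7 Å and 3.9 Å spacings are not reported.
    """
    return [
        (i, d)
        for i, d in enumerate(consecutive_ca_distances(ca_coords))
        if abs(d - IDEAL_CA_DISTANCE) > tolerance
    ]


# Toy Cα trace along the x-axis (hypothetical data): the last pair is
# only 3.0 Å apart, similar to the short spacings seen next to cis bonds.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (10.6, 0.0, 0.0)]

for index, distance in deviant_pairs(coords):
    print(f"pair {index}/{index + 1}: {distance:.2f} Å")
```

Reporting the pair index together with the distance makes it straightforward to map each outlier back to residues and check, for example, whether a cis peptide bond is annotated there.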
Figure 1b plots the root-mean-square deviation of the distance of consecutive Cα atoms (RDCC) for these ~100 proteins. All structures in the top100H database have low RDCC values, barring three proteins (PDBids: 2ER7, 1XSO and 4PTP), which had multiple conformations for some residues and were excluded from the processing. This validates our hypothesis that RDCC is minimized in native structures.
Figure 2 shows the RDCC for the hg_structal and fisa decoy sets from the Decoys 'R' Us database. All 500 decoy structures for each protein in the fisa decoy set are discriminated by applying the RDCC criterion. Figure 3 shows the superimposition of the native structure and the first decoy structure (AXPROA00-MIN) for a protein (PDBid:1FC2) taken from the fisa decoy set. The distance between the Ile12/Cα and Leu13/Cα atoms is 3.8 Å and 4.1 Å in the native and the decoy structures, respectively. According to our hypothesis, a 4.1 Å distance between consecutive Cα atoms is typically unfeasible in protein structures, and such occurrences should be relatively rare.
The presence of such deviations throughout a protein structure categorizes it as a non-native structure. MolProbity 30 and PROSA 31 are two programs used as a pre-processing step for structures used in CASP 38. MolProbity was able to discriminate the decoy structure (AXPROA00-MIN) from the native structure (PDBid:1FC2) using a metric called the ClashScore (the number of serious steric overlaps) and the Cβ deviations 39. PROSA was unable to discriminate between the decoy and the native structures, reporting comparable Z-scores of -4.12 and -5.28, respectively. The WHATIF server also reports steric clashes in the decoy structures (Data File 1). None of the above mentioned methods use a metric similar to the RDCC proposed in this paper, and thus they did not report the abnormal distance between consecutive Cα atoms in the decoy structure.
The hg_structal and misfold decoy sets are indistinguishable using this distance discriminator, while only a few decoy structures failed in the 4state_reduced decoy set. The relationship between RDCC and protein structure quality is therefore not an equivalence relationship.
In propositional calculus, 'A' and 'B' are equivalent if 'A' implies 'B' and 'B' implies 'A'. A high RDCC implies a low quality structure, but a low quality structure does not necessitate a high RDCC. We therefore suggest using the RDCC measure as a first pass to rule out non-native structures prior to applying other discriminators.
Hence, structures that have an RDCC value above a user-specified threshold can be pruned out as low quality or non-native structures.
We evaluated the results using the measure of specificity (the ability of a test to identify negative results), which is defined as Specificity = TN / (TN + FP), where TN is the number of true negatives and FP is the number of false positives. The variation of the specificity with the chosen cutoff is shown in Figure 1c. We chose 0.012 Å as the cutoff value for RDCC, which has a specificity of 1. We also plot the RDCC of the 121 test cases (Figure 1d). Table 1 shows the mean and standard deviation of RDCC for sets of structures binned by resolution, and demonstrates that the RDCC values are independent of the resolution of the structure under consideration.
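The specificity calculation can be illustrated with a short Python sketch. It assumes that native structures form the negative class, so that a native structure scoring above the cutoff counts as a false positive; the function names and the toy scores are hypothetical and are not taken from the paper's data.

```python
def specificity(true_negatives, false_positives):
    """Specificity = TN / (TN + FP)."""
    total = true_negatives + false_positives
    if total == 0:
        raise ValueError("need at least one negative example")
    return true_negatives / total


def specificity_at_cutoff(native_scores, cutoff):
    """Specificity of the rule 'score above cutoff means non-native'.

    Assumption of this sketch: every score belongs to a native structure,
    so a score at or below the cutoff is a true negative and a score
    above it is a false positive.
    """
    tn = sum(1 for score in native_scores if score <= cutoff)
    fp = len(native_scores) - tn
    return specificity(tn, fp)


# Hypothetical scores for four native structures, scanned over cutoffs.
toy_scores = [0.002, 0.004, 0.010, 0.015]
for cutoff in (0.005, 0.012, 0.020):
    print(cutoff, specificity_at_cutoff(toy_scores, cutoff))
```

Scanning the cutoff in this way gives a specificity curve of the kind referred to in Figure 1c; the smallest cutoff at which the specificity reaches 1 is the tightest threshold that rejects no native structure in the set.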
We have applied this cutoff on decoy sets from the Decoys 'R' Us database 28. The first protein (the native structure) in all decoy sets has RDCC below the 0.012 Å cutoff (Figure 1c).

Figure 2. The hg_structal and misfold decoy sets are indistinguishable using the distance discriminator, unlike the fisa decoy set. We have shown ~25 decoy structures from the fisa set, but the values apply to all the decoys (more than 500). The first protein (the native structure) in each set has RDCC below the 0.012 Å cutoff.

Table 1. Mean and standard deviation (SD) of RDCC values for structures based on resolution.
The number N signifies the number of protein structures analyzed that have a resolution less than the specified number, but more than the previous one. For example, there are 165 proteins with less than 1 Å resolution, 682 proteins with more than 1 Å but less than 1.5 Å resolution, and so on.

It has been previously shown that the fisa decoy set violates the van der Waals term 57. We propose a fast complementary method to identify this transgression. It is also an interesting fact that most consensus methods will fare poorly on the fisa decoy set, since the majority of sub-structures are incorrect in all the decoy sets. Therefore, the fisa decoy set consists of physically nonviable structures and one should exercise caution when benchmarking other methods using this decoy set 58.

Materials and methods
The set of proteins Φ_proteins consists of the native structure P_1 and M-1 decoy structures (Equation 2). We ignore the first x = IgnoreNTerm and last y = IgnoreCTerm pairs of residues in the protein structure to exclude the termini (Equation 3). For every consecutive pair of residues in the structure we calculate the distance between the consecutive Cα atoms (Res_n(Cα) and Res_n+1(Cα)), and its deviation from the ideal value of 3.8 Å. The square of the summation of these deviations is then normalized based on the number of pairs processed, and results in the CADistScore. We hypothesize that CADistScore_P1 is minimum in a native structure (Equation 4). Algorithm 1 shows the pseudocode for the function that generates the CADistScore.
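Since the equations and Algorithm 1 are not reproduced here, the following Python sketch shows one way such a score could be computed. It is an assumption-laden illustration, not the authors' code: it follows the root-mean-square definition of RDCC given in the abstract, which is not literally the "square of the summation" wording above, so its values may not be directly comparable with the published 0.012 Å cutoff. The function name rdcc and the parameters ignore_n_term and ignore_c_term (mirroring IgnoreNTerm and IgnoreCTerm) are names chosen for this example.

```python
import math


def rdcc(ca_coords, ideal=3.8, ignore_n_term=0, ignore_c_term=0):
    """Root-mean-square deviation of consecutive Cα distances from `ideal` (Å).

    ca_coords      -- sequence of (x, y, z) Cα coordinates in chain order
    ignore_n_term  -- number of leading residue pairs to skip
    ignore_c_term  -- number of trailing residue pairs to skip
    """
    distances = [math.dist(a, b) for a, b in zip(ca_coords, ca_coords[1:])]
    kept = distances[ignore_n_term:len(distances) - ignore_c_term]
    if not kept:
        raise ValueError("no residue pairs left after trimming the termini")
    mean_square = sum((d - ideal) ** 2 for d in kept) / len(kept)
    return math.sqrt(mean_square)


# Hypothetical Cα traces laid out along the x-axis.
ideal_chain = [(3.8 * i, 0.0, 0.0) for i in range(5)]      # every pair 3.8 Å apart
stretched_chain = [(4.1 * i, 0.0, 0.0) for i in range(5)]  # every pair 4.1 Å apart

print(rdcc(ideal_chain))
print(rdcc(stretched_chain))
```

A chain whose pairs are all 4.1 Å apart, like the Ile12/Leu13 pair of the fisa decoy discussed earlier, scores far higher than an ideally spaced chain under this definition, which is the behaviour the discriminator relies on.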
The model quality assessment program (MQAP) used to choose the best structure from the multiple closely related structures generated by structure prediction programs is of critical importance. We have in the past used electrostatic congruence to detect a promiscuous serine protease scaffold in alkaline phosphatases 40 and a phosphoinositide-specific phospholipase C from Bacillus cereus 41, and a scaffold recognizing a β-lactam (imipenem) in a cold-active Vibrio alkaline phosphatase 42,43. However, continuum models 44 that compute potential differences and pKa values from charge interactions in proteins 45 are sensitive to the spatial arrangement of the atoms in the structure.
Thus, an incorrect model will generate an inaccurate electrostatic profile of the peptide 46. It is thus possible to functionally characterize a protein from its sequence by applying such in silico tools subsequent to the protein structure prediction and MQAP tools 47.
The estimation of the model quality by MQAPs is achieved by formalizing a scoring function 48, referred to as a knowledge-based or statistical potential, constructed from the database of known structures, assuming that the distribution of the structural features obtained from these structures follows the Boltzmann distribution 20,23,24,26. The validity of statistical potentials and the method to choose a proper reference frame in such models are still widely debated 21,22. Methods that use consensus values from numerous models outperform other MQAP methods 11-14, and are 'very useful for structural metapredictors' 49. It has been shown that many of the MQAP programs perform considerably better when different statistical metrics are combined 50-53. The state of the art methods for predicting structures 54 and MQAPs 38,49,55 are evaluated by researchers every two years.
Here, we propose a discriminator (RDCC) based on the distance of consecutive Cα atoms in the peptide structure. The discriminator is independent of the database of structures 56, and is thus an absolute discriminator. Our proposed RDCC criterion is satisfied in high quality protein structures taken from the top100H database. As a specific application, we show that all decoy structures in the fisa decoy set from the Decoys 'R' Us database do not satisfy this criterion.
In order to validate our hypothesis on known structures, we applied our discriminator to the top100H database (a database consisting of high quality structures) 27 (http://kinemage.biochem.duke.edu/databases/top100.php). In order to benchmark model quality assessment programs, we used decoy sets from the Decoys 'R' Us database 28 (http://dd.compbio.washington.edu/). Each set has several structures that are supposed to be ranked worse than the native structure.
The source code and manual are available at https://github.com/sanchak/mqap.

Author contributions
Conceived and performed the experiments: SC. Analyzed the data, and improved experiments: SC BA AMD BJR RV. Wrote the manuscript: SC BA AMD BJR RV.

Competing interests
No competing interests were disclosed.

This is a nice, straightforward analysis of features of protein decoy sets. The authors find that a simple, novel measure of protein geometry is sufficient to distinguish native structures from decoys in the popular fisa decoy set. A rational response to this work would be to include measures of geometry, such as that used here, in the generation of the decoys. This would make the measure less useful, but would improve decoy sets.

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
No competing interests were disclosed.
This paper proposes an absolute discriminator to identify non-native protein structures based on backbone Cα atom distances. Interestingly, the authors found that most methods performed poorly on the fisa decoy set from the Decoys 'R' Us database, and they remind researchers to be cautious when using this decoy set, since most of the sub-structures in the fisa decoy set are incorrect. This work provides a simple and fast assessment of protein structure quality.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
No competing interests were disclosed.
Thank you for taking the time to review our manuscript. We appreciate your encouraging comments. We have revised our manuscript with some additional results from the CASP8 I-TASSER decoy set, and have also cited some recent manuscripts published since version 1 went online. We hope you find the revised version improved.

Sandeep
Competing Interests: No competing interests were disclosed.