A hydrophobic proclivity index for protein alignments

David Cavanaugh; Krishnan Chittur

doi:10.12688/f1000research.6348.1

Research Article

[version 1; peer review: 1 approved with reservations, 1 not approved]

David Cavanaugh¹, Krishnan Chittur²

PUBLISHED 21 Oct 2015

¹ Benchmark Electronics, Huntsville, AL, USA
² Chemical Engineering Department, University of Alabama Huntsville, Huntsville, AL, USA

Abstract

Sequence alignment algorithms are fundamental to modern bioinformatics. Sequence alignments are widely used in diverse applications such as phylogenetic analysis, database searches for related sequences to aid identification of unknown protein domain structures and classification of proteins and protein domains. Additionally, alignment algorithms are integral to the location of related proteins to secure understanding of unknown protein functions, to suggest the folded structure of proteins of unknown structure from location of homologous proteins and/or by locating homologous domains of known 3D structure. For proteins, alignment algorithms depend on information about amino acid substitutions that allows for matching sequences that are similar, but not exact. When primary sequence percent identity falls below about 25%, algorithms often fail to identify proteins that may have similar 3D structure. We have created a hydrophobicity scale and a matching dynamic programming algorithm called TMATCH (unpublished report) that is able to match proteins with remote homologs with similar secondary/tertiary structure, even with very low primary sequence matches. In this paper, we describe how we arrived at the hydrophobic scale, how it provides much more information than percent identity matches and some of the implications for better alignments and understanding protein structure.

Keywords

Sequence alignment algorithms, hydrophobicity scale, protein homologs, TMATCH

Corresponding author: Krishnan Chittur

Competing interests: We declare that there are no competing interests for DC or KKC that have influenced the content of this article.

Grant information: The author(s) declared that no grants were involved in supporting this work.

Copyright: © 2015 Cavanaugh D and Chittur K. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Cavanaugh D and Chittur K. A hydrophobic proclivity index for protein alignments [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2015, 4:1097 (https://doi.org/10.12688/f1000research.6348.1) First published: 21 Oct 2015, 4:1097 (https://doi.org/10.12688/f1000research.6348.1) Latest published: 15 Oct 2020, 4:1097 (https://doi.org/10.12688/f1000research.6348.2)

Introduction

An understanding of the properties and functions of a protein or a nucleic acid often begins with a search of the sequence against databases of proteins (or nucleic acids) with known properties or functions. The fundamental assumption is that sequence leads to structure, which in turn leads to an understanding of function. Search algorithms have improved and continue to improve. Yet, with proteins in particular, it remains difficult to detect remote homologies in the so-called twilight zone, where percent sequence identity starts around 20–25% and descends to around 10–15%. We describe a hydrophobicity scale that is proving to be an excellent measure of sequence relatedness. A robust estimate of the hydrophobicity-based sequence identity can be calculated directly from a global alignment score and used directly in database searches. Proteins with low sequence identities but similar secondary/tertiary structures, whose similarities are not statistically significant by conventional methods such as FASTA and Smith-Waterman, can be identified as homologous using our new alignment algorithm (unpublished report) through the enhanced information content of our hydrophobicity proclivity scale.

Approach

Hydrophobicity scales (metrics) in the literature are generally divided into four categories, derived from:

1. Experimental physico-chemical data
2. Log of a partition coefficient derived from protein structure (e.g. fraction of amino-acids inside vs. outside, fraction of amino-acids in contact with water vs. completely buried, etc.)
3. Amino-acid mutation/substitution rates
4. Participation rates/probabilities of occurrence in folded protein secondary structure

A large number of scales of many types have appeared in the literature from the 1960s to the present, with a fair amount of variation amongst them. The correlation between some of the hydrophobicity scales is best understood as deriving from the energy of interaction between amino-acids and water, or from the energetics of partitioning amino-acids between water as the reference state and some other environment such as a non-polar solvent or the interior of a folded protein. Hydrophobicity can thus be joined within a single, unified, conceptual framework^1,2. Through extensive analysis (primarily using regression and scatter plots), we were able to identify patterns and arrive at metrics describing amino acid properties. We derived a number of additional metrics by distinguishing intrinsic from extrinsic metrics, as understood in thermodynamics. Extensive cross correlation of the primary and derived metrics using regression modelling was undertaken to recover the best and most meaningful hydrophobicity metrics.

We relied on several different sources for our analysis. For data on amino acid surface areas, we used Rose et al.³. Amino acid mass information was obtained from AAINDEX accession number #FASG760101^4,5. Amino acid volume data was obtained from Creighton⁶. Amino acid absolute entropy of formation was obtained from the AAINDEX database, accession number #HUTJ700102^4,5.

Methods

We arrived at our hydrophobicity scale after exhaustive analysis, which included numerous scatter plots and a number of multiple regressions. The question we were trying to answer was: what hydrophobicity scale, or combination of scales, best represents the role of the different amino acids in proteins?

We started by collecting many hydrophobicity indices and physico-chemical indices from the literature. We then scatter plotted and regressed the hydrophobicity indices against each other, against the harvested physico-chemical properties of the amino acids, and against the intrinsic properties derived from them. For example, when a hydrophobic scale is plotted against the ratio of the surface area to the specific volume (volume/molecular weight) of each amino acid, we get a scatter plot with a distinct pattern. In such a scatter plot, we can identify one or more sets of linear clusters of amino acids; each set is considered to be a “property class”.

Consider Figure 1, where our normalized average hydrophobicity index is scatter plotted against the area per specific volume of each amino acid (shown using their one-letter codes).

Figure 1. Hydrophobic Proclivities versus Area per specific volume of amino acids.

We can clearly see cross-hatched patterns. For example, the amino acids G, A, C, V, I and L are on a straight line (running from the top left to the bottom right). Moving right, we see that S, P, T, M and F are on a straight line nearly parallel to the first. Continuing further right, we see a third line which crosses several amino acids, followed by an outlier, amino acid R. This series of four lines forms what we call Property Class 1. We assign a numerical value of 0 to the line through G, A, C, V, I and L, a value of 1 to the next line, and so on. In the same Figure 1 we can see the formation of Property Class 2, which contains only two series. We arrived at Property Class 3 and Property Class 4 by scatter plotting our normalized average hydrophobicity index against specific absolute entropy (Figure 2). The four property classes identified in Figure 1 and Figure 2, along with the respective X-axis physico-chemical properties, correlated very highly (as multiple linear regression factors) with our normalized average of three robust hydrophobic indices (shown as ave3H), with an R² > 95%.

Figure 2. Hydrophobic Proclivities versus specific absolute entropy.

Property class #5 reflects a scatter plot of the delta G of burial of AA secondary groups⁷ (as Y) against the number of atoms in the respective secondary group⁶, which resulted in 5 linear series. The linear series number (0 through 4) of each AA forms the basis of property class #5. The multiple linear regression of the delta G of secondary group burial on the number of secondary group atoms and property class #5 resulted in an R² of 98.1%. Property classes #6, #7 and #8 were derived from 49 fundamental amino-acid properties and derived scales, based upon an analysis with Analysis of Patterns (ANOPA)⁸. Together, PC #1 to #8 represent the 8 X vectors in the multiple linear regression reported in the third column of Table 2. The property class index vectors are shown in Table 3.

We were able to find three hydrophobicity scales that were the most robust in the regression cross correlation study. The hydrophobicity proclivity scale that we report in the present paper is the normalized average of three normalized scales^2,9,10.

Our hydrophobic index is the result of an extensive mining of the literature about proteins and amino acid scales/metrics in different environments. Almost all hydrophobicity scales reflect in some way a measure of the energetics of transfer of an amino-acid (or proteins) from one solvent environment (water) to another (folded protein or multiple protein assembly). During our data mining and analysis, three hydrophobicity metrics emerged as the most appropriate since we could relate those scales to multiple fundamental properties of the 20 natural amino acids using multi-variate statistical procedures, thermodynamics and biophysical chemistry considerations^2,9,10. Hydrophobicity scales reflect different physical properties of amino-acids, such as metrics derived from amino-acid partitioning patterns (e.g. from the hydrophobic core to the exterior of proteins) or log of partition ratios between water and organic solvents. We found, as widely suggested in the literature, that the free energy of transfer from water to octanol turns out to be a good proxy for the hydrophobic core environment of folded proteins.

We created a normalized average of the three key hydrophobicity scales (The index i=1 is from Tang², index i=2 is from Neumaier⁹ and the index i=3 is from the average of the collected scales in Juretic¹⁰). This normalized average of three scales provides a reasonably unbiased estimate of the "true" average hydrophobicity relationship amongst the 20 amino-acids (index j, from 1 to 20)

$H_n(i, j) = \left[\frac{H_{ij} - \min(H_{ij})}{\max(H_{ij}) - \min(H_{ij})}\right]$ (1)

$H_b(i, j) = \frac{H_{1j} + H_{2j} + H_{3j}}{3.0}$ (2)
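
Equations 1 and 2 can be sketched in a few lines of Python. The three input scales below are hypothetical toy values for four residues, not the published Tang, Neumaier or Juretic numbers, and the function names are illustrative. The sketch assumes that Equation 1 is applied to each scale separately (minimum and maximum taken over the residues j), that the normalized values are what Equation 2 averages, and that all three scales are already oriented in the same direction.

```python
def min_max_normalize(scale):
    """Equation 1: rescale one hydrophobicity scale to the range [0, 1]."""
    lo, hi = min(scale.values()), max(scale.values())
    return {aa: (value - lo) / (hi - lo) for aa, value in scale.items()}


def average_scales(scales):
    """Equation 2: residue-wise mean of the normalized scales."""
    normalized = [min_max_normalize(s) for s in scales]
    return {
        aa: sum(n[aa] for n in normalized) / len(normalized)
        for aa in normalized[0]
    }


# Hypothetical toy scales (NOT literature values), each on its own arbitrary range.
toy_scales = [
    {"I": -1.0, "A": 0.0, "S": 1.0, "K": 3.0},
    {"I": 10.0, "A": 20.0, "S": 25.0, "K": 30.0},
    {"I": 0.2, "A": 0.4, "S": 0.8, "K": 1.0},
]

h_avg = average_scales(toy_scales)
for aa, value in h_avg.items():
    print(aa, round(value, 4))
```

Normalizing before averaging keeps a scale with a wide numerical range from dominating the mean, which is presumably why each constituent scale is normalized first.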

The hydrophobicity scale calculated with Equation 2 from the published scales^2,9,10 has a number of interesting relationships with key physico-chemical properties of the amino-acids in proteins. For example, this normalized average of the three best hydrophobicity metrics has a statistically significant linear correlation with many other reliable hydrophobicity metrics derived from multiple literature hydrophobicity scales.

One such scale, derived from an analysis of 28 literature hydrophobicity metrics and published previously¹, has a strong linear relationship (R² = 0.959) with our hydrophobicity proclivity scale, i.e. our normalized average of three hydrophobicity scales.

Results

Hydrophobicity scales are typically derived from a measure of the probability that a particular residue will be buried in the core of the protein, away from water. What confounds these calculations is the fact that in most proteins, many of the hydrophobic residues are still exposed to the water (solvent). It is often not clear how to treat residues that have properties intermediate between hard core hydrophobic and polar residues. The size of the residues and the difference between alkyl and aromatic residues also pose some difficulty in the calculation of a hydrophobicity scale. Cysteine residues add further complexity, in that some of them may provide structural stability to proteins through the formation of disulfide bonds. Thus, calculating contributions to any hydrophobicity index by analysing where specific residues sit in a given protein is complicated, and this contributes to the scatter we see in the data. We demonstrate this by plotting the normalized average of several popular hydrophobicity scales^11–16 against the probability of an amino-acid solvent-exposed area (SEA)^17,18 greater than 30 Å² (shown in Figure 3).

Figure 3. Normalized average of several hydrophobic scales with Solvent Exposed area.

Figure 3 shows that there is indeed a relationship between the hydrophobicity scale and whether or not a particular amino acid is within a protein core or exposed on the surface. We see one tight grouping of amino acids in the figure (I, F, V, L, M, W, A and G) and two loose groupings that include P, T, S, Y, H and N, Q, E, D, K and R. The group at the top right (N, Q, E, D, K and R) includes amino acids that are ionic/strongly polar, and the central group of amino acids is of intermediate polarity. The tight group consists primarily of amino acids with hydrophobic residues. As we go from the very hydrophobic group to the less hydrophobic group (from the lower left to the top right) the scatter goes up. This scatter is indicative of the increase in water-amino acid interaction and of the difficulty of accurately calculating the contribution of any particular residue.

In Figure 4 we show a scatter plot of our amino-acid hydrophobicity proclivities against the popular Fauchere & Pliska free energy of amino-acid transfer from n-Octanol to water (Gtow) scale^7,19. It is common in the literature to see n-Octanol used as a proxy for the typical hydrophobic core of folded globular proteins; consequently the Gtow scale has been widely used as a measure of hydrophobicity. As can be seen in Figure 4, the correlation is quite good, at 85.9% linearity (coefficient of determination). The regression of these two scales is used to derive a fitted free energy of transfer, which is reported in Table 1 and used in our new alignment algorithm. Since Gtow reflects a delta G (energy) of transfer, hydrophobic proclivities can also be seen to relate directly to energy.

Figure 4. Hydrophobic Proclivities versus Structure F & P Gtow.

Table 1. Table of Regression Fitted Hydrophobic Proclivities.

Residue Amino Acid	Hydrophobicity (H)	Regression Fitted ΔG
F (Phenylalanine)	0.0688	2.5658
L (Leucine)	0.0579	2.6095
I (Isoleucine)	0.0349	2.7022
M (Methionine)	0.2213	1.9528
V (Valine)	0.1427	2.2687
P (Proline)	0.7123	-0.0212
T (Threonine)	0.6599	0.1895
S (Serine)	0.7074	-0.0018
A (Alanine)	0.4925	0.8624
Y (TYrosine)	0.4523	1.0237
H (Histidine)	0.6763	0.1232
Q (Glutamine)	0.8692	-0.6522
N (AsparagiNe)	0.8350	-0.5148
K (Lysine)	0.9651	-1.0376
D (Aspartic AciD)	0.9157	-0.8393
E (Glutamic Acid)	0.8974	-0.7657
C (Cysteine)	0.2650	1.7769
W (Tryptophan)	0.3403	1.4742
R (ARginine)	0.9091	-0.8126
G (Glycine)	0.6582	0.1961
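
The regression-fitted ΔG column of Table 1 appears to be a linear function of the H column, and this can be checked with an ordinary least-squares fit. The sketch below copies both columns from Table 1 and recovers the line relating them; the helper name `least_squares` is illustrative, and the fit is done over the table values rather than over the original Fauchere & Pliska data, which are not reproduced here.

```python
# residue: (hydrophobicity H, regression-fitted delta G), copied from Table 1
TABLE_1 = {
    "F": (0.0688, 2.5658), "L": (0.0579, 2.6095), "I": (0.0349, 2.7022),
    "M": (0.2213, 1.9528), "V": (0.1427, 2.2687), "P": (0.7123, -0.0212),
    "T": (0.6599, 0.1895), "S": (0.7074, -0.0018), "A": (0.4925, 0.8624),
    "Y": (0.4523, 1.0237), "H": (0.6763, 0.1232), "Q": (0.8692, -0.6522),
    "N": (0.8350, -0.5148), "K": (0.9651, -1.0376), "D": (0.9157, -0.8393),
    "E": (0.8974, -0.7657), "C": (0.2650, 1.7769), "W": (0.3403, 1.4742),
    "R": (0.9091, -0.8126), "G": (0.6582, 0.1961),
}


def least_squares(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


h_values = [h for h, _ in TABLE_1.values()]
dg_values = [dg for _, dg in TABLE_1.values()]
slope, intercept = least_squares(h_values, dg_values)
print(f"fitted dG ~ {slope:.3f} * H + {intercept:.3f}")
```

With the table values as given, the fit comes out at a slope of roughly -4.02 and an intercept of roughly 2.84, so residues with H near 0 (hydrophobic) receive the largest positive fitted ΔG.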

The reasonableness of our hydrophobicity scale is also demonstrated by examining the relationship between our scale and the mean residue depth (dpx), defined as the distance between the interior of a protein amino-acid and the nearest water molecule in the aqueous shell surrounding the protein^20,21. In Figure 5 we show that there is a strong relationship (97% linearity) between the dpx metric and our hydrophobic proclivities. The dpx metric is a straightforward geometrical description of the local protein interior and can be expected to provide similar information to the solvent accessible area and buried surface area metrics. The dpx depth and hydrophobic proclivities correlate with amino-acid/protein properties such as average protein domain size, secondary structure, protein stability, free energy of formation of protein complexes, major literature amino-acid hydrophobicity scales, residue conservation, post-translational modifications like phosphorylation, and hydrogen/deuterium amide proton exchange rates^7,20,21.

Figure 5. Hydrophobic Proclivities versus Structure based mean residue depth.

In Table 2 we summarize the performance of several of the hydrophobicity scales published in the literature. The hydrophobicity scales shown as rows are compared with four important quality metrics that are either amino acid physico-chemical properties or derived from such properties. The quality of the inter-scale regressions is shown as R². The performance of each row scale can be judged relative to the other row scales within each of the four columns: the higher the R², the better the performance of the row scale with regard to the column scale. There are 13 rows in Table 2, representing 11 hydrophobicity scales, one solvent exposed area scale and one delta G of transfer from water to an organic solvent (Octanol).

Table 2. Linear correlation between hydrophobicity scales and AA physico-chemical properties.

H Scale	Moelbert ASA R²	F & P C₈OH R²	8 AA Property Class R²	6 factor R²
Chothia	44.1%	58.6%	86.0%	88.6%
Kyte-Doolittle	61.9%	65.7%	97.6%	94.8%
Janin	56.2%	68.3%	84.1%	80.7%
Juretic Avg	63.7%	69.2%	97.9%	94.9%
SEA >30	70.7%	72.7%	92.5%	89.4%
Engelman-Steitz	53.0%	72.8%	78.3%	87.4%
Eisenberg-Weiss	56.4%	76.1%	86.4%	71.3%
Rose Avg% buried	86.1%	81.7%	88.1%	86.8%
Hopp-Woods	71.7%	82.7%	69.2%	71.3%
Tang Q	86.0%	84.6%	96.7%	91.3%
avg 3H	84.7%	85.9%	99.3%	95.8%
Neumaier X	90.2%	89.1%	97.6%	94.2%
F & P del G C₈OH	85.3%	100.0%	94.1%	89.6%

Of the 11 hydrophobicity scales in Table 2, seven are popular scales in practice, three are the constituent scales of our hydrophobicity proclivity scale, and one is our hydrophobicity proclivity scale itself. The row choices in Table 2 illustrate the close relationship between AA hydrophobicity and the transfer of an amino acid to an organic solvent (n-Octanol, column 2), used as a proxy for the internal environment of a folded protein. They also allow AA hydrophobicity to be compared with an AA Solvent Exposed Area scale (column 1), which likewise represents a folded protein environment. The high R² between the row dG of transfer to Octanol and the first column AA Solvent Exposed Area (SEA) scale illustrates the aptness of comparing the dG of AA burial in protein "solvent" to a solvent-solvent transfer model, with water as the reference state and an organic solvent as the transferred or final state. The row SEA is included to show its high R² with the first column SEA, which illustrates the consistency of folded protein behaviour in SEA scales derived from different data sets. The Rose AA percent buried row hydrophobicity scale³ teaches similar lessons to the row Octanol and SEA scales, as the Rose scale also represents the environment of a folded protein. The very high R² between these three row scales and the last two column regression scales in Table 2 is a strong justification for including these row scales, as protein folding is thereby strongly linked with other physico-chemical properties of amino-acids, as reflected by these two columns. We describe the regression X variables in the 4 columns of Table 2 below.

We can see that the correlation between our hydrophobicity scale (shown as avg 3H in Table 2) and the Moelbert average amino-acid solvent Accessible Surface Area (ASA) within proteins has an R² = 84.7%. The ASA is the average area of each amino acid exposed to water in globular proteins. As our hydrophobicity proclivity scale approaches 1 (i.e. hydrophilic) the ASA goes up, as would be expected; conversely, as our scale approaches 0 (i.e. hydrophobic) the ASA goes down²².

The amino-acid Accessible Surface Area (ASA) has long been suggested as a reasonably accurate proxy for hydrophobicity^7,17,22 as is also seen in a related scale, the Solvent Exposed Area > 30 square angstroms^17,18. The amino-acid property classes are vector sets of clusters/linear families of curves in multiple linear regression relationships between two (or more) amino-acid physico-chemical properties. The first two columns (ASA and Gtow) represent paired variable linear regressions and the third column (Property Classes #1 to #8) and fourth column (Property Class #1 to #4, AA area/specific volume⁶ and specific absolute entropy^4,5) represent multiple linear regressions.

The R² in the first two columns of Table 2 represent linear regression results between the Y (row) vectors and the X (column) vectors. The R² in the last two columns of Table 2 derive from multiple linear regressions, where the independent (X) variables are vectors of amino-acid property Classes (PC) and/or amino-acid physico-chemical properties, and each row parameter is the dependent variable, respectively. Again, the Property Classes can be thought of as distinct subsets of amino-acids representing multiple linear series/clusters (within scatter plots or multiple linear regressions) of amino-acids in reference regressions associated with X variable vectors from some key physico-chemical metrics plotted against the hydrophobicity proclivity vector scale.

In Table 2, we see that the F and P Gtow scale performs as well (i.e. high R²) as the best of the hydrophobicity scales within columns 1, 2 and 4, further justifying our selection of the Gtow scale as our baseline standard for a free energy of transfer from an aqueous solvent environment to a non-aqueous solvent. The SEA > 30 Å² scale does as well as the popular hydrophobicity scales in Table 2 and correlates well with the F and P Gtow scale in column two. This establishes a direct link between the F and P Gtow scale and the free energy of burial of amino-acids in proteins, and provides strong evidence justifying a solvent-solvent transfer model for protein folding.

The Tang Q and Neumaier X scales are the top performing individual hydrophobicity scales as seen in the first two column results, followed on average by the Rose scale. The Juretic Avg scale generally performs as well as the five popular hydrophobicity scales in columns one and two, but more importantly it performs better than any other single hydrophobicity scale except for the Tang Q and Neumaier X scales in columns three and four. Since we consider columns three and four to be a more rigorous test for a robust, high performance hydrophobicity scale, we see the justification for selecting the Tang Q, Neumaier X and Juretic Avg as the scales from which to prepare our hydrophobicity proclivity (3H) scale. Our hydrophobicity proclivity scale performs basically as well as the best individual hydrophobicity scales in columns one and two, but it is the top performer in columns three and four. No other hydrophobicity scale that we evaluated performed as well on average (i.e. magnitude of R²) in regression comparisons with amino-acid physico-chemical properties as our hydrophobicity proclivity scale.

In Table 2, column three uses eight sets of numbers (vectors), dubbed property classes, as the eight X vectors in the multiple linear regressions whose R² values are shown in that column. These eight property class vectors can form multi-linear regression fits with very high R² with a large number of the physico-chemical properties of the 20 amino-acids in our accumulated AA physico-chemical property database, thereby serving as proxies for these properties. In Table 2 column four, we see four property class vectors (#1-#4) and two AA physico-chemical property vector scales (surface area/specific volume, specific absolute entropy); column four is included to illustrate the method of construction of the eight Property Class (PC) vectors represented by column 3.
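
The multiple linear regressions behind columns three and four of Table 2 can be sketched as follows. To stay short, the sketch regresses the Table 1 hydrophobicity values on only two of the eight property class vectors (PC 1 and PC 2 from Table 3), so it is not expected to reproduce the R² values reported in Table 2. The helper names and the plain normal-equations solver are illustrative assumptions rather than the procedure actually used in this study.

```python
# residue: (H from Table 1, PC 1 from Table 3, PC 2 from Table 3)
DATA = {
    "A": (0.4925, 0, 1), "C": (0.2650, 0, 0), "D": (0.9157, 2, 1),
    "E": (0.8974, 2, 1), "F": (0.0688, 1, 0), "G": (0.6582, 0, 1),
    "H": (0.6763, 2, 1), "I": (0.0349, 0, 0), "K": (0.9651, 2, 1),
    "L": (0.0579, 0, 0), "M": (0.2213, 1, 0), "N": (0.8350, 2, 1),
    "P": (0.7123, 1, 1), "Q": (0.8692, 2, 1), "R": (0.9091, 3, 1),
    "S": (0.7074, 1, 1), "T": (0.6599, 1, 1), "V": (0.1427, 0, 0),
    "W": (0.3403, 2, 0), "Y": (0.4523, 2, 0),
}


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(predictors, y):
    """R^2 of an ordinary least-squares fit of y on the predictors plus an intercept."""
    rows = [[1.0] + [float(v) for v in p] for p in predictors]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


h = [row[0] for row in DATA.values()]
pcs = [row[1:] for row in DATA.values()]
r2 = r_squared(pcs, h)
print(f"R^2 of H on PC 1 and PC 2: {r2:.3f}")
```

The same `r_squared` helper accepts any number of predictor columns, so further property class vectors or physico-chemical properties could be appended to each row, provided the resulting columns are not linearly dependent.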

Discussion

The great organizing principle embodied within the hydrophobicity proclivities (and implied by dpx) is that of a neo solvent-solvent partitioning effect, where the energetics of the solvent shell waters are the dominant effect in the energy balance. As with clathrates (ordered aqueous shells), which form spontaneously with hydrophobic molecules, a solvent shell of ordered waters forms spontaneously around solvated globular proteins. However, there is a confounding factor in trying to obtain an accurate hydrophobicity proclivity: even the most hydrophobic residue will have some average solvent exposed area, so it is reasonable to postulate that there is some functional reason for exposure of some grease to the solvent. The presence of hydrophobic surface area causes an aqueous clathrate shell to form at that point, perhaps effectively becoming part of the folded structure of the protein, possibly as a retaining structural element operating through surface tension and putting the interior of the globular protein under pressure.

Amino-acid hydrophobicity is critical to the structure and function of globular proteins, and hence to the function and survival of cells, a reality that is even reflected in the very structure of the standard genetic code. The amino-acid codons are arranged/coded in such a way as to reflect the underlying hydrophobicity of the respective amino-acids. A careful analysis reveals that the genetic code has a built-in redundancy through amino-acid hydrophobicity (in addition to codon redundancy), such that point mutations in a codon that yield a different codon tend to result in an amino-acid with similar hydrophobicity. It has been shown that the underlying amino-acid codon structure has a direct relationship with high quality hydrophobicity scales that are published in the literature²³.
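
The point-mutation argument can be explored with a short sketch. It builds the standard genetic code, enumerates every single-base substitution that changes one amino acid into another, and compares the mean hydrophobicity difference of those pairs (using the H values of Table 1) with the mean over all pairs of distinct amino acids. This is a crude, unweighted comparison written for illustration; it is not the analysis of reference 23, and no particular outcome is asserted here.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Hydrophobicity proclivities H copied from Table 1.
H = {
    "F": 0.0688, "L": 0.0579, "I": 0.0349, "M": 0.2213, "V": 0.1427,
    "P": 0.7123, "T": 0.6599, "S": 0.7074, "A": 0.4925, "Y": 0.4523,
    "H": 0.6763, "Q": 0.8692, "N": 0.8350, "K": 0.9651, "D": 0.9157,
    "E": 0.8974, "C": 0.2650, "W": 0.3403, "R": 0.9091, "G": 0.6582,
}


def neighbour_differences():
    """|H| differences for single-base substitutions that change the amino acid."""
    diffs = []
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                other = CODON_TABLE[mutant]
                if other != "*" and other != aa:
                    diffs.append(abs(H[aa] - H[other]))
    return diffs


def all_pair_differences():
    """|H| differences over all ordered pairs of distinct amino acids."""
    return [abs(H[a] - H[b]) for a in H for b in H if a != b]


neighbours = neighbour_differences()
everything = all_pair_differences()
mean_neighbour = sum(neighbours) / len(neighbours)
mean_all = sum(everything) / len(everything)
print(f"mean |dH|, single-base substitutions: {mean_neighbour:.3f}")
print(f"mean |dH|, all amino-acid pairs:      {mean_all:.3f}")
```

If point mutations do tend to conserve hydrophobicity on this scale, the first mean should come out smaller than the second.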

A legitimate question about the hydrophobic proclivity scale we have described is why our scale is superior to alignment score matrices such as PAM (Point Accepted Mutation)²⁴, BLOSUM (BLOck SUbstitution Matrix)²⁵ or Gonnet²⁶ that continue to be used for multiple protein alignments and database search alignments. There are indeed several practical and theoretical problems with the use of these log odds score matrices for the alignment of divergent protein sequences. For example, BLAST and several of the major multi-sequence alignment programs like Clustal W use particular BLOSUM matrices as the default. BLAST uses BLOSUM62 as the default. Quotes from select papers have been summarized below to more clearly illustrate these problems.

The substitution matrices used by the alignment programs are generally log of Bayesian probabilities for two amino-acids I and J of the form:

$Q_{ij} = \mathrm{prob}(A/B) = \frac{\mathrm{prob}(I \rightarrow J)}{\mathrm{prob}(I \text{ and } J)} = \frac{\mathrm{prob}(I \rightarrow J)}{\mathrm{prob}(I) \cdot \mathrm{prob}(J)}$

The probability of occurrence of the 20 primary amino acids is not the same throughout the domains/kingdoms of life, so this mathematical formulation can cause issues for identifying and aligning homologous proteins.
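
A small numerical sketch shows why background composition matters for a log-odds score of this form. All frequencies below are hypothetical round numbers chosen for illustration, and the use of a base-2 logarithm is an assumption; published matrices apply their own scaling.

```python
import math


def log_odds(p_pair, p_i, p_j):
    """Log (base 2) of prob(I -> J) / (prob(I) * prob(J))."""
    return math.log2(p_pair / (p_i * p_j))


# Hypothetical substitution probability for one amino-acid pair.
p_pair = 0.004

# Same pair scored against two hypothetical background compositions.
score_typical = log_odds(p_pair, p_i=0.05, p_j=0.05)  # ratio 1.6: favoured
score_biased = log_odds(p_pair, p_i=0.10, p_j=0.05)   # ratio 0.8: disfavoured

print(f"typical composition: {score_typical:+.2f}")
print(f"biased composition:  {score_biased:+.2f}")
```

With the substitution probability held fixed, doubling the background frequency of one residue is enough to turn a positive score into a negative one, which is the kind of composition dependence described above.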

Superimposed on the log of Bayesian probabilities formalism are evolutionary models derived from Markov stochastic processes (PAM), which imply a priori knowledge of the evolutionary amino-acid substitution rates. Necessarily, if one chooses PAM or BLOSUM, one must choose the matrix in the series that one believes is appropriate for the approximate evolutionary distance between the two protein sequences under analysis. This practice can be unduly restrictive if the evolutionary distances within the protein dataset being aligned are too great. The only assumption that we make with hydrophobicity and our new alignment algorithm is that nature will strongly tend to substitute similar amino-acids in order to preserve the overall function and structure of homologous proteins, and that it is possible to define a hydrophobicity distance that defines a fuzzy match between any two amino-acids, which is recognized as a “similarity match.”
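
The idea of a hydrophobicity distance and a fuzzy "similarity match" can be sketched as follows, using the H values of Table 1. The tolerance of 0.10, the function names and the two short example sequences are hypothetical; the sketch assumes the sequences are already aligned without gaps and does not represent the actual matching rule of the TMATCH algorithm, which is described elsewhere.

```python
# Hydrophobicity proclivities H copied from Table 1.
H = {
    "F": 0.0688, "L": 0.0579, "I": 0.0349, "M": 0.2213, "V": 0.1427,
    "P": 0.7123, "T": 0.6599, "S": 0.7074, "A": 0.4925, "Y": 0.4523,
    "H": 0.6763, "Q": 0.8692, "N": 0.8350, "K": 0.9651, "D": 0.9157,
    "E": 0.8974, "C": 0.2650, "W": 0.3403, "R": 0.9091, "G": 0.6582,
}


def is_similar(a, b, tol=0.10):
    """Fuzzy match: two residues are 'similar' if their H values differ by <= tol."""
    return abs(H[a] - H[b]) <= tol


def match_fractions(seq_a, seq_b, tol=0.10):
    """Return (identity fraction, similarity fraction) for two pre-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    n = len(seq_a)
    identical = sum(a == b for a, b in zip(seq_a, seq_b))
    similar = sum(is_similar(a, b, tol) for a, b in zip(seq_a, seq_b))
    return identical / n, similar / n


# Hypothetical aligned fragments with no identical positions.
identity, similarity = match_fractions("LKDSF", "IREPW")
print(f"identity {identity:.0%}, hydrophobic similarity {similarity:.0%}")
```

In this toy case the two fragments share no identical residues, yet four of the five positions fall within the chosen tolerance, which illustrates how a hydrophobicity distance can carry information that percent identity does not.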

Table 3. Property Class Index Vectors #1 - #8.

Residue	PC 1	PC 2	PC 3	PC 4	PC 5	PC 6	PC 7	PC 8
A	0	1	1	2	2	1	2	2
C	0	0	2	2	4	1	2	2
D	2	1	1	3	0	1	1	0
E	2	1	1	3	1	0	0	0
F	1	0	0	0	3	1	2	1
G	0	1	2	3	1	1	2	1
H	2	1	0	2	2	1	2	1
I	0	0	1	0	4	0	2	3
K	2	1	0	3	2	1	1	0
L	0	0	2	1	4	1	3	3
M	1	0	1	1	3	1	2	2
N	2	1	1	3	0	1	1	0
P	1	1	1	3	3	0	1	1
Q	2	1	1	3	1	0	0	0
R	3	1	0	3	1	1	1	0
S	1	1	2	3	1	1	2	1
T	1	1	1	3	2	0	1	1
V	0	0	1	1	3	0	2	3
W	2	0	0	1	3	1	2	2
Y	2	0	1	2	2	0	1	1

We summarize the salient points regarding alignment matrices with quotes from four select literature articles below.

1. “The most common substitution matrices currently used (BLOSUM and PAM) are based on protein sequences with average amino acid distributions, thus they do not represent a fully accurate substitution model for proteins characterized by a biased amino acid composition”²⁷
2. “We have investigated patterns of amino acid substitution among homologous sequences from the three Domains of life and our results show that no single amino acid matrix is optimal for any of the datasets”²⁸
3. “Many phylogenetic inference methods are based on Markov models of sequence evolution. These are usually expressed in terms of a matrix (Q) of instantaneous rates of change but some models of amino acid replacement, most notably the PAM model of Dayhoff and colleagues, were originally published only in terms of time-dependent probability matrices (P(t)). Previously published methods for deriving Q have used eigen-decomposition of an approximation to P(t). We show that the commonly used value of t is too large to ensure convergence of the estimates of elements of Q. We describe two simpler alternative methods for deriving Q from information such as that published by Dayhoff and colleagues.”²⁹
4. These authors note another interesting problem with the residue substitution rates used in the Q matrix: “Because different local regions such as binding surfaces and the protein interior core experience different selection pressures due to functional or stability constraints, we use our method to estimate the substitution rates of local regions. Our results show that the substitution rates are very different for residues in the buried core and residues on the solvent-exposed surfaces.”³⁰

Tomii and Kanehisa⁵ essentially conclude that in the “evolutionary” limit, alignment/mutation matrices reflect hydrophobicity and amino-acid secondary group size. For example, when the correlation coefficient between the PAM matrices and a hydrophobicity scale together with an amino-acid secondary group size is plotted against the PAM distance, it increases monotonically from 0.58 at a PAM distance near zero to an asymptotic limit of about 0.73 at a PAM distance of 200⁵.

Conclusion

The amount of information available to an alignment algorithm is essential to its ability to find matching proteins, especially matches with remote homologies where the percentage identity has dropped off to around 20–25%. In this study we have sought to find an optimal hydrophobicity scale that would reflect the real properties of amino-acids within the context of folded proteins. We contend that hydrophobic proclivities transcend mere statistical trends and reflect the functional necessities of globular proteins, expressed through amino acid properties according to a solvent-solvent (water → interior of a folded protein) partitioning model. Within this model the primary driving force is that of water-water attractions that exceed water-amino acid attractions. Hydrophobicity is not a force that repels amino acids from water; rather, water molecules attract each other more. When hydrophobic amino acids are exposed to water, clathrate shells spontaneously form at those areas, creating an anchored aqueous patch of ordered water molecules with surface tension. Thus, the preferred hydrophobicity scale of hydrophobic proclivities as we have described here provides significant new information to alignment algorithms, and in particular to our TMATCH algorithm (described elsewhere), which is optimized to work with our hydrophobicity proclivity scale.

Author contributions

DC arrived at the hydrophobicity index several years ago after an exhaustive look at the literature and through extensive regression analysis of several published values of amino acid properties in proteins and how they may contribute to the structure of proteins in solution. This paper is the result of a long collaborative effort with KC whose interests were also in understanding protein structure and search algorithms in bioinformatics.

Competing interests

We declare that there are no competing interests for DC or KKC that have influenced the content of this article.

Grant information

The author(s) declared that no grants were involved in supporting this work.

Acknowledgements

We (DC and KKC) appreciate our discussions about proteins and their solution structure with Dr. Joseph Ng (Department of Biological Sciences) and Dr. John Shriver (Department of Chemistry) of the University of Alabama in Huntsville.

References

1. Cornette JL, Cease KB, Margalit H, et al.: Hydrophobicity scales and computational techniques for detecting amphipathic structures in proteins. J Mol Biol. 1987; 195(3): 659–685.
2. Li H, Tang C, Wingreen NS: Nature of driving force for protein folding: A result from analyzing the statistical potential. Phys Rev Lett. 1997; 79: 765–768.
3. Rose GD, Geselowitz AR, Lesser GJ: Hydrophobicity of amino acid residues in globular proteins. Science. 1985; 229(4716): 834–838.
4. Kawashima S, Ogata H, Kanehisa M: AAindex: Amino Acid Index Database. Nucleic Acids Res. 1999; 27(1): 368–369.
5. Tomii K, Kanehisa M: Analysis of amino acid indices and mutation matrices for sequence comparison and structure prediction of proteins. Protein Eng. 1996; 9(1): 27–36.
6. Creighton TE: Proteins: Structure and Molecular Properties. WH Freeman and Company. 2 edition. 1993.
7. Karplus PA: Hydrophobicity regained. Protein Sci. 1997; 6(6): 1302–1307.
8. Cavanaugh DP, Sternberg RV: Analysis of morphological groupings using anopa, a pattern recognition and multivariate statistical method: A case study involving centrarchid fishes. J Biol Syst. 2004; 12(2).
9. Neumaier A, Huyer W, Bornberg-Bauer E: Hydrophobicity analysis of amino acids. 1999.
10. Juretic D, Jeroncic A, Zucic D: Sequence analysis of membrane proteins with the web server split. Croat Chem Acta. 1999; 72(4): 975–997.
11. Engelman DM, Steitz TA, Goldman A: Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins. Annu Rev Biophys Biophys Chem. 1986; 15(1): 321–353.
12. Hopp TP, Woods KR: Prediction of protein antigenic determinants from amino acid sequences. Proc Natl Acad Sci U S A. 1981; 78(6): 3824.
13. Kyte J, Doolittle R: A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982; 157(1): 105–132.
14. Eisenberg D, Weiss RM, Terwilliger CT, et al.: Hydrophobic moments and protein structure. Faraday Symp Chem Soc. 1982; 17: 109–120.
15. Janin J: Surface and inside volumes in globular proteins. Nature. 1979; 277(5696): 491–492.
16. Chothia C: Hydrophobic bonding and accessible surface area in proteins. Nature. 1974; 248(446): 338–339.
17. Bordo D, Argos P: Suggestions for "safe" residue substitutions in site-directed mutagenesis. J Mol Biol. 1991; 217(4): 721–729.
18. Online. Solvent accessibility. [Online Data]. Bordo Table 2: Solvent Exposed Area > 30 square angstroms calculated from data taken from 55 proteins in the Brookhaven data base, coming from 9 molecular families: globins, immunoglobulins, cytochromes c, serine proteases, subtilisins, calcium binding proteins, acid proteases, toxins and virus capsid proteins.
19. Fauchere JL, Pliska VE: Amino acid scale: Hydrophobicity scale. Eur J Med Chem. 1983; 18: 369–375.
20. Pintar A, Carugo O, Pongor S: Atom depth in protein structure and function. Trends Biochem Sci. 2003a; 28(11): 593–7.
21. Pintar A, Carugo O, Pongor S: Atom depth as a descriptor of the protein interior. Biophys J. 2003b; 84(4): 2553–61.
22. Moelbert S, Emberly E, Tang C: Correlation between sequence hydrophobicity and surface-exposure pattern of database proteins. Protein Sci. 2004; 13(3): 752–762.
23. Trinquier G, Sanejouand YH: Which effective property of amino acids is best preserved by the genetic code? Protein Eng. 1998; 11(3): 153–169.
24. Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure. 1978; 5(3): 345–352.
25. Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 1992; 89(22): 10915–9.
26. Gonnet GH, Cohen MA, Benner SA: Exhaustive matching of the entire protein sequence database. Science. 1992; 256(5062): 1443–5.
27. Brick K, Pizzi E: A novel series of compositionally biased substitution matrices for comparing Plasmodium proteins. BMC Bioinformatics. 2008; 9: 236.
28. Keane TM, Creevey CJ, Pentony MM, et al.: Assessment of methods for amino acid matrix selection and their use on empirical data shows that ad hoc assumptions for choice of matrix are not justified. BMC Evol Biol. 2006; 6: 29.
29. Kosiol C, Goldman N: Different versions of the Dayhoff rate matrix. Mol Biol Evol. 2005; 22(2): 193–9.
30. Tseng YY, Liang J: Estimation of amino acid residue substitution rates at local spatial regions and application in protein function inference: a Bayesian Monte Carlo approach. Mol Biol Evol. 2006; 23(2): 421–436.

Comments on this article (1)

VERSION 2 PUBLISHED 15 Oct 2020

VERSION 1 PUBLISHED 21 Oct 2015

Discussion is closed on this version, please comment on the latest version above.

Author Response 05 Oct 2020

David Cavanaugh, Benchmark Electronics, Huntsville, Alabama, USA

I am one of the authors of this paper and wanted to share an update. Revision 2 of this paper, which reflects the reviewer comments, is now in press at F1000Research. A follow-up paper on the subject of hydrophobicity has been published on ChemRxiv:

https://chemrxiv.org/authors/David_Cavanaugh/8853095
Competing Interests: No competing interests were disclosed.

Article Versions (2)

version 2

Revised

Published: 15 Oct 2020, 4:1097

https://doi.org/10.12688/f1000research.6348.2

version 1

Published: 21 Oct 2015, 4:1097

https://doi.org/10.12688/f1000research.6348.1

© 2015 Cavanaugh D and Chittur K. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

how to cite this article

Cavanaugh D and Chittur K. A hydrophobic proclivity index for protein alignments [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2015, 4:1097 (https://doi.org/10.12688/f1000research.6348.1)

NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Reviewer Report 25 Jul 2016

Ana Jerončić, Department of Research in Biomedicine and Health, University of Split School of Medicine, Split, Croatia

Not Approved

https://doi.org/10.5256/f1000research.6806.r14649

The authors aimed to develop a hydrophobicity scale that optimally reflects the properties of amino acids and amino acid residues that are relevant for folded proteins. The long-term goal was to use such a scale to estimate hydrophobicity-based protein sequence relatedness from a global alignment in order to improve identification of structural homologs with less than 25% sequence identity.

The authors have identified eight properties (so called property classes) of amino acids relevant for folded proteins by recognizing distinct patterns in a series of scatter-plots that plotted many hydrophobicity indices and other physico-chemical properties of amino acids/ amino acid residues against each other. Apparently, different property classes were visually recognized in scatter-plots after the “linear cluster of amino acids”-fingerprint was found (clusters of amino acids with similar physico-chemical properties that are aligned along distinct imaginary lines in a scatter-plot).

The final set of plots from which eight property classes were derived included: scatter-plots of the hydrophobicity scale that was developed by the authors versus a) the area per specific volume of each amino acid (property classes 1 and 2) or b) the specific absolute entropy (classes 3 and 4); c) the plot of delta G of burial of amino acid secondary group versus number of atoms in a group (class 5); and finally, ambiguously defined “classes #6, #7 and #8 were derived from 49 fundamental aminoacid properties and derived scales that are based upon an analysis with Analysis of Patterns (ANOPA)”.

It is this set of eight property classes that the authors have used as dependent variables in multiple linear regression (MLR) models of candidate hydrophobic scales. The goodness of fit of an MLR model (R² value) was used to identify the optimal hydrophobic scale(s) because, according to the authors, the MLR R² value represents a “rigorous test for a robust, high performance hydrophobicity scale”. The rationale for this assumption was that, for all tested hydrophobicity scales, the R² values of the MLR models were higher than the R² values of simple regression models using either Moelbert’s average amino-acid solvent Accessible Surface Area (ASA) or the Fauchere & Pliska free energy of amino-acid transfer from n-octanol to water (Gtow) as the dependent variable.

However, I disagree with the authors on the MLR R² rationale as I have concerns about appropriateness of data analysis (see below). In addition, the reporting in the manuscript should be substantially improved.

Major comments
Data analysis
Comment 1
Throughout the paper the description of the MLR models is very confusing, and it is not clear which models were actually run (what the dependent and independent variables were; in addition, the estimated coefficients and statistical significance of the independent variables were not shown for a single model). As already stated, it seems that MLR models were mainly used to model different hydrophobic scales using the eight property classes as independent variables. However, on page 7 the authors state “The amino-acid property classes are vector sets of clusters/linear families of curves in multiple linear regression relationships between two (or more) amino-acid physico-chemical properties.”, which is very confusing. It seems that either a few additional independent variables (amino-acid physico-chemical properties) were included in a model in addition to the eight property classes, or a multivariate regression model was used (if so, what were the dependent variables?).

If the authors indeed used MLR models as claimed, this means that for a model with 20 observations (20 amino acids) at least eight independent variables were included. Moreover, as the majority of these property classes were actually ordinal variables, the number of independent variables included in the MLR model should have been even higher (due to the introduction of dummy variables). Consequently, the sample size of these models was far too small to estimate the model parameters precisely, and the reported R² was quite inflated (due to an overfitted model, but also due to a large number of independent variables that the authors did not adjust for when comparing simple linear regression models and MLR). In addition, there was a problem of multicollinearity between independent variables, which further inflated the MLR R² (based on Table 2, the Kendall tau coefficient between, e.g., PC2 and PC4 is 0.82, P<0.001). Therefore the main result of this paper, which rests on the assumption that the property classes of amino acid residues identified through ‘linear-clusters’ represent “real properties of amino-acids within context of folded proteins”, is not based on a validated assumption.
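The overfitting concern raised in this comment can be illustrated with a minimal sketch. It assumes ordinary least squares with an intercept, 20 observations and up to 8 predictors, and it uses pure random noise for both the response and the predictors, so none of the numbers relate to the authors' data. The function names are hypothetical. The sketch shows that R² cannot decrease as predictors are added to a nested model, while the adjusted R² penalizes the extra parameters.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R^2, which penalizes each additional predictor."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# ASSUMPTION: random noise stands in for a scale (y) and 8 property classes (X).
rng = np.random.default_rng(0)
n_obs, n_predictors = 20, 8
y = rng.normal(size=n_obs)
X = rng.normal(size=(n_obs, n_predictors))

# Fit nested models with 1, 2, ..., 8 noise predictors.
r2_by_k = [r_squared(X[:, :k], y) for k in range(1, n_predictors + 1)]
adj_by_k = [adjusted_r_squared(r2, n_obs, k)
            for k, r2 in enumerate(r2_by_k, start=1)]

for k, (r2, adj) in enumerate(zip(r2_by_k, adj_by_k), start=1):
    print(f"k={k}: R2={r2:.3f}  adjusted R2={adj:.3f}")
```

Because the predictors carry no information about the response, any rise in R² across the printed rows comes only from the added parameters.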

Comment 2
The authors have used a measure of the goodness of fit of a regression model, R², to compare the strength of the linear relationship(s) modelled in different regression models (including simple and multiple linear regression models). However, R² is an overused statistic in linear regression analysis, and additional metrics are required to get the whole picture. In particular, it is the Pearson correlation coefficient between paired data (e.g. two hydrophobicity scales) that quantifies the degree to which two variables are related and is a proper statistical measure of the strength of a linear relationship. Linear regression models find the best-fit line that predicts the dependent variable from the independent variable(s), with R² representing the squared Pearson correlation between the fitted values and the observed values.
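The last statement can be checked numerically. The following is a small sketch on synthetic data (the coefficients and noise level are arbitrary assumptions), showing that for a least-squares fit with an intercept the R² value equals the squared Pearson correlation between fitted and observed values.

```python
import numpy as np

# ASSUMPTION: synthetic data, 20 observations and 3 predictors.
rng = np.random.default_rng(1)
n_obs = 20
X = rng.normal(size=(n_obs, 3))
y = X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.5, size=n_obs)

# Ordinary least squares with an intercept column.
A = np.column_stack([np.ones(n_obs), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ coef

# R^2 from the residual and total sums of squares.
r2 = 1.0 - ((y - fitted) ** 2).sum() / ((y - y.mean()) ** 2).sum()

# Pearson correlation between fitted and observed values.
pearson = np.corrcoef(fitted, y)[0, 1]

print(r2, pearson ** 2)
```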

Reporting
Comment 3
A reader should know precisely which scatter plots were screened for the “linear-cluster” pattern. This means that the entire set of hydrophobic scales and other physicochemical properties of amino acid residues/amino acids that were collected from the literature and used to generate these plots should be listed in the paper. Also, it should be specified how many scatter plots were finally generated (for example, N*(N-1)/2, where N is the number of hydrophobic scales or physicochemical amino acid properties that were collected, ...).
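As a quick worked example of the N*(N-1)/2 count, the sketch below enumerates all unordered pairs of scales. The value N = 49 is only an illustrative assumption (it matches the number of properties mentioned for the ANOPA analysis, not necessarily the number of scales screened), and the scale names are placeholders.

```python
from itertools import combinations
from math import comb

# ASSUMPTION: 49 placeholder scale names, used only to illustrate the count.
scales = [f"scale_{i}" for i in range(1, 50)]

# Every unordered pair of scales corresponds to one candidate scatter plot.
pairs = list(combinations(scales, 2))

n = len(scales)
print(len(pairs), n * (n - 1) // 2, comb(n, 2))
```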

Comment 4
Since the eight property classes of amino acid residues are the most important novelty of this paper, the process of their identification should be clearly described in a sufficient detail. In particular:

What was the reasoning behind the assumption that the property classes of amino acid residues identified through ‘linear-clusters’ represent “real properties of amino-acids within context of folded proteins”? Or was there no such assumption, and the fact that the regression models of all hydrophobicity scales exhibited the highest R² values when these property classes were used as independent variables was taken to justify this interpretation? If the latter was the case, such reasoning would not be justified (see comments on multiple linear regression analysis).
The authors should describe the method they used to identify linear clusters on a plot (e.g. visual identification, followed by analysis of amino acid physicochemical/biochemical properties in clusters and regression analysis of clusters that confirmed the cluster status, or something else).
How did the authors end up with the final set of 6 (or 3?) scatter plots from which they derived their property classes? Were “linear-clusters” identified only in these plots, or did the authors select the final plots based on the relevance of the plotted variables in folded proteins? If the latter was true, what criteria did they use to identify the most relevant scatter plots?
All property classes, including classes #6, #7, and #8, should be precisely defined. The description “classes #6, #7 and #8 were derived from 49 fundamental amino acid properties and derived scales that are based upon an analysis with Analysis of Patterns (ANOPA)” is unacceptable. Which of the 49 fundamental amino acid properties and their derived scales were used, and how, to identify property classes #6 to #8?
Scatter plots that were used for the generation of classes 5 to 8 should be shown.

Comment 5
Comments on the reporting style

The Introduction section is quite short – the authors should elaborate more on relevant physico-chemical properties of amino acids and their importance in protein folding in this section.
There are parts of the Introduction in the Results section (the first paragraph) and the Discussion section (alignment matrices).
The hydrophobic scale that was chosen as the optimal one was the normalized average of three published hydrophobicity scales that were found “most robust in correlation analyses”, with robustness vaguely defined in the Methods section as associations to “multiple fundamental properties of the 20 natural amino acids using multi-variate statistical procedures, thermodynamics and biophysical chemistry considerations”. It is only later, at the very end of the Results section, that one finds out that “robust” scales are actually those whose MLR models using property classes as dependent variables exhibited the highest R² values. The Methods section should be written more clearly.
Table 2 – The labelling of Table 2 should be improved as authors keep explaining what is presented in which column of the Table 2 throughout the Results section.

Competing Interests: No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.

Author Response 15 Oct 2020

David Cavanaugh, Benchmark Electronics, Huntsville, Alabama, USA

Ana Jeroncic

"... Ambiguously defined classes #6, #7 and #8 were derived from 49 fundamental amino-acid properties and derived scales that are based upon an analysis with Analysis of Patterns (ANOPA) ..."

The 49 AA properties are represented by:

2 sequence frequency scales

6 secondary structure propensity scales

8 hydrophobicity scales

4 free energy scales (in water, protein folding/unfolding)

9 HPLC retention time scales

4 probability of an AA inside a folded protein core or on the outside

7 molecular property scales

9 physical (measured) property scales

The ANOPA method is a pattern recognition and pattern projection method. The ANOPA procedure projects n-space pattern points/vectors into a 3-dimensional object, which is a cylinder. The axis of the cylinder, which is termed the relation vector, runs from the pattern point centroid (the averages of each AA property over all points) to an outgroup average point (the averages of each AA property over a pair of selected points). The outgroup pair is selected on the basis of a histogram of the pattern point Euclidean distances from the pattern point centroid. Two distances are calculated for each pattern point by projecting it onto the relation vector: the distance along the relation vector and the distance from the relation vector. The angle of rotation of each pattern point projection about the relation vector is also calculated. Thus, a cylindrical coordinate system is formed, which is then converted into rectilinear coordinates. The X’ and Y’ rectilinear coordinates are formed from each pattern point’s distance from the relation vector times the cosine and sine, respectively, of its angle of rotation. The Z’ coordinate is simply the distance of the projection intersection along the relation vector.
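The projection described above can be sketched roughly as follows. This is not the ANOPA implementation; it is an illustrative approximation under two stated assumptions that the text does not settle: the outgroup pair is taken to be the two points farthest from the centroid (the text says it is chosen from a distance histogram), and the angle of rotation is measured against an arbitrary reference direction perpendicular to the relation vector, which restricts it to the range 0 to π. The function name and the random input data are hypothetical.

```python
import numpy as np

def anopa_like_projection(points):
    """Map n-dimensional pattern points to (X', Y', Z') via a cylinder axis."""
    P = np.asarray(points, dtype=float)
    centroid = P.mean(axis=0)

    # ASSUMPTION: outgroup pair = the two points farthest from the centroid.
    dist_from_centroid = np.linalg.norm(P - centroid, axis=1)
    outgroup = P[np.argsort(dist_from_centroid)[-2:]].mean(axis=0)

    # Relation vector: unit axis from the centroid to the outgroup average.
    axis = outgroup - centroid
    axis = axis / np.linalg.norm(axis)

    rel = P - centroid
    z = rel @ axis                       # distance along the relation vector
    perp = rel - np.outer(z, axis)       # component perpendicular to the axis
    d = np.linalg.norm(perp, axis=1)     # distance from the relation vector

    # ASSUMPTION: the reference direction for the rotation angle is the
    # perpendicular component of the point farthest from the axis.
    ref = perp[np.argmax(d)] / d.max()
    cos_a = np.divide(perp @ ref, d, out=np.ones_like(d), where=d > 0)
    angle = np.arccos(np.clip(cos_a, -1.0, 1.0))

    # Cylindrical (d, angle, z) to rectilinear (X', Y', Z').
    return np.column_stack([d * np.cos(angle), d * np.sin(angle), z])

# Random placeholder data: 20 "amino acids" described by 49 "properties".
rng = np.random.default_rng(0)
data = rng.normal(size=(20, 49))
coords = anopa_like_projection(data)
print(coords.shape)
```

Since the perpendicular and axial components are orthogonal, X'² + Y'² + Z'² reproduces each point's squared distance from the centroid, which gives a simple sanity check on the conversion.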

There is a relatively flat and linear structure to the higher order correlation structure in the 49 AA property pattern space. The 3D ANOPA X’ coordinate has a strong linear relationship (R²=95.13%) with the angle of rotation (Ar) of the pattern point projections about the relation vector. Similarly, the 3D ANOPA Y’ coordinate has a strong linear correlation (R²=99.86%) with the d2 distances of the pattern points/vectors from the relation vector.

Each of the 3D ANOPA coordinates has meaningful interpretations/associations with amino-acid properties. The X’ coordinates form 3 good linear series when scatter plotted against AA refractivity [179] and AA mass/(area/volume) [192], and 4 linear series when plotted against the Tanford hydrophobicity scale [197]. The AA refractivity scale has a strong linear relationship with the AA mass/(area/volume) derived scale, except that glycine is off the regression line. The Y’ coordinates form 2 linear series when scatter plotted against the DNA/RNA numerically encoded (0-63) lexical table (UCAG, UCTG) averages and 3 linear series when scatter plotted against the AA property derived scale 2*H-B-DB. In the latter scale, H is our hydrophobicity proclivity scale, B is a beta sheet proclivity scale derived from several published statistical scales and DB is the double bend proclivity scale derived from several statistical scales in the literature. The 2*H-B-DB scale reliably predicts (a number of proteins were sampled) the presence (or propensity) of alpha helices (value>0) and beta sheets (value<0), with sinusoidal trends (running average of 3) seen when the primary sequence AA ordinal numbers are plotted along the X axis. Since the Y’ coordinate has a strong relationship with the 2*H-B-DB scale, which itself has a strong relationship with protein secondary structure, we see that the Y’ coordinate also has a relationship with protein secondary structure. We can also infer that there is a definite relationship between protein secondary structure and the DNA code, which is itself related to AA hydrophobicity, as we have documented in the hydrophobicity paper, and that the secondary structure proclivity scale is by definition related to hydrophobicity.
Finally, the 3D ANOPA coordinate Z’ strongly correlates with all of the AA hydrophobicity scales (and some of the AA HPLC scales) in our database, both those used in the 49 property ANOPA analysis and those that were not used in this analysis.
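The 2*H-B-DB composite with a running average of 3 can be sketched as below. The per-residue values for H, B and DB are random placeholders generated only so the code runs; they are not the authors' scales, and the function name and example sequence are hypothetical. Only the arithmetic (2*H - B - DB followed by a window-3 mean, with the sign read as helix or sheet propensity) follows the description above.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composite_profile(sequence, H, B, DB, window=3):
    """Running average of 2*H - B - DB along a protein sequence."""
    raw = np.array([2.0 * H[aa] - B[aa] - DB[aa] for aa in sequence])
    kernel = np.ones(window) / window
    # 'valid' keeps only windows that lie fully inside the sequence.
    return raw, np.convolve(raw, kernel, mode="valid")

# PLACEHOLDER scale values (random, NOT real proclivity scales).
rng = np.random.default_rng(0)
H = dict(zip(AMINO_ACIDS, rng.uniform(-1, 1, 20)))
B = dict(zip(AMINO_ACIDS, rng.uniform(-1, 1, 20)))
DB = dict(zip(AMINO_ACIDS, rng.uniform(-1, 1, 20)))

seq = "MKTAYIAKQR"  # arbitrary example sequence
raw, profile = composite_profile(seq, H, B, DB)

# Sign convention from the text: >0 helix propensity, <0 sheet propensity.
labels = ["helix" if v > 0 else "sheet" if v < 0 else "none" for v in profile]
print(list(zip(np.round(profile, 2), labels)))
```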

"... MLR R² value represents a rigorous test for a robust, high performance hydrophobicity scale ... The rationale for such an assumption was that for all tested hydrophobicity scales, the R² values of the Multiple Linear Regression (MLR) were higher than the R² values of the simple binary regression models ... I disagree with the authors on the MLR R² rationale as I have concerns about the appropriateness of the data analysis"

The argument presented is not that the MLR was superior to the binary regression because of higher correlation coefficients, although that may possibly be true, but rather that the AA property class MLR represented more information, in that it represented the behaviors of amino acids in a larger series of contexts and properties. Also, the argument implied by having an MLR with the 8 property class scales is that for each context the amino acids partition into sub-sets, and that these different context sub-sets join together to determine amino-acid behavior in protein folding, interactions with water, interactions with membranes, secondary structure and electronic behaviors associated with AA-AA interactions. Moreover, we argue that within any given regression, the higher the correlation coefficient, the stronger the evidence for the superiority of a given hydrophobicity scale within that regression criterion. Top performance of a hydrophobicity scale across several regression relationship criteria based upon different property information leads to a performance/robustness conclusion based upon the consilience of the evidence.

We use the coefficient of determination (R²) rather than the correlation coefficient because it is a much more conservative statistic than the correlation coefficient (R) and can meaningfully be interpreted as the percent linearity between the dependent variable and the X independent variable(s) in the regression. Concerns might be raised about two points in the property class MLR: the possibility of inter-correlation between X variables and the loss of degrees of freedom in the regression statistics (over-fitting). We deal with these concerns in several ways. First, we calculate a T test significance only for the regression itself and not for any individual X variable regression coefficient, since the Y variable regression performance itself is what is being measured. Second, the number of degrees of freedom (20) in the correlation coefficient T test (the null hypothesis being that R=0) is reduced by 2 for each X variable coefficient in the regression and by 1 for the Y axis intercept. In the AA property class MLR we calculate the resulting degrees of freedom in the regression and correlation coefficient T test as 20-(2*8+1) = 3. We allow for some dimensionality reduction due to inter-X pair correlation as long as the dimensionality reduction (i.e. the percent reduction in the number of states with discrete variables) is not large and new information is being brought forth between each pair of X variables. Where no new information is adduced by the inclusion of an X variable, we find that the magnitudes of the correlation coefficients are reduced, and they are significantly reduced further by the squaring process that yields the coefficient of determination. Furthermore, we demand that any set of property class vectors produce large, significant coefficients of determination with a large number of disparate AA property scales in our database.
The latter criterion produces a very, very high degree of statistical confidence that the collection of property class scales can serve as a highly robust basis set for MLR vector regressions that can evaluate the robustness/significance of individual AA property scales.
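The degrees-of-freedom arithmetic in this response can be written out as a short sketch. The rule of subtracting 2 per X variable plus 1 for the intercept is the rule stated in the response; the conventional residual degrees of freedom for least squares (n - p - 1) is shown only for comparison. The t statistic uses the standard formula for testing a correlation coefficient against zero, applied here with the reduced degrees of freedom as an assumption about how the test is carried out. The function names and the value R = 0.9 are hypothetical.

```python
import math

def reduced_dof(n_obs, n_predictors):
    """Degrees of freedom under the rule stated in the response."""
    return n_obs - (2 * n_predictors + 1)

def residual_dof(n_obs, n_predictors):
    """Conventional residual degrees of freedom for OLS with an intercept."""
    return n_obs - n_predictors - 1

def correlation_t_statistic(r, dof):
    """t statistic for the null hypothesis R = 0 with the given dof."""
    return r * math.sqrt(dof / (1.0 - r * r))

n_obs, n_predictors = 20, 8
dof_reduced = reduced_dof(n_obs, n_predictors)        # 20 - (2*8 + 1)
dof_conventional = residual_dof(n_obs, n_predictors)  # 20 - 8 - 1

# ASSUMPTION: R = 0.9 is an arbitrary illustrative correlation coefficient.
t_reduced = correlation_t_statistic(0.9, dof_reduced)
t_conventional = correlation_t_statistic(0.9, dof_conventional)
print(dof_reduced, dof_conventional, t_reduced, t_conventional)
```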

"... the reporting in the manuscript should be substantially improved ..."

We agree and are revising the manuscript accordingly.

"Throughout the paper the description of the MLR models is very confusing ... it is not clear what models were actually run ... what was the dependent and independent variables ... what was the estimated regression coefficients and their statistical significance ..."

We have addressed these comments to some extent above. Additionally, we will move all of the related discussion of regression methodology into the Methods section, where the discussion will make more sense. The key point is that the MLR X’s are not single variables, but rather column vectors, with each column vector having a value for each amino acid. In this context we equate the concept of a scale with the concept of a column vector. Having made this distinction, the concept of a Multiple Linear Regression (MLR) is still valid and simply represents a broader mathematical context. Our use of the MLR methodology with the 8 property class scales is to evaluate a correlation relationship as a whole, so the entire regression correlation is what is used and what is important, and the statistical significance of individual variable regression coefficients is meaningless. If the MLR method were being used to fit a variable for the purpose of extrapolating new Y values with scale values outside of the basis set, such as with new amino acids, then the statistical significance of the variable coefficients would be germane.

"eight property classes ... vector sets of clusters/linear families of curves in multiple linear regression relationships between two (or more) amino-acid physico-chemical properties .. which is very confusing ... seems like more independent variables are being added than the eight property class variables ..."

One must keep in mind that the MLR method that we are using uses vectors (scales) of dimensionality 20 (one value for each AA) and not individual variables. When making pairwise scatter plots of the AA property scales selected for our database, we often find clustering behavior, which indicates that the amino acids are partitioning into distinct sub-sets, and each amino acid can be assigned a numerical value for the ordinal number of the sub-set into which it partitions. The clustering relationships uncovered for any particular pair of AA property scales are generally driven by molecular geometry, secondary group geometry, the numbers and types of atoms and inter-atomic bonds in the secondary group, molecular size, molecular mass, molecular volume, molecular surface area and entropy distribution about the molecule. No given pair of AA property scales represents the full range of these molecular properties and the nature of the interaction with water and cellular membranes. The key point, though, is that there are distinct sets and sub-sets which, when numerically encoded, can jointly describe each amino acid as a row vector of some dimensionality M. We have found through extensive analysis that M = 8 is a very good size. Within each property pair scatter plot where clustering occurs, we can have different patterns such as unstructured clusters, multiple quasi-parallel linear clusters or multiple linear clusters that intersect at some point (often at glycine or alanine). Generally, there is a geometric ordering that allows the assignment of a meaningful ordinal number. To reiterate, a property class scale (column vector) represents a relationship between property scales, and the values assigned within the scale to each amino acid are their respective sub-set/cluster ordinal numbers.
Property class scales assembled in this way can be used as one of the basis set vectors to be used to evaluate the performance and reliability of any individual amino-acid property scale. We also need to note that the three property class vectors defined between the 3D ANOPA coordinate system planes represent the correlation based dimensionality reduction from a pattern space of dimension 49 to a pattern space of dimension 3.

"... majority of these property classes were actually ordinal variables ... more actual variables should have been added owing to the introduction of dummy variables ... sample size of these models too small to estimate model parameters precisely ... "

In fact all of the numbers within the 8 property class scales are ordinal numbers representing real facts of clustering and the distinct geometrical ordering of the clusters. The practice of using assigned numbers to sets/sub-sets of objects, especially if the numbers can represent some form of ordering, is widely used in multi-variate statistical procedures such as MLR regression, discriminant analysis, Principal Component Analysis (PCA)and procedures cognate to PCA, as is found in fielids such as numerical taxonomy, computational biology and other fields that numerically encode state data. The authors have enjoyed great success in using discrete data to represent state data in a number of contexts over a number of years. Since the MLR regressions are not used for predictive purposes, but rather to express the degree of correlation and relatedness as expressed in the correlation coefficient and its T test based P value, the concept if regression coefficient statistical significance is a non-sequitur.

"... R² quite inflated ... over fitted model ... large number of independent variables that the authors did not adjust for compared to simple linear binary regression models ..."

The authors would point out that additional variables do not inflate R² values unless there is a very serious inter-variable correlation between a large sub-set of the X variables. To the contrary, in working with the amino-acid property scales that we have assembled, we find that there is generally an artificial deflation of R² relationships. Therefore, we do not agree that the introduction of variables into an MLR artificially inflates an MLR correlation versus a binary regression correlation. We adjust for these concerns in three ways. First, the coefficient of determination is a very conservative statistic, and by using it we screen out a number of potentially significant relationships. Second, we use a T test to assess the statistical significance of the whole regression relationship and reduce the number of degrees of freedom of the T test accordingly, as described above. Third, we uncover the specific physical meanings reflected in each property scale relationship cluster to ensure that the physical-chemical factors at play are distinct and bring new information to the table.

"... problem of multicollinearity between independent variables ... inflated MLR R² ... table 2 PC2 and PC4 had a Kendall tau coefficient of 0.82, P<0.001"

We note that property classes 1 and 2 devolve from the relationship between the hydrophobicity proclivity scale and the specific absolute entropy, and that property classes 3 and 4 devolve from the relationship between the hydrophobicity proclivity scale and the specific volume, as seen in figures 2 and 1 respectively. Both the specific absolute entropy and the specific volume are thermodynamic intrinsic property scales formed by normalizing the amino-acid molecular absolute entropies and molecular volumes by their molecular masses, which creates two secondary or derived scales from fundamental molecular properties that have the molecular mass in common. Furthermore, both the total entropy and the volume are related in some way to the molecular mass, so dividing by the molecular mass in both cases offsets that dependence and leaves an intrinsic average property. Property class 2 partitions into 2 linear subgroups on the basis of polar and non-polar amino-acids. Property class 4 partitions into 4 linear subgroups on the basis of size, relative amounts of non-polar area and the presence/absence of a relatively strong or weak polar group. While property class 4 does reflect the polar/non-polar distinction to some extent, it does not reflect a strong and clear partition of the AA by polarity or non-polarity; therefore new information is introduced by the addition of property class 4. We note that there are 2*4 = 8 possible combinations of states between property classes 2 and 4, but only 5 states are actually formed, giving a dimensionality reduction of 1 - 5/8 = 37.5%, which we consider to be acceptable. We do not see strong enhancement of MLR correlation coefficients in any statistically significant AA property regression with the 8 property class scales.
Given the nature of the interrelationship of fundamental molecular properties based upon AA molecular size, molecular geometry, entropy/entropy density, mass/mass density, surface area or electronic configuration/hybridization, some inter-property class correlation is inevitable and must be tolerated to make any progress in understanding amino-acid physico-chemical properties and protein folding. See the additional relevant discussion above.
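The state-counting argument for property classes 2 and 4 can be reproduced with a short sketch. The two ordinal vectors below are placeholders constructed only so that 5 of the 8 possible joint states occur; they are not the published class assignments, and `state_reduction` is a hypothetical helper name.

```python
# Minimal sketch: percent reduction in the number of joint states between
# two ordinal property class scales. The vectors are PLACEHOLDERS chosen so
# that 5 of the 2*4 = 8 possible states occur, as in the text.


def state_reduction(class_a, class_b):
    """Return (observed states, possible states, fractional reduction)."""
    possible = len(set(class_a)) * len(set(class_b))
    observed = len(set(zip(class_a, class_b)))
    return observed, possible, 1 - observed / possible


# 20 placeholder values each (one per amino acid).
pc2_like = [1] * 10 + [2] * 10                      # 2 sub-sets
pc4_like = [1] * 4 + [2] * 3 + [3] * 8 + [4] * 5    # 4 sub-sets

observed, possible, reduction = state_reduction(pc2_like, pc4_like)
print(f"{observed} of {possible} states, reduction = {reduction:.1%}")
```

A reduction of 0% would mean that the two classes combine freely, while a value approaching 1 - 1/2 = 50% for this pair would mean that one class adds almost no states beyond the other.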

"... main result of this paper ... based upon assumption that amino acid property classes forming linear clusters ... within context of folded proteins ... is not a valid assumption"

We agree that the 8 property class scales are one of the major results of this study, but we do not agree that they are the only one. We do not assume that the 8 amino-acid property class scales are a major determinant of protein folding per se; rather, we assume that the fundamental molecular properties of each amino-acid, in the context of a folded protein environment or of contact with a water environment, are what determine protein folding. The assemblage of the 8 amino-acid property class scales that we report reflects a widespread clustering pattern (sometimes in linear series) in 2D scatterplots and 3D ANOPA scatterplots between two or more scales. The 8 amino-acid property class scales we have assembled for our present study reflect the best of the clustering patterns, which form unique subsets of amino-acids in different contexts/molecular properties with regard to hydrophobicity scales, but also with a number of other amino-acid physico-chemical property scales, secondary structure statistical propensity scales and molecular property scales. To the extent that any given hydrophobicity scale correlates with, and was determined from, folded proteins, reflecting the secondary/super-secondary structure of folded proteins, the 8 property classes that we report are relevant to folded proteins. The 8 property classes that we report have statistically significant (P<0.05) MLR regressions with 124 amino-acid property scales in our assembled database, which implies that the behaviors reflected in these numerical scales can largely be reproduced from the clustering behaviors of the amino-acids.
While linking hydrophobicity to protein folding is an important idea, the finding that nature has selected the 20 natural amino-acids on the basis of distinct sub-sets/clustering in several different joint property dimensions is a significant finding in and of itself, perhaps leading to criteria for the engineered selection of un-natural amino-acids that could provide unique properties to active enzymes/proteins.

"... R² is an overused statistic for linear regression analysis and additional metrics are required to get the whole picture ...R² actually represents the square of the Pearson correlation coefficient"

We do not agree with the contention that the R² statistic is trivial and meaningless from overuse. Rather we point out that R² is a more conservative statistic than is the correlation coefficient and rolls off much more quickly, as well as having an important interpretation as the percent linearity that directly indicates the relative scatter of the regression calculated Y’s with respect to the actual Y’s.

"... A reader should know precisely which scatter plots were screened for a 'linear-cluster' pattern ... which hydrophobic scales and other physico-chemical properties of amino-acids were used for the scatter plots ... how many scales used"

There were approximately 175 amino-acid property scales used for our study, of which about 15-20 were derivative scales prepared from ratios of fundamental molecular properties such as volume and mass, which represent intrinsic versus extrinsic scales in the thermodynamic sense. Another 10-15 scales represent averages of literature-reported amino-acid property scales. Please note that linear clusters were not the only patterns seen; there were also some non-linear patterns and simple clusters, for example. Also note that the scales selected represent a very wide range of amino-acid properties, including secondary structure statistical propensity scales, fundamental molecular properties like surface area/mass/volume, bulk properties such as index of refraction/melting point/pK-C, HPLC retention times, free energy in water, solvent/water partitioning, average fraction occurring in proteins, average amino-acid burial in proteins/amino-acid exposure to water in folded proteins, hydrophobicity scales, NMR parameters, Rf parameters, several literature Principal Component Analysis/Factor Analysis parametric scales and other miscellaneous property scales. Of the 175 amino-acid property scales (and about 200 total scales), 124 scales had a statistically significant (P<0.05) MLR regression correlation with the 8 property classes that we have reported.

"... the eight amino-acid property classes most important novelty of this paper ... what is the reasoning that AA property classes represent real AA physico-chemical properties in the context of folded proteins ... rather was the assumption that high R² from MLR justifies assertion of relatedness to physico-chemical properties ... a high R² not a valid basis for assumption ..."

We agree that the 8 amino-acid property scales are novel and an important finding. We disagree with the comments on the MLR regression correlation and assert that high MLR R² values are both statistically and physically valid measures for the reasons discussed above. The high and statistically significant coefficients of determination are spread over many types of amino-acid physico-chemical property scales, including structural statistical propensity scales and hydrophobicity scales. We find that there is a strong relationship between amino-acid HPLC retention times and a number of the higher quality hydrophobicity scales in our assembled database; hence there is a direct link between a measurable amino-acid bulk property and amino-acid partitioning behavior in the structure of folded proteins.

"... the authors should describe the method that they used to identify linear clusters on a plot ... follow up analysis of physico-chemical/biochemical properties of clusters ... regression analysis ..."

We did describe most of the points commented on here in the original manuscript, but agree that the descriptions need to be expanded upon. There were a number of linear clustering patterns, amongst other patterns, that can be described as multiple quasi-parallel series, multi-linear series that intersect at a given amino-acid serving as a quasi or virtual origin (e.g. Karplus 1997), and quasi-parallel cross hatched linear patterns (e.g. figures 1 & 2 in the manuscript). We ran, although we did not report, a regression correlation coefficient on each series to assess its relative linearity. Where we found what we discerned as significant relationships, we evaluated those putative relationships from the point of view of physical reality as expressed by fundamental molecular properties and potential relationships with water, especially with aqueous clathrate membranes.

" ... how did the authors end up with the final set of scatter plots from which AA property classes were derived ... were they linear clusters ... did authors select scatterplots on basis of plotted variable relevance ... what was the criteria used to select most relevant scatterplots ..."

Our primary and starting premise was to look for single linear or non-linear patterns across all amino-acids in the scatterplot of any two given scales. Along the way we found multiple linear series occurring in scatterplots between many pairs of amino-acid properties, which was a pattern we could not ignore. These multi-family linear series ranged from clearly discernible to very high quality; we concentrated on the latter end of the range. We also cross checked to see whether sister AA property scales in the same class, such as hydrophobicity, resulted in similar patterns, as we see in figures 1 and 2 and in table 2 of the manuscript. Once the linear/non-linear (single and multi-family) patterns were found, a detailed review of these patterns was made to find underlying physico-chemical and biological reasons as well as statistical generalizations. The most relevant scatter plots were selected based upon their quality (visual and linear/non-linear regression), their explanatory power for protein structure and function, and their reliability/specificity for protein alignments.

"... all property classes including #6, #7 & #8 should be precisely defined ... derived from 49 fundamental amino-acid properties/derived scales ... Analysis of Patterns (ANOPA) ... need expanded description ..."

We have defined property classes #1 through #5, but we will expand upon these definitions. We will define property classes #6-#8 with plots and physical explanations. We will describe the ANOPA procedure and explain the specific 49 amino-acid property scale ANOPA analysis used in this study. The 49 amino-acid property scales used in the ANOPA analysis are property scales gleaned from the literature, and no derivative scales were used for this portion of the overall analysis. We will be adding a couple of new tables to define the 49 amino-acid properties used in the ANOPA analysis and the 124 statistically significant (by regression correlation) amino-acid property scales.

" ... show scatter plots for property classes #5 through #8 ..."

We agree to revise the manuscript accordingly.

"... need expanded introduction ... expand more on AA physico-chemical properties relevant to protein folding ..."

We agree to modify the manuscript according to these comments, except that we note that the scope of this paper is to define our hydrophobicity scale and its application to protein alignments, whereas we take up the challenge of applying this work to protein folding with an extensive analysis in our follow up paper “Hydrophobicity Revisited: a Molecular Story.” This latter manuscript is in an advanced state of preparation and can be provided for review, which we recommend, since its treatment of the material goes well beyond what we can put into the current manuscript you have reviewed.

"... there is material in the results section (first paragraph) and discussion section (alignment matrices) that belong in the introduction section ..."

We will revise the manuscript per these comments.

"... hydrophobic scale chosen as optimal ... normalized average of 3 hydrophobicity scales ... most robust in correlation analysis ... robustness vaguely defined in methods section as association to multiple, fundamental AA properties using multivariate statistical procedures, thermodynamics and biophysical chemistry considerations ... later find out that this means R² values derived from MLR models ..."

We have expanded upon what this means in our responses to your review. We will incorporate this material into the manuscript and significantly beef up the methods section to accommodate the spirit of your comments. We also note that the statistical correlations were only part of the rationale for defining a “robust” hydrophobicity scale, which is based upon bringing a coherent theoretical analysis to bear upon this work as well. We will be adding some additional material requested by Dr. Carter that should speak to these comments. We offer to make available for your review the TMATCH theory and application papers to provide additional justification for what constitutes a robust hydrophobicity scale as we use the term.

"The methods section needs to be written more clearly"

We agree and will modify the manuscript accordingly.

"... Need better tag description for table 2 owing to the considerable discussion of table two columns in the manuscript ..."

We agree and will modify the manuscript accordingly.
Ana Jeroncic

"... Ambiguously defined classes #6, #7 and #8 were derived from 49 fundamental amino-acid properties and derived scales that are based upon an analysis with Analysis of Patterns (ANOPA) ..."

The 49 AA properties are represented by:

2 sequence frequency scales

6 secondary structure propensity scales

8 hydrophobicity scales

4 free energy scales (in water, protein folding/unfolding)

9 HPLC retention time scales

4 probability of an AA inside a folded protein core or on the outside

7 molecular property scales

9 physical (measured) property scales

The ANOPA method is a pattern recognition and pattern projection method. The ANOPA procedure projects n-space pattern points/vectors into a 3-dimensional object, which is a cylinder. The axis of the cylinder, termed the relation vector, runs from the pattern point centroid (the averages of each AA property over all points) to an outgroup average point (the averages of each AA property over a pair of selected points). The outgroup pair is selected on the basis of a histogram of the pattern point Euclidean distances from the pattern point centroid. Each pattern point is projected onto the relation vector, which yields two distances: the distance along the relation vector and the distance from the relation vector. The angle of rotation of each pattern point projection about the relation vector is also calculated. Together these form a cylindrical coordinate system, which is then converted into rectilinear coordinates. The X’ and Y’ coordinates are each pattern point’s projection distance from the relation vector multiplied by the cosine and sine, respectively, of its angle of rotation. The Z’ coordinate is simply the distance along the relation vector at which the pattern point projection intersects it.
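The projection steps above can be illustrated with a rough Python sketch. It is not the ANOPA implementation: the text does not state how the outgroup pair is read off the distance histogram or how the angle of rotation is referenced in n-space, so this sketch assumes (a) the outgroup is the two points farthest from the centroid and (b) the angle is measured from the perpendicular direction of the first point that lies off the axis. Under assumption (b) the angle lies between 0 and π, so Y’ is never negative here. The function name `anopa_like_projection` is hypothetical.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _norm(u):
    return math.sqrt(_dot(u, u))


def anopa_like_projection(points, outgroup=None):
    """Project n-dimensional points to (X', Y', Z') cylinder coordinates."""
    n, dim = len(points), len(points[0])
    centroid = [sum(p[k] for p in points) / n for k in range(dim)]

    if outgroup is None:
        # ASSUMPTION: take the two points farthest from the centroid.
        order = sorted(range(n), key=lambda i: math.dist(points[i], centroid))
        outgroup = order[-2:]
    a, b = outgroup
    out_avg = [(points[a][k] + points[b][k]) / 2 for k in range(dim)]

    # Relation vector: centroid -> outgroup average, normalised to unit length.
    rel = [o - c for o, c in zip(out_avg, centroid)]
    rel_len = _norm(rel)
    if rel_len == 0:
        raise ValueError("outgroup average coincides with the centroid")
    unit = [x / rel_len for x in rel]

    along, perps = [], []
    for p in points:
        v = [x - c for x, c in zip(p, centroid)]
        d1 = _dot(v, unit)                       # distance along the axis
        perps.append([x - d1 * u for x, u in zip(v, unit)])
        along.append(d1)

    # ASSUMPTION: reference direction = first point that lies off the axis.
    ref = next((w for w in perps if _norm(w) > 1e-12), None)

    coords = []
    for d1, w in zip(along, perps):
        d2 = _norm(w)                            # distance from the axis
        if ref is None or d2 <= 1e-12:
            angle = 0.0
        else:
            cos_angle = _dot(w, ref) / (d2 * _norm(ref))
            angle = math.acos(max(-1.0, min(1.0, cos_angle)))
        coords.append((d2 * math.cos(angle), d2 * math.sin(angle), d1))
    return coords
```

Whatever reference direction is chosen, the construction preserves each point's distance from the centroid, since X’² + Y’² + Z’² equals the squared distance along the axis plus the squared distance from it.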

There is a relatively flat and linear structure to the higher order correlation structure in the 49 AA property pattern space. The 3D ANOPA X’ coordinate has a strong linear relationship (R²=95.13%) with the angle of rotation (Ar) of the pattern point/vector projections about the relation vector. Similarly, the 3D ANOPA Y’ coordinate has a strong linear correlation (R²=99.86%) with the d2 distances of the pattern points/vectors from the relation vector.

Each of the 3D ANOPA coordinates has meaningful interpretations/associations with amino-acid properties. The X’ coordinate has 3 good linear series when scatter plotted against AA refractivity [179] and AA mass/(area/volume) [192], and 4 linear series when plotted against the Tanford hydrophobicity scale [197]. The AA refractivity scale has a strong linear relationship with the AA mass/(area/volume) derived scale, except that glycine is off the regression line. The Y’ coordinate has 2 linear series when scatter plotted against the DNA/RNA numerically encoded (0-63) lexical table (UCAG, UCTG) averages and 3 linear series when scatter plotted against the AA property derived scale 2*H-B-DB. In the latter scale, H is our hydrophobicity proclivity scale, B is a beta sheet proclivity scale derived from several published statistical scales and DB is the double bend proclivity scale derived from several literature statistical scales. The 2*H-B-DB scale reliably predicts (a number of proteins were sampled) the presence (or propensity) of alpha helices (value>0) and beta sheets (value<0), with sinusoidal trends (running average of 3) seen when the primary sequence AA ordinal numbers are plotted along the X axis. Since the Y’ coordinate has a strong relationship with the 2*H-B-DB scale, which itself has a strong relationship with protein secondary structure, we see that the Y’ coordinate also has a relationship with protein secondary structure. We can also infer that there is a definite relationship between protein secondary structure and the DNA code, which is itself related to AA hydrophobicity, as we have documented in the hydrophobicity paper, and to the fact that the secondary structure proclivity scale is by definition related to hydrophobicity.
Finally, the 3D ANOPA coordinate Z’ strongly correlates with all of the AA hydrophobicity scales (and some of the AA HPLC scales) in our database, both those used in the 49 property ANOPA analysis and those that were not used in this analysis.
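As a rough illustration of how the 2*H-B-DB scale is applied along a sequence, the sketch below combines three per-residue scales, smooths the result with a running average of 3 and reads the sign as a helix or sheet propensity. The scale values passed in would have to come from the actual H, B and DB scales, which are not reproduced here; the function names and the treatment of the sequence ends (a shortened window) are assumptions made for this example.

```python
# Minimal sketch of applying a 2*H - B - DB style scale along a sequence.
# No real scale values are included; callers supply the H, B and DB dicts.


def combined_scale(h, b, db):
    """Per-residue value 2*H - B - DB for every residue present in `h`."""
    return {aa: 2 * h[aa] - b[aa] - db[aa] for aa in h}


def running_average(values, window=3):
    """Centered running average; the window is shortened at the ends
    (an ASSUMPTION, since the text does not say how ends are handled)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def call_propensity(sequence, scale, window=3):
    """Label each position: value > 0 -> helix, value < 0 -> sheet."""
    smoothed = running_average([scale[aa] for aa in sequence], window)
    return ["helix" if v > 0 else "sheet" if v < 0 else "neutral"
            for v in smoothed]
```

Plotting the smoothed values against residue number, rather than only their signs, is what would show the sinusoidal trends mentioned in the text.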

"... MLR R² value represents a rigorous test for a robust, high performance hydrophobicity scale ... The rationale for such an assumption was that for all tested hydrophobicity scales, the R² values of the Multiple Linear Regression (MLR) were higher than the R² values of the simple binary regression models ... I disagree with the authors on the MLR R² rationale as I have concerns about the appropriateness of the data analysis"

The argument presented is not that the MLR regression was superior to the binary regression because of higher correlation coefficients, although that may possibly be true, but rather that the AA property class MLR represented more information, in that it represented the behaviors of amino-acids in a larger series of contexts and properties. Also, the argument implied by having an MLR with the 8 property class scales is that for each context the amino-acids partition into sub-sets, and that these different context sub-sets join together to determine amino-acid behavior in protein folding, interactions with water, interactions with membranes, secondary structure and electronic behaviors associated with AA-AA interactions. Moreover, we argue that within any given regression, the higher the correlation coefficient, the stronger the evidence for the superiority of a given hydrophobicity scale within that regression criterion. Top performance of a hydrophobicity scale within several regression relationship criteria based upon different property information leads to a performance/robustness conclusion based upon the consilience of the evidence.

We use the coefficient of determination (R²) rather than the correlation coefficient (R) because it is a much more conservative statistic and can meaningfully be interpreted as the percent linearity between the dependent variable and the X independent variable(s) in the regression. Concerns might be raised about two points in the property class MLR: the possibility of inter-correlation between X variables and the loss of degrees of freedom in the regression statistics (over-fitting). We deal with these concerns in several ways. First, we calculate a T test significance only for the regression itself and not for any individual X variable regression coefficient, since the Y variable regression performance itself is what is being measured. Second, in the correlation coefficient T test (the null hypothesis being that R=0), the number of degrees of freedom (20) is reduced by 2 for each X variable coefficient in the regression and by 1 for the Y axis intercept. In the AA property class MLR we calculate the resulting degrees of freedom in the regression and correlation coefficient T test as 20-(2*8+1) = 3. We allow for some dimensionality reduction due to inter-X pair correlation as long as the dimensionality reduction (i.e. the percent reduction in the number of states with discrete variables) is not large and new information is being brought forth between each pair of X variables. Where no new information is adduced by the inclusion of an X variable, we find that the magnitudes of the correlation coefficients are reduced, and are reduced further by the squaring process that gives the coefficient of determination. Furthermore, we demand that any set of property class vectors produce large, significant coefficients of determination with a large number of disparate AA property scales in our database.
The latter criterion produces a very high degree of statistical confidence that the collection of property class scales can serve as a highly robust basis set for MLR vector regressions that can evaluate the robustness/significance of individual AA property scales.
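The degrees-of-freedom accounting above can be made concrete with a short sketch. It follows the accounting stated in the text (20 amino-acids, 2 per X coefficient, 1 for the intercept) and assumes the usual form of the T statistic for testing R=0, t = R·sqrt(df/(1-R²)), with the reduced df substituted. The function names are hypothetical, and 3.182 is the standard tabulated two-tailed 5% critical value of Student's t for 3 degrees of freedom.

```python
import math

# Two-tailed 5% critical value of Student's t with 3 degrees of freedom
# (standard table value).
T_CRIT_DF3 = 3.182


def authors_degrees_of_freedom(n_obs=20, n_x=8):
    """Degrees of freedom as accounted for in the text: n - (2*k + 1)."""
    return n_obs - (2 * n_x + 1)


def t_statistic(r_squared, df):
    """t for the null hypothesis R = 0, assuming t = R * sqrt(df / (1 - R^2))."""
    if not 0 <= r_squared < 1:
        raise ValueError("r_squared must lie in [0, 1)")
    return math.sqrt(r_squared * df / (1 - r_squared))


def min_significant_r_squared(t_crit, df):
    """Smallest R^2 whose t statistic reaches t_crit (inverse of the above)."""
    return t_crit ** 2 / (t_crit ** 2 + df)


df = authors_degrees_of_freedom()               # 20 - (2*8 + 1) = 3
threshold = min_significant_r_squared(T_CRIT_DF3, df)
print(f"df = {df}, R^2 must exceed about {threshold:.3f} for P < 0.05")
```

Under these assumptions, only regressions with a coefficient of determination above roughly 0.77 pass the P<0.05 screen, which shows how strongly the reduced degrees of freedom tighten the test.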

"... the reporting in the manuscript should be substantially improved ..."

We agree and are revising the manuscript accordingly.

"Throughout the paper the description of the MLR models is very confusing ... it is not clear what models were actually run ... what was the dependent and independent variables ... what was the estimated regression coefficients and their statistical significance ..."

We have addressed these comments to some extent above. Additionally, we will move all of the related discussion of regression methodology into the methods section, where the discussion will make more sense. The key point is that the MLR X’s are not single variables, but rather column vectors, with each column vector having a value for each amino-acid. In this context we equate the concept of a scale with the concept of a column vector. Having made this distinction, the concept of a Multi-Linear Regression (MLR) is still valid and simply represents a broader mathematical context. Our use of the MLR methodology with the 8 property class scales is to evaluate a correlation relationship as a whole, so the entire regression correlation is what is used/important, and the statistical significance of individual variable regression coefficients is meaningless. If the MLR method were being used to fit a variable for the purpose of extrapolating new Y values with scale values outside of the basis set, such as with new amino-acids, then the statistical significance of variable coefficients would be germane.

"eight property classes ... vector sets of clusters/linear families of curves in multiple linear regression relationships between two (or more) amino-acid physico-chemical properties .. which is very confusing ... seems like more independent variables are being added than the eight property class variables ..."

One must keep in mind that the MLR method that we are using uses vectors (scales) of dimensionality of 20 (one for each AA) and not individual variables. When doing AA property scale pair scatter plots between each property scale selected for our database we often find clustering behavior which indicates that the amino-acids are partitioning into distinct sub-sets and each amino-acid can be assigned a numerical value for the ordinal number of the sub-set into which it partitions. The clustering relationships uncovered for any particular pair of two AA property scales is generally driven by molecular geometry, secondary group geometry, numbers and types of atoms/inter-atomic bonding in the secondary group, molecular size, molecular mass, molecular volume, molecular surface area and entropy distribution about the molecule. No given pair of AA property scale relationships represent the full range of these molecular properties and the nature of the interaction with water and cellular membranes. The key point though is that there are distinct sets and sub-sets, which when numerically encoded can jointly describe each amino-acid as a row vector of some dimensionality M. We have found through extensive analysis that M is a very good size. Within each property pair scatterplot where clustering occurs, we can have different patterns such as unstructured clusters, multiple quasi-parallel linear clusters or multiple linear clusters that intersect at some point (often at glycine or alanine). Generally, there is a geometric ordering that allows the assignment of a meaningful ordinal number. To reiterate, a property class scale (column vector) represents a relationship between property scales and the values assigned within the scale to each amino-acid are their respective sub-set/cluster ordinal number. 
Property class scales assembled in this way can be used as one of the basis set vectors to be used to evaluate the performance and reliability of any individual amino-acid property scale. We also need to note that the three property class vectors defined between the 3D ANOPA coordinate system planes represent the correlation based dimensionality reduction from a pattern space of dimension 49 to a pattern space of dimension 3.

"... majority of these property classes were actually ordinal variables ... more actual variables should have been added owing to the introduction of dummy variables ... sample size of these models too small to estimate model parameters precisely ... "

In fact all of the numbers within the 8 property class scales are ordinal numbers representing real facts of clustering and the distinct geometrical ordering of the clusters. The practice of using assigned numbers to sets/sub-sets of objects, especially if the numbers can represent some form of ordering, is widely used in multi-variate statistical procedures such as MLR regression, discriminant analysis, Principal Component Analysis (PCA)and procedures cognate to PCA, as is found in fielids such as numerical taxonomy, computational biology and other fields that numerically encode state data. The authors have enjoyed great success in using discrete data to represent state data in a number of contexts over a number of years. Since the MLR regressions are not used for predictive purposes, but rather to express the degree of correlation and relatedness as expressed in the correlation coefficient and its T test based P value, the concept if regression coefficient statistical significance is a non-sequitur.

"... R² quite inflated ... over fitted model ... large number of independent variables that the authors did not adjust for compared to simple linear binary "regression models ..."

The authors would point out that additional variables do not inflate R² values unless there is a very serious inter-variable correlation between a large sub-set of the X variables. To the contrary, we find that working with the amino-acid property scales that we have assembled that there is generally an artificial deflation of R² relationships. Therefore, we do not agree that the introduction of variables into an MLR artificially inflates a MLR correlation versus a binary regression correlation. We adjust for these concerns in three ways: use of the coefficient of determination statistic is very conservative and we screen out a number of potentially significant relationships thereby; we use a T test to assess the statistical significance of the whole regression relationship and reduce the number of degrees of freedom of the T test accordingly as described above; we uncover the specific physical meanings reflected in each property scale relationship cluster to ensure that the physical-chemical factors at play are distinct and bring new information to the table.

"... problem of multicollinearity between independent variables ... inflated MLR R² ... table 2 PC2 and PC4 had a Kendall tau coefficient of 0.82, P<0.001"

We note that property classes 1 and 2 devolve from the relationship between the hydrophobicity proclivity scale and specific absolute entropy and that property classes 3 and 4 devolve from the relationship between the hydrophobicity proclivity scale and the specific volume as seen in figures 2 and 1 respectively. Both the specific absolute entropy and the specific volume are thermodynamic intrinsic property scales formed by the normalization of the amino-acid molecular volumes and molecular absolute entropy by their molecular masses, which create two secondary or derived scales from fundamental molecular properties that have the molecular mass in common. Furthermore, both the total entropy and the volume are related in some way to the molecular mass, so dividing by the molecular mass in both cases offset that dependence and leaves an intrinsic average property. Property class 2 partitions into two linear subgroups on the basis of polar and non-polar amino-acids. Property class 4 partitions into 4 linear subgroups on the basis of size, relative amounts of non-polar area and the presence/absence of a relatively strong or weak polar group. While property class 4 does reflect the polar/non-polar distinction to some extent, it does not reflect a strong and clear partition of AA by polarity or non-polarity, therefore new information is introduced by the addition of property class 4. We note that there are 2*4 possible combinations of states between property classes 2 and 4, but in fact there are actually 5 states formed giving a dimensionality reduction of 1-5/8 =37.5%, which we consider to be acceptable. We do not see strong enhancement of MLR correlation coefficients with any statistically significant AA property regression with the 8 property class scales. 
Given the nature of the interrelationship of fundamental molecular properties based upon AA molecular size, molecular geometry, entropy/entropy density, mass/mass density, surface area or electronic configuration/hybridization, some inter-property class correlation is inevitable and must be tolerated to make any progress in understanding amino-acid physico-chemical properties and protein folding. See the additional relevant discussion above.

"... main result of this paper ... based upon assumption that amino acid property classes forming linear clusters ... within context of folded proteins ... is not a valid assumption"

We agree that the 8 property class scales are one of the major results of this study, but we do not agree that it is the only one. We do not assume that the 8 amino-acid property class scales are a major determinant of protein folding per se, but rather we assume that the fundamental molecular properties of each amino-acid, in the context of a folded protein environment or of contact with a water environment, are what determine protein folding. The assemblage of the 8 amino-acid property class scales that we report reflects a widespread clustering pattern (sometimes in linear series) in 2D scatterplots and 3D ANOPA scatterplots between two or more scales. The 8 amino-acid property class scales we have assembled for our present study reflect the best of the clustering patterns forming unique subsets of amino-acids in different contexts/molecular properties with regard to hydrophobicity scales, but also with a number of other amino-acid physico-chemical property scales, secondary structure statistical propensity scales and molecular property scales. To the extent that any given hydrophobicity scales correlate with, and were determined from, folded proteins, reflecting secondary/super-secondary structure of folded proteins, the 8 property classes that we report are relevant to folded proteins. The 8 property classes that we report have statistically significant (P<0.05) MLR regressions with 124 amino-acid property scales in our assembled database, which implies that the amino-acid behaviors reflected in these numerical scales can largely be reproduced with the clustering behaviors of the amino-acids.
While linking hydrophobicity to protein folding is an important idea, the finding that nature has selected the 20 natural amino-acids on the basis of distinct sub-sets/clustering in several different joint property dimensions is a significant finding in and of itself, perhaps leading to criteria for the engineered selection of un-natural amino-acids that could provide unique properties to active enzymes/proteins.

"... R² is an overused statistic for linear regression analysis and additional metrics are required to get the whole picture ...R² actually represents the square of the Pearson correlation coefficient"

We do not agree with the contention that the R² statistic is trivial and meaningless from overuse. Rather, we point out that R² is a more conservative statistic than the correlation coefficient and rolls off much more quickly (for example, R = 0.9 corresponds to R² = 0.81), and that it has an important interpretation as the percent linearity, which directly indicates the relative scatter of the regression calculated Y’s with respect to the actual Y’s.

"... A reader should know precisely which scatter plots were screened for a 'linear-cluster' pattern ... which hydrophobic scales and other physico-chemical properties of amino-acids were used for the scatter plots ... how many scales used"

There were approximately 175 amino-acid property scales used for our study, of which about 15-20 were derivative scales prepared from ratios of fundamental molecular properties such as volume and mass, which would represent intrinsic versus extrinsic scales in the thermodynamic sense. About another 10-15 scales represent averages of literature reported amino-acid property scales. Please note that linear clusters were not the only patterns seen. There were some non-linear patterns and simple clusters, for example. Also note that the scales selected represent a very wide range of amino-acid properties including secondary structure statistical propensity scales, fundamental molecular properties like surface area/mass/volume, bulk properties such as index of refraction/melting point/pK-C, HPLC retention times, free energy in water, solvent/water partitioning, average fraction occurring in proteins, average amino-acid burial in proteins/amino-acid exposure to water in folded proteins, hydrophobicity scales, NMR parameters, Rf parameters, several literature Principal Component Analysis/Factor Analysis parametric scales and other miscellaneous property scales. Of the 175 amino-acid property scales (and about 200 total scales), 124 scales had a statistically significant (P<0.05) MLR regression correlation with the 8 property classes that we have reported.

"... the eight amino-acid property classes most important novelty of this paper ... what is the reasoning that AA property classes represent real AA physico-chemical properties in the context of folded proteins ... rather was the assumption that high R² from MLR justifies assertion of relatedness to physico-chemical properties ... a high R² not a valid basis for assumption ..."

We agree that the 8 amino-acid property scales are novel and an important finding. We disagree with the comments on the MLR regression correlation and assert that high MLR R² values are both statistically and physically valid measures for reasons discussed above. The high and statistically significant coefficients of determination are spread over many types of amino-acid physico-chemical property scales, including structural statistical propensity scales and hydrophobicity scales. We find that there is a strong relationship between amino-acid HPLC retention times and a number of the higher quality hydrophobicity scales in our assembled database; hence there is a direct link between a measurable amino-acid bulk property and amino-acid partitioning behavior in the structure of folded proteins.

"... the authors should describe the method that they used to identify linear clusters on a plot ... follow up analysis of physico-chemical/biochemical properties of clusters ... regression analysis ..."

We did describe most of the points commented on here in the original manuscript, but agree that the descriptions need to be expanded upon. There were a number of linear clustering patterns, amongst other patterns, that can be described as multiple quasi-parallel series, multi-linear series that intersect at a given amino-acid serving as a quasi or virtual origin (e.g. Karplus 1997), and quasi-parallel cross hatched linear patterns (e.g. figures 1 & 2 in the manuscript). We ran, although we did not report, a regression correlation coefficient on each series to assess its relative linearity. Where we found what we discerned as significant relationships, we evaluated those putative relationships from the point of view of physical reality as expressed by fundamental molecular properties and potential relationships with water, especially with aqueous clathrate membranes.

" ... how did the authors end up with the final set of scatter plots from which AA property classes were derived ... were they linear clusters ... did authors select scatterplots on basis of plotted variable relevance ... what was the criteria used to select most relevant scatterplots ..."

Our primary and starting premise was to look for single linear or non-linear patterns in all amino-acids in the scatterplot of any two given scales. Along the way we found that multiple linear series occur in scatterplots between many pairs of amino-acid properties, which was a pattern we could not ignore. Many of these multi-family linear series ranged from clearly discernible to very high quality, the latter end of the range being what we concentrated on. We also cross checked to see if sister AA property scales in the same class, such as hydrophobicity, resulted in similar patterns, such as those we see in figures 1 and 2 and in table 2 of the manuscript. Once the linear/non-linear (single and multi-family) patterns were found, a detailed review of these patterns was made to find underlying physico-chemical and biological reasons as well as statistical generalizations. The most relevant scatter plots were selected based upon their quality (visual and linear/non-linear regression), their explanatory power for protein structure and function, and their reliability/specificity for protein alignments.

"... all property classes including #6, #7 & #8 should be precisely defined ... derived from 49 fundamental amino-acid properties/derived scales ... Analysis of Patterns (ANOPA) ... need expanded description ..."

We have defined property classes #1 through #5, but we will expand upon these definitions. We will define property classes #6-#8 with plots and physical explanations. We will describe the ANOPA procedure and explain the specific 49 amino-acid property scale ANOPA analysis used in this study. The 49 amino-acid property scales used in the ANOPA analysis are property scales gleaned from the literature, and no derivative scales were used for this portion of the overall analysis. We will be putting in a couple of new tables to define the 49 amino-acid properties used in the ANOPA analysis and the 124 statistically significant (by regression correlation) amino-acid property scales.

" ... show scatter plots for property classes #5 through #8 ..."

We agree to revise the manuscript accordingly.

"... need expanded introduction ... expand more on AA physico-chemical properties relevant to protein folding ..."

We agree to modify the manuscript according to these comments, except that we note that the scope of this paper is to define our hydrophobicity scale and its application to protein alignments, whereas we take up the challenge of applying this work to protein folding with an extensive analysis in our follow-up paper “Hydrophobicity Revisited: a Molecular Story.” This latter manuscript is in an advanced state of preparation and can be provided for review, which we recommend since its treatment of the material goes well beyond what we can put into the current manuscript you have reviewed. We will be putting in a couple of new tables to define the 49 amino-acid properties used in the ANOPA analysis and the 124 statistically significant (by regression correlation) amino-acid property scales.

"... there is material in the results section (first paragraph) and discussion section (alignment matrices) that belong in the introduction section ..."

We will revise the manuscript per these comments.

"... hydrophobic scale chosen as optimal ... normalized average of 3 hydrophobicity scales ... most robust in correlation analysis ... robustness vaguely defined in methods section as association to multiple, fundamental AA properties using multivariate statistical procedures, thermodynamics and biophysical chemistry considerations ... later find out that this means R² values derived from MLR models ..."

We have expanded upon what this means in our responses to your review. We will incorporate this material into the manuscript and significantly beef up the methods section to accommodate the spirit of your comments. We also note that the statistical correlations were only part of the rationale for defining a “robust” hydrophobicity scale, which is based upon bringing a coherent theoretical analysis to bear upon this work as well. We will be adding some additional material requested by Dr. Carter that should speak to these comments. We offer to make available for your review the TMATCH theory and application papers to provide additional justification for what constitutes a robust hydrophobicity scale as we use the term.

"The methods section needs to be written more clearly"

We agree and will modify the manuscript accordingly.

"... Need better tag description for table 2 owing to the considerable discussion of table two columns in the manuscript ..."

We agree and will modify the manuscript accordingly.
Competing Interests: No competing interests were disclosed.

COMMENTS ON THIS REPORT

Author Response 15 Oct 2020

David Cavanaugh, Benchmark Electronics, Huntsville, Alabama, USA

Ana Jeroncic

"... Ambiguously defined classes #6, #7 and #8 were derived from 49 fundamental amino-acid properties and derived scales that are based upon an analysis with Analysis of Patterns (ANOPA) ..."

The 49 AA properties are represented by:

2 sequence frequency scales

6 secondary structure propensity scales

8 hydrophobicity scales

4 free energy scales (in water, protein folding/unfolding)

9 HPLC retention time scales

4 probability of an AA inside a folded protein core or on the outside

7 molecular property scales

9 physical (measured) property scales

The ANOPA method is a pattern recognition, pattern projection method. The ANOPA procedure projects n-space pattern points/vectors into a 3 dimensional object, which is a cylinder. The axis of the cylinder, which is termed the relation vector, is formed from the pattern point centroid (averages of each AA property for all points) to an outgroup average point (averages of each AA property for a pair of selected points). The outgroup pair is selected on the basis of a histogram of the pattern point Euclidean distances from the pattern point centroid. Two pattern point distances are calculated with respect to the relation vector from a projection of each point onto the relation vector, yielding the distance along the relation vector and the distance from the relation vector. The angle of rotation of each pattern point projection onto the relation vector is also calculated. Thus, a cylindrical coordinate system is formed, which is then converted into rectilinear coordinates. The X’ and Y’ rectilinear points are formed from each pattern point’s pattern projection distance times the cosine and sine, respectively, of the pattern projection angle of rotation. The Z’ point is simply the pattern projection intersection distance along the relation vector.
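
The projection steps described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' ANOPA implementation: the outgroup pair is taken here to be the two points farthest from the centroid (the response selects it from a distance histogram), and the angle of rotation is measured against the off-axis component of the first outgroup point, because the response does not say how the zero angle is fixed. The names `anopa_project`, `d1` and `ref` are hypothetical.

```python
import math


def anopa_project(points):
    """Project n-dimensional points to (X', Y', Z') cylinder-based coordinates."""
    n, dim = len(points), len(points[0])

    def sub(a, b):
        return [x - y for x, y in zip(a, b)]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def norm(a):
        return math.sqrt(dot(a, a))

    centroid = [sum(p[k] for p in points) / n for k in range(dim)]

    # ASSUMPTION: outgroup pair = the two points farthest from the centroid.
    order = sorted(range(n), key=lambda i: norm(sub(points[i], centroid)),
                   reverse=True)
    first, second = points[order[0]], points[order[1]]
    outgroup = [(x + y) / 2 for x, y in zip(first, second)]

    # Relation vector: from the centroid to the outgroup average point.
    axis = sub(outgroup, centroid)
    axis_len = norm(axis)
    if axis_len == 0:
        raise ValueError("outgroup average coincides with the centroid")
    unit = [x / axis_len for x in axis]

    def split(p):
        """Return (distance along the axis, off-axis component vector)."""
        v = sub(p, centroid)
        d1 = dot(v, unit)
        return d1, [x - d1 * u for x, u in zip(v, unit)]

    # ASSUMPTION: zero angle is the off-axis direction of the first outgroup point.
    _, ref = split(first)
    ref_len = norm(ref)

    coords = []
    for p in points:
        d1, perp = split(p)
        d2 = norm(perp)                      # distance from the relation vector
        if d2 == 0 or ref_len == 0:
            angle = 0.0
        else:
            cosine = dot(perp, ref) / (d2 * ref_len)
            angle = math.acos(max(-1.0, min(1.0, cosine)))
        coords.append((d2 * math.cos(angle), d2 * math.sin(angle), d1))
    return coords
```

Because the angle here comes from an arccosine, it lies between 0 and π and Y’ is never negative; a signed angle would need a second reference direction, which is a further choice the response does not specify. The squared length of each output triple equals the squared distance of the point from the centroid, so the projection preserves distance to the centroid.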

The higher order correlation structure in the 49 AA property pattern space is relatively flat and linear. The 3D ANOPA X’ coordinate has a strong linear relationship (R²=95.13%) with the angle of rotation Ar of the pattern point/vector projections onto the relation vector. Similarly, the 3D ANOPA Y’ coordinate has a strong linear correlation (R²=99.86%) with the d2 distances of the pattern points/vectors from the relation vector.

Each of the 3D ANOPA coordinates has meaningful interpretations/associations with amino-acid properties. The X’ coordinates have 3 good linear series when scatter plotted with AA refractivity [179] and AA mass/(area/volume) [192], and they have 4 linear series with the Tanford hydrophobicity scale [197]. The AA refractivity scale has a strong linear relationship with the AA mass/(area/volume) derived scale, except that Glycine is off the regression line. The Y’ coordinates have 2 linear series when scatter plotted against the DNA/RNA numerically encoded (0-63) lexical table (UCAG, UCTG) averages and 3 linear series when scatter plotted against the AA property derived scale 2*H-B-DB. In the latter scale H is our hydrophobicity proclivity scale, B is a beta sheet proclivity scale derived from several published statistical scales and DB is the double bend proclivity scale derived from several literature statistical scales. The 2*H-B-DB scale reliably predicts (a number of proteins were sampled) the presence (or propensity) of alpha helices (value>0) and beta sheets (value<0) with sinusoidal trends (running average of 3), as seen with the primary sequence AA ordinal numbers plotted along the X axis. Since the Y’ coordinate has a strong relationship with the 2*H-B-DB scale, which itself has a strong relationship with protein secondary structure, we see that the Y’ coordinate also has a relationship with protein secondary structure. We can also infer that there is a definite relationship between protein secondary structure and the DNA code, which is itself related to AA hydrophobicity as we have documented in the hydrophobicity paper, and we note that the secondary structure proclivity scale is by definition related to hydrophobicity.
Finally, the 3D ANOPA coordinate Z’ strongly correlates with all of the AA hydrophobicity scales (and some of the AA HPLC scales) in our database, both those used in the 49 property ANOPA analysis and those that were not used in this analysis.
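
As a rough illustration of how the 2*H-B-DB scale with a running average of 3 could be evaluated along a sequence, consider the sketch below. The function names, the handling of the sequence ends (shorter windows) and all scale values are assumptions made for this example; the numbers are placeholders and are not the published H, B or DB scales.

```python
def helix_sheet_profile(sequence, H, B, DB, window=3):
    """Running average of 2*H - B - DB along a sequence.

    H, B and DB map one-letter amino-acid codes to scale values.
    ASSUMPTION: windows are truncated at the sequence ends.
    """
    raw = [2 * H[aa] - B[aa] - DB[aa] for aa in sequence]
    half = window // 2
    smoothed = []
    for i in range(len(raw)):
        chunk = raw[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def label(value):
    """Sign convention from the text: >0 helix propensity, <0 sheet propensity."""
    if value > 0:
        return "helix"
    if value < 0:
        return "sheet"
    return "neutral"


# Placeholder scale values for three residues (NOT the published scales).
H = {"A": 1, "V": 0, "G": -1}
B = {"A": 0, "V": 1, "G": 0}
DB = {"A": 0, "V": 0, "G": 1}

profile = helix_sheet_profile("AAVVG", H, B, DB)
print([(round(v, 2), label(v)) for v in profile])
```

Plotting `profile` against the residue ordinal number would give the kind of trend line the response describes, with positive stretches read as helix propensity and negative stretches as sheet propensity.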

"... MLR R² value represents a rigorous test for a robust, high performance hydrophobicity scale ... The rationale for such an assumption was that for all tested hydrophobicity scales, the R² values of the Multiple Linear Regression (MLR) were higher than the R² values of the simple binary regression models ... I disagree with the authors on the MLR R² rationale as I have concerns about the appropriateness of the data analysis"

The argument presented is not that the MLR regression was superior to the binary regression because of higher correlation coefficients, although that may possibly be true, but rather that the AA property class MLR represented more information in that it represented the behaviors of amino-acids in a larger series of contexts and properties. Also, the argument implied by having an MLR with the 8 property class scales is that for each context the amino-acids partition into sub-sets and that these different context sub-sets join together to determine amino-acid behavior in protein folding, interactions with water, interactions with membranes, secondary structure and electronic behaviors associated with AA-AA interactions. Moreover, we argue that within any given regression, the higher the correlation coefficient, the stronger the evidence for the superiority of a given hydrophobicity scale within that regression criterion. Top performance of a hydrophobicity scale within several regression relationship criteria based upon different property information leads to a performance/robustness conclusion based upon the consilience of the evidence.

We use the coefficient of determination (R²) rather than the correlation coefficient (R) because it is a much more conservative statistic and can meaningfully be interpreted as the percent linearity between the dependent variable and the X independent variable(s) in the regression. Concerns might be raised about two points in the property class MLR, which are the possibility of inter-correlation between X variables and the loss of degrees of freedom in the regression statistics (over-fitting). We deal with these concerns in several ways. First, we calculate a T test significance only for the regression itself and not for any individual X variable regression coefficient, since the Y variable regression performance itself is what is being measured. Secondly, the number of degrees of freedom (20) for the correlation coefficient T test (null hypothesis being that R=0) is reduced by 2 for each X variable coefficient in the regression and by 1 for the Y axis intercept. In the AA property class MLR we calculate the resulting degrees of freedom in the regression and correlation coefficient T test as 20-(2*8+1) = 3. We allow for some dimensionality reduction due to inter-X pair correlation as long as the dimensionality reduction (i.e. percent reduction in the number of states with discrete variables) is not large and new information is being brought forth between each pair of X variables. Where there is no new information adduced by inclusion of an X variable, we find that the magnitudes of the correlation coefficients are reduced, and they are further reduced by the squaring process used to get the coefficient of determination. Furthermore, we demand that any set of property class vectors produce large, significant coefficients of determination with a large number of disparate AA property scales in our database.
The latter criterion produces a very high degree of statistical confidence that the collection of property class scales can serve as a highly robust basis set for MLR vector regressions that can evaluate the robustness/significance of individual AA property scales.
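
The degrees-of-freedom bookkeeping described above can be written out explicitly. The form of the t statistic below is the standard one for testing the null hypothesis R=0 and is an assumption here, since the response does not state the formula that was used; the symbol ν is introduced only for this illustration.

```latex
\nu = 20 - (2 \times 8 + 1) = 3,
\qquad
t = R \sqrt{\frac{\nu}{1 - R^{2}}}
```

Under this assumed form, and with the two-sided critical value of about 3.18 for 3 degrees of freedom at P = 0.05, a regression would need R² above roughly 0.77 to be called significant, which shows how strongly the reduced degrees of freedom tighten the test.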

"... the reporting in the manuscript should be substantially improved ..."

We agree and are revising the manuscript accordingly.

"Throughout the paper the description of the MLR models is very confusing ... it is not clear what models were actually run ... what was the dependent and independent variables ... what was the estimated regression coefficients and their statistical significance ..."

We have addressed these comments to some extent above. Additionally, we will move all of the related discussion of regression methodology into the methods section, where the discussion will make more sense. The key point is that the MLR X’s are not single variables, but rather are column vectors, with each column vector having a value for each amino-acid. In this context we equate the concept of a scale with the concept of a column vector. Having made this distinction, the concept of a Multiple Linear Regression (MLR) is still valid and simply represents a broader mathematical context. Our use of the MLR methodology with the 8 property class scales is to evaluate a correlation relationship as a whole, so the entire regression correlation is what is used/important and the statistical significance of individual variable regression coefficients is meaningless. If the MLR method were being used to fit a variable for the purpose of extrapolating new Y values with scale values outside of the basis set, such as with new amino-acids, then the statistical significance of variable coefficients would be germane.
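
To make the "scales as column vectors" point concrete, the sketch below regresses one scale (the Y vector) on several class vectors (the X columns) plus an intercept and reports only the whole-regression R². It is an ordinary least-squares illustration written for this purpose, not the authors' code; the function name and the small demonstration vectors are hypothetical. In the setting described above, each column would have 20 entries (one per amino-acid) and there would be 8 columns.

```python
def mlr_r_squared(columns, y):
    """R² of regressing y on the given column vectors plus an intercept."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])

    # Normal equations (X^T X) b = X^T y, stored as an augmented matrix.
    aug = [[sum(r[a] * r[b] for r in rows) for b in range(k)]
           + [sum(r[a] * yi for r, yi in zip(rows, y))]
           for a in range(k)]

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("collinear columns: normal equations are singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]

    beta = [aug[i][k] / aug[i][i] for i in range(k)]
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical demonstration: two ordinal class vectors and one scale.
class_1 = [1, 2, 3, 4, 1, 2]
class_2 = [1, 1, 2, 2, 3, 3]
scale = [1.1, 2.9, 6.2, 8.1, 6.8, 9.2]
print(round(mlr_r_squared([class_1, class_2], scale), 3))
```

Only the overall R² is returned, which mirrors the stated use of the regression as a measure of relatedness for the whole set of columns rather than as a source of individually interpreted coefficients.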

"eight property classes ... vector sets of clusters/linear families of curves in multiple linear regression relationships between two (or more) amino-acid physico-chemical properties .. which is very confusing ... seems like more independent variables are being added than the eight property class variables ..."

One must keep in mind that the MLR method that we are using uses vectors (scales) of dimensionality 20 (one value for each AA) and not individual variables. When making scatter plots between pairs of AA property scales selected for our database, we often find clustering behavior, which indicates that the amino-acids are partitioning into distinct sub-sets, and each amino-acid can be assigned a numerical value for the ordinal number of the sub-set into which it partitions. The clustering relationships uncovered for any particular pair of AA property scales are generally driven by molecular geometry, secondary group geometry, numbers and types of atoms/inter-atomic bonding in the secondary group, molecular size, molecular mass, molecular volume, molecular surface area and entropy distribution about the molecule. No given pair of AA property scale relationships represents the full range of these molecular properties and the nature of the interaction with water and cellular membranes. The key point, though, is that there are distinct sets and sub-sets, which when numerically encoded can jointly describe each amino-acid as a row vector of some dimensionality M. We have found through extensive analysis that M = 8 is a very good size. Within each property pair scatterplot where clustering occurs, we can have different patterns such as unstructured clusters, multiple quasi-parallel linear clusters or multiple linear clusters that intersect at some point (often at glycine or alanine). Generally, there is a geometric ordering that allows the assignment of a meaningful ordinal number. To reiterate, a property class scale (column vector) represents a relationship between property scales, and the value assigned within the scale to each amino-acid is its respective sub-set/cluster ordinal number.
Property class scales assembled in this way can be used as basis set vectors to evaluate the performance and reliability of any individual amino-acid property scale. We also need to note that the three property class vectors defined between the 3D ANOPA coordinate system planes represent the correlation based dimensionality reduction from a pattern space of dimension 49 to a pattern space of dimension 3.

"... majority of these property classes were actually ordinal variables ... more actual variables should have been added owing to the introduction of dummy variables ... sample size of these models too small to estimate model parameters precisely ... "

In fact, all of the numbers within the 8 property class scales are ordinal numbers representing real facts of clustering and the distinct geometrical ordering of the clusters. The practice of assigning numbers to sets/sub-sets of objects, especially if the numbers can represent some form of ordering, is widely used in multi-variate statistical procedures such as MLR regression, discriminant analysis, Principal Component Analysis (PCA) and procedures cognate to PCA, as is found in fields such as numerical taxonomy, computational biology and other fields that numerically encode state data. The authors have enjoyed great success in using discrete data to represent state data in a number of contexts over a number of years. Since the MLR regressions are not used for predictive purposes, but rather to express the degree of correlation and relatedness as expressed in the correlation coefficient and its T test based P value, the concept of regression coefficient statistical significance is a non sequitur.

"... R² quite inflated ... over fitted model ... large number of independent variables that the authors did not adjust for compared to simple linear binary "regression models ..."

The authors would point out that additional variables do not inflate R² values unless there is a very serious inter-variable correlation between a large sub-set of the X variables. To the contrary, we find, working with the amino-acid property scales that we have assembled, that there is generally an artificial deflation of R² relationships. Therefore, we do not agree that the introduction of variables into an MLR artificially inflates an MLR correlation versus a binary regression correlation. We adjust for these concerns in three ways: use of the coefficient of determination statistic is very conservative, and we thereby screen out a number of potentially significant relationships; we use a T test to assess the statistical significance of the whole regression relationship and reduce the number of degrees of freedom of the T test accordingly, as described above; and we uncover the specific physical meanings reflected in each property scale relationship cluster to ensure that the physical-chemical factors at play are distinct and bring new information to the table.

"... problem of multicollinearity between independent variables ... inflated MLR R² ... table 2 PC2 and PC4 had a Kendall tau coefficient of 0.82, P<0.001"

We note that property classes 1 and 2 devolve from the relationship between the hydrophobicity proclivity scale and specific absolute entropy and that property classes 3 and 4 devolve from the relationship between the hydrophobicity proclivity scale and the specific volume as seen in figures 2 and 1 respectively. Both the specific absolute entropy and the specific volume are thermodynamic intrinsic property scales formed by the normalization of the amino-acid molecular volumes and molecular absolute entropy by their molecular masses, which create two secondary or derived scales from fundamental molecular properties that have the molecular mass in common. Furthermore, both the total entropy and the volume are related in some way to the molecular mass, so dividing by the molecular mass in both cases offset that dependence and leaves an intrinsic average property. Property class 2 partitions into two linear subgroups on the basis of polar and non-polar amino-acids. Property class 4 partitions into 4 linear subgroups on the basis of size, relative amounts of non-polar area and the presence/absence of a relatively strong or weak polar group. While property class 4 does reflect the polar/non-polar distinction to some extent, it does not reflect a strong and clear partition of AA by polarity or non-polarity, therefore new information is introduced by the addition of property class 4. We note that there are 2*4 possible combinations of states between property classes 2 and 4, but in fact there are actually 5 states formed giving a dimensionality reduction of 1-5/8 =37.5%, which we consider to be acceptable. We do not see strong enhancement of MLR correlation coefficients with any statistically significant AA property regression with the 8 property class scales. 
Given the nature of the interrelationship of fundamental molecular properties based upon AA molecular size, molecular geometry, entropy/entropy density, mass/mass density, surface area or electronic configuration/hybridization, some inter-property class correlation is inevitable and must be tolerated to make any progress in understanding amino-acid physico-chemical properties and protein folding. See the additional relevant discussion above.

"... main result of this paper ... based upon assumption that amino acid property classes forming linear clusters ... within context of folded proteins ... is not a valid assumption"

We agree that the 8 property class scales are one of the major results of this study, but we do not agree that it is the only one. We do not assume that the 8 amino-acid property class scales are a major determinate of protein folding per se, but rather we assume the fundamental molecular properties of each amino-acid in the context of a folded protein environment or the context of contact with a water environment are what determine protein folding. The assemblage of the 8 amino-acid property class scales that we report reflects a wide spread clustering pattern (sometimes in linear series) in 2D scatterplots and 3D ANOPA scatterplots between two or more scales. The 8 amino-acid property class scales we have assembled for our present study reflect the best of the clustering patterns forming unique subsets of amino-acids in different contexts/molecular properties with regard to hydrophobicity scales, but also with a number of other amino-acid physico-chemical property scales, secondary structure statistical propensity scales and molecular property scales. To the extent that any given hydrophobicity scales correlate with, and were determined from folded proteins, reflecting secondary/super-secondary structure of folded proteins, then the 8 property classes that we report are relevant to folded proteins. The 8 property classes that we report have statistically significant (P<0.05) MLR regressions with 124 amino-acid property scales in our assembled database, which implies that the amino-acid clustering behaviors reflected in these numerical scales can largely be reproduced with the clustering behaviors of the amino-acids. 
While the idea of linking of hydrophobicity to protein folding is an important idea, the finding that nature has selected the 20 natural amino-acids on the basis of distinct sub-sets/clustering in several different joint property dimensions is a significant finding in and of itself, perhaps leading to criteria for the engineered selection of un-natural amino-acids that could provide unique properties to active enzymes/proteins.

"... R² is an overused statistic for linear regression analysis and additional metrics are required to get the whole picture ...R² actually represents the square of the Pearson correlation coefficient"

We do not agree with the contention that the R² statistic is trivial and meaningless from overuse. Rather we point out that R² is a more conservative statistic than is the correlation coefficient and rolls off much more quickly, as well as having an important interpretation as the percent linearity that directly indicates the relative scatter of the regression calculated Y’s with respect to the actual Y’s.

"... A reader should know precisely which scatter plots were screened for a 'linear-cluster' pattern ... which hydrophobic scales and other physico-chemical properties of amino-acids were used for the scatter plots ... how many scales used"

There were approximately 175 amino-acid property scales used for our study, of which roughly 15-20 were derivative scales prepared from ratios of fundamental molecular properties such as volume and mass, which would represent intrinsic versus extrinsic scales in the thermodynamic sense. About another 10-15 scales represent averages of literature reported amino-acid property scales. Please note that linear clusters were not the only patterns seen; there were also some non-linear patterns and simple clusters, for example. Also note that the scales selected represent a very wide range of amino-acid properties, including secondary structure statistical propensity scales, fundamental molecular properties like surface area/mass/volume, bulk properties such as index of refraction/melting point/pK-C, HPLC retention times, free energy in water, solvent/water partitioning, average fraction occurring in proteins, average amino-acid burial in proteins/amino-acid exposure to water in folded proteins, hydrophobicity scales, NMR parameters, Rf parameters, several literature Principal Component Analysis/Factor Analysis parametric scales and other miscellaneous property scales. Of the 175 amino-acid property scales (and about 200 total scales), 124 scales had statistically significant (P<0.05) MLR regression correlation with the 8 property classes that we have reported.

"... the eight amino-acid property classes most important novelty of this paper ... what is the reasoning that AA property classes represent real AA physico-chemical properties in the context of folded proteins ... rather was the assumption that high R² from MLR justifies assertion of relatedness to physico-chemical properties ... a high R² not a valid basis for assumption ..."

We agree that the 8 amino-acid property scales are novel and an important finding. We disagree with the comments on the MLR regression correlation and assert that high MLR R² values are both statistically and physically valid measures for reasons discussed above. The high and statistically significant coefficients of determination are spread over many types of amino-acid physico-chemical property scales, including structural statistical propensity scales and hydrophobicity scales. We find that there is a strong relationship between amino-acid HPLC retention times and a number of the higher quality hydrophobicity scales in our assembled database; hence there is a direct link between a measurable amino-acid bulk property and amino-acid partitioning behavior in the structure of folded proteins.

"... the authors should describe the method that they used to identify linear clusters on a plot ... follow up analysis of physico-chemical/biochemical properties of clusters ... regression analysis ..."

We did describe most of the points commented on here in the original manuscript, but agree that the descriptions need to be expanded upon. There were a number of linear clustering patterns, amongst other patterns, that can be described as multiple quasi-parallel series, multi-linear series that intersect at a given amino-acid serving as a quasi or virtual origin (e.g. Karplus 1997), and quasi-parallel cross hatched linear patterns (e.g. figures 1 & 2 in the manuscript). We ran, although we did not report, a regression correlation coefficient on each series to assess its relative linearity. Where we found what we discerned as significant relationships, we evaluated those putative relationships from the point of view of physical reality as expressed by fundamental molecular properties and potential relationships with water, especially with aqueous clathrate membranes.

" ... how did the authors end up with the final set of scatter plots from which AA property classes were derived ... were they linear clusters ... did authors select scatterplots on basis of plotted variable relevance ... what was the criteria used to select most relevant scatterplots ..."

Our primary and starting premise was to look for single linear or non-linear patterns in all amino-acids in the scatterplot of any two given scales. Along the way we found multiple linear series occurring in scatterplots between many pairs of amino-acid properties, which was a pattern we could not ignore. Many of these multi-family linear series ranged from clearly discernible to very high quality, the latter end of the range being what we concentrated on. We also cross checked to see if sister AA property scales in the same class, such as hydrophobicity, resulted in similar patterns, such as those we see in figures 1 and 2 and in table 2 of the manuscript. Once the linear/non-linear (single and multi-family) patterns were found, a detailed review of these patterns was made to find underlying physico-chemical and biological reasons as well as statistical generalizations. The most relevant scatter plots were selected based upon their quality (visual and linear/non-linear regression), their explanatory power for protein structure and function, and their reliability/specificity for protein alignments.

"... all property classes including #6, #7 & #8 should be precisely defined ... derived from 49 fundamental amino-acid properties/derived scales ... Analysis of Patterns (ANOPA) ... need expanded description ..."

We have defined property classes #1 through #5, but we will expand upon these definitions. We will define property classes #6-#8 with plots and physical explanations. We will describe the ANOPA procedure and explain the specific 49 amino-acid property scale ANOPA analysis used in this study. The 49 amino-acid property scales used in the ANOPA analysis are property scales gleaned from the literature, and no derivative scales were used for this portion of the overall analysis. We will be adding a couple of new tables to define the 49 amino-acid properties used in the ANOPA analysis and the 124 statistically significant (by regression correlation) amino-acid property scales.

" ... show scatter plots for property classes #5 through #8 ..."

We agree to revise the manuscript accordingly.

"... need expanded introduction ... expand more on AA physico-chemical properties relevant to protein folding ..."

We agree to modify the manuscript according to these comments, except that we note that the scope of this paper is to define our hydrophobicity scale and its application to protein alignments, whereas we take up the challenge of applying this work to protein folding with an extensive analysis in our follow up paper “Hydrophobicity Revisited: a Molecular Story.” This latter manuscript is in an advanced state of preparation and can be provided for review, which is recommended since the treatment of the material in that manuscript is well beyond what we can put into the current manuscript you have reviewed. We will be adding a couple of new tables to define the 49 amino-acid properties used in the ANOPA analysis and the 124 statistically significant (by regression correlation) amino-acid property scales.

"... there is material in the results section (first paragraph) and discussion section (alignment matrices) that belong in the introduction section ..."

We will revise the manuscript per these comments.

"... hydrophobic scale chosen as optimal ... normalized average of 3 hydrophobicity scales ... most robust in correlation analysis ... robustness vaguely defined in methods section as association to multiple, fundamental AA properties using multivariate statistical procedures, thermodynamics and biophysical chemistry considerations ... later find out that this means R² values derived from MLR models ..."

We have expanded upon what this means in our responses to your review. We will incorporate this material into the manuscript and significantly beef up the methods section to accommodate the spirit of your comments. We also note that the statistical correlations were only part of the rationale for defining a “robust” hydrophobicity scale, which is based upon bringing a coherent theoretical analysis to bear upon this work as well. We will be adding some additional material requested by Dr. Carter that should speak to these comments. We offer to make available for your review the TMATCH theory and application papers to provide additional justification for what constitutes a robust hydrophobicity scale as we use the term.

"The methods section needs to be written more clearly"

We agree and will modify the manuscript accordingly.

"... Need better tag description for table 2 owing to the considerable discussion of table two columns in the manuscript ..."

We agree and will modify the manuscript accordingly.
Ana Jeroncic

"... Ambiguously defined classes #6, #7 and #8 were derived from 49 fundamental amino-acid properties and derived scales that are based upon an analysis with Analysis of Patterns (ANOPA) ..."

The 49 AA properties are represented by:

2 sequence frequency scales

6 secondary structure propensity scales

8 hydrophobicity scales

4 free energy scales (in water, protein folding/unfolding)

9 HPLC retention time scales

4 probability of an AA inside a folded protein core or on the outside

7 molecular property scales

9 physical (measured) property scales

The ANOPA method is a pattern recognition, pattern projection method. The ANOPA procedure projects n-space pattern points/vectors into a 3-dimensional object, which is a cylinder. The axis of the cylinder, which is termed the relation vector, is formed from the pattern point centroid (averages of each AA property for all points) to an outgroup average point (averages of each AA property for a pair of selected points). The outgroup pair is selected on the basis of a histogram of the pattern point Euclidean distances from the pattern point centroid. Two pattern point distances are calculated with respect to the relation vector from a projection of each point onto the relation vector, yielding the distance along the relation vector and the distance from the relation vector. The angle of rotation of each pattern point projection onto the relation vector is calculated. Thus, a cylindrical coordinate system is formed, which is then converted into rectilinear coordinates. The X’ and Y’ rectilinear points are formed from each pattern point’s pattern projection distance times the cosine and sine, respectively, of the pattern projection angle of rotation. The Z’ point is simply the pattern projection intersection distance along the relation vector.
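
The projection steps just described can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' ANOPA implementation: the function name `anopa_project` is hypothetical, the outgroup pair is passed in by hand rather than chosen from a distance histogram, and, because the text does not say how the zero of the rotation angle is fixed, the angle is measured here against the perpendicular component of an arbitrarily chosen reference point (`ref_idx`).

```python
import math


def anopa_project(points, outgroup_idx, ref_idx=0):
    """Project n-dimensional pattern points to (X', Y', Z') about a relation vector.

    Assumptions (not from the source): the outgroup pair is supplied by the
    caller, and the rotation angle is measured from the perpendicular
    component of points[ref_idx].
    """
    n = len(points[0])
    # Pattern point centroid: average of each property over all points.
    centroid = [sum(p[k] for p in points) / len(points) for k in range(n)]
    # Outgroup average point: average of the selected pair.
    i, j = outgroup_idx
    outgroup = [(points[i][k] + points[j][k]) / 2.0 for k in range(n)]
    # Relation vector: centroid -> outgroup average, normalised to unit length.
    axis = [o - c for o, c in zip(outgroup, centroid)]
    norm = math.sqrt(sum(a * a for a in axis))
    unit = [a / norm for a in axis]

    def split(p):
        rel = [x - c for x, c in zip(p, centroid)]
        d1 = sum(r * u for r, u in zip(rel, unit))       # distance along the axis
        perp = [r - d1 * u for r, u in zip(rel, unit)]   # component off the axis
        d2 = math.sqrt(sum(q * q for q in perp))         # distance from the axis
        return d1, d2, perp

    _, ref_len, ref_perp = split(points[ref_idx])
    coords = []
    for p in points:
        d1, d2, perp = split(p)
        if d2 == 0.0 or ref_len == 0.0:
            angle = 0.0
        else:
            c = sum(a * b for a, b in zip(perp, ref_perp)) / (d2 * ref_len)
            angle = math.acos(max(-1.0, min(1.0, c)))    # clamp rounding error
        # Cylindrical (d2, angle, d1) -> rectilinear (X', Y', Z').
        coords.append((d2 * math.cos(angle), d2 * math.sin(angle), d1))
    return coords


# Toy 3-property example; the numbers are placeholders, not amino-acid data.
pts = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
for xyz in anopa_project(pts, outgroup_idx=(2, 3)):
    print(tuple(round(v, 3) for v in xyz))
```

Whatever reference is used for the angle, X’² + Y’² + Z’² equals each point's squared Euclidean distance from the centroid, which is a convenient sanity check on any implementation.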

There is a relatively flat and linear structure to the higher order correlation structure in the 49 AA property pattern space. The 3D ANOPA X’ coordinate has a strong linear relationship (R²=95.13%) with the angle of rotation (Ar) of the pattern point/vector projections onto the relation vector. Similarly, the 3D ANOPA Y’ coordinate has a strong linear correlation (R²=99.86%) with the d2 distances of the pattern points/vectors from the relation vector.

Each of the 3D ANOPA coordinates has meaningful interpretations/associations with amino-acid properties. The X’ coordinates have 3 good linear series when scatter plotted with AA refractivity [179] and AA mass/(area/volume) [192], and 4 linear series with the Tanford hydrophobicity scale [197]. The AA refractivity scale has a strong linear relationship with the AA mass/(area/volume) derived scale, except that glycine is off the regression line. The Y’ coordinates have 2 linear series when scatter plotted against the DNA/RNA numerically encoded (0-63) lexical table (UCAG, UCTG) averages and 3 linear series when scatter plotted against the AA property derived scale 2*H-B-DB. In the latter scale, H is our hydrophobicity proclivity scale, B is a beta sheet proclivity scale derived from several published statistical scales and DB is the double bend proclivity scale derived from several literature statistical scales. The 2*H-B-DB scale reliably predicts (a number of proteins were sampled) the presence (or propensity) of alpha helices (value>0) and beta sheets (value<0), with sinusoidal trends (running average of 3) seen when the primary sequence AA ordinal numbers are plotted along the X axis. Since the Y’ coordinate has a strong relationship with the 2*H-B-DB scale, which itself has a strong relationship with protein secondary structure, we see that the Y’ coordinate also has a relationship with protein secondary structure. We can also infer that there is a definite relationship between protein secondary structure and the DNA code, which in itself is related to AA hydrophobicity, as we have documented in the hydrophobicity paper, and from the fact that the secondary structure proclivity scale by definition is related to hydrophobicity. 
Finally, the 3D ANOPA coordinate Z’ strongly correlates with all of the AA hydrophobicity scales (and some of the AA HPLC scales) in our database, both those used in the 49 property ANOPA analysis and those that were not used in this analysis.
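
The 2*H-B-DB composite and its running average of 3, described above, can be sketched as follows. The per-residue H, B and DB values are made-up placeholders (the actual scale values are not given in this response), the function names are hypothetical, and the running average is assumed to use full windows only.

```python
def composite_scale(h, b, db):
    """Per-residue 2*H - B - DB, given per-residue H, B and DB values."""
    return [2.0 * hi - bi - dbi for hi, bi, dbi in zip(h, b, db)]


def running_average(values, window=3):
    """Running average over full windows only (assumed edge handling)."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def call_structure(smoothed):
    """Sign rule from the text: value > 0 -> helix, value < 0 -> sheet."""
    return ["helix" if v > 0 else "sheet" if v < 0 else "undetermined"
            for v in smoothed]


# Placeholder per-residue values for a six-residue stretch (not real scale data).
h = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
b = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
db = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

smoothed = running_average(composite_scale(h, b, db))
print(list(zip(smoothed, call_structure(smoothed))))
```

In practice the H, B and DB values would be looked up for each residue from the corresponding scales before smoothing along the sequence.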

"... MLR R² value represents a rigorous test for a robust, high performance hydrophobicity scale ... The rationale for such an assumption was that for all tested hydrophobicity scales, the R² values of the Multiple Linear Regression (MLR) were higher than the R² values of the simple binary regression models ... I disagree with the authors on the MLR R² rationale as I have concerns about the appropriateness of the data analysis"

The argument presented is not that the MLR regression was superior to the binary regression because of higher correlation coefficients, although possibly that may be true, but rather that the AA property class MLR represented more information in that it represented the behaviors of amino-acids in a larger series of contexts and properties. Also, the argument implied by having an MLR with the 8 property class scales is that for each context the amino-acids partition into sub-sets and that these different context sub-sets join together to determine amino-acid behavior in protein folding, interactions with water, interactions with membranes, secondary structure and electronic behaviors associated with AA-AA interactions. Moreover, we argue that, within any given regression, the higher the correlation coefficient, the stronger the evidence for the superiority of a given hydrophobicity scale within that regression criterion. Top performance of a hydrophobicity scale within several regression relationship criteria based upon different property information leads to a performance/robustness conclusion based upon the consilience of the evidence.

We use the coefficient of determination (R²) rather than the correlation coefficient because it is a much more conservative statistic than the correlation coefficient (R) and can meaningfully be interpreted as the percent linearity between the dependent variable and the X independent variable(s) in the regression. Concerns might be raised about two points in the property class MLR, which are the possibility of inter-correlation between X variables and the loss of degrees of freedom in the regression statistics (over-fitting). We deal with these concerns in several ways. First, we calculate a T test significance only for the regression itself and not for any individual X variable regression coefficient, since the Y variable regression performance itself is what is being measured. Secondly, the number of degrees of freedom (20) of the correlation coefficient T test (null hypothesis being that R=0) is reduced by 2 for each X variable coefficient in the regression and by 1 for the Y axis intercept. In the AA property class MLR we calculate the resulting degrees of freedom in the regression and correlation coefficient T test as 20-(2*8+1) = 3. We allow for some dimensionality reduction due to inter-X pair correlation as long as the dimensionality reduction (i.e. percent reduction in the number of states with discrete variables) is not large and new information is being brought forth between each pair of X variables. Where there is no new information adduced by inclusion of an X variable, we find that the magnitudes of the correlation coefficients are reduced, and further reduced by the squaring process to get the coefficient of determination. Furthermore, we demand that any set of property class vectors produce large, significant coefficients of determination with a large number of disparate AA property scales in our database. 
The latter criterion produces a very high degree of statistical confidence that the collection of property class scales can serve as a highly robust basis set for MLR vector regressions that can evaluate the robustness/significance of individual AA property scales.
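
The degrees-of-freedom bookkeeping stated above (20 reduced by 2 per X variable and by 1 for the intercept) is simple enough to check in code. The sketch below encodes that rule exactly as described in this response; the function names are hypothetical, and the t statistic shown is the usual form for testing R=0, with the reduced degrees of freedom substituted as an assumption about how the test was applied.

```python
import math


def regression_dof(n_observations, n_x_variables):
    """Degrees of freedom under the rule stated in the response:
    subtract 2 per X variable and 1 for the Y axis intercept."""
    return n_observations - (2 * n_x_variables + 1)


def correlation_t_statistic(r, dof):
    """Standard t statistic for the null hypothesis R = 0.
    Using the reduced dof here is an assumption, not a documented step."""
    return r * math.sqrt(dof / (1.0 - r * r))


dof = regression_dof(20, 8)   # 20 amino-acids, 8 property class scales
print(dof)                    # 20 - (2*8 + 1) = 3
print(correlation_t_statistic(0.95, dof))  # 0.95 is an arbitrary example R
```

With so few remaining degrees of freedom, only large correlation coefficients clear the significance threshold, which is the conservative behaviour the response appeals to.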

"... the reporting in the manuscript should be substantially improved ..."

We agree and are revising the manuscript accordingly.

"Throughout the paper the description of the MLR models is very confusing ... it is not clear what models were actually run ... what was the dependent and independent variables ... what was the estimated regression coefficients and their statistical significance ..."

We have addressed these comments to some extent above. Additionally, we will move all of the related discussion of regression methodology into the methods section, where the discussion will make more sense. The key point is that the MLR X’s are not single variables, but rather are column vectors, with each column vector having values for each amino-acid. In this context we equate the concept of a scale with the concept of a column vector. Having made this distinction, the concept of a Multi-Linear Regression (MLR) is still valid and simply represents a broader mathematical context. Our use of the MLR methodology with the 8 property class scales is to evaluate a correlation relationship as a whole, so the entire regression correlation is what is used/important and the statistical significance of individual variable regression coefficients is meaningless. If the MLR method were being used to fit a variable for the purpose of extrapolating new Y values with scale values outside of the basis set, such as with new amino-acids, then the statistical significance of variable coefficients would be germane.
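
To make the "regression as a whole" idea concrete, the sketch below regresses one scale (a Y vector) on several column vectors and reports only the overall R², ignoring the individual coefficients. It is a generic ordinary-least-squares illustration in plain Python, not the authors' software; the helper names are hypothetical and the small vectors are toy placeholders rather than amino-acid scales (which would have 20 entries each).

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def mlr_r_squared(columns, y):
    """Overall R^2 of y regressed on the given column vectors plus an intercept."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * rj for bj, rj in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Toy data: two ordinal "class" columns and one scale to be evaluated.
class_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
class_b = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
scale_y = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
print(mlr_r_squared([class_a, class_b], scale_y))
```

The normal-equation solve is adequate for a handful of well-conditioned columns; a library least-squares routine would be the more robust choice for real scale data.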

"eight property classes ... vector sets of clusters/linear families of curves in multiple linear regression relationships between two (or more) amino-acid physico-chemical properties .. which is very confusing ... seems like more independent variables are being added than the eight property class variables ..."

One must keep in mind that the MLR method that we are using uses vectors (scales) of dimensionality 20 (one value for each AA) and not individual variables. When making pairwise scatter plots between the AA property scales selected for our database, we often find clustering behavior, which indicates that the amino-acids are partitioning into distinct sub-sets, and each amino-acid can be assigned a numerical value for the ordinal number of the sub-set into which it partitions. The clustering relationships uncovered for any particular pair of AA property scales are generally driven by molecular geometry, secondary group geometry, numbers and types of atoms/inter-atomic bonding in the secondary group, molecular size, molecular mass, molecular volume, molecular surface area and entropy distribution about the molecule. No given pair of AA property scale relationships represents the full range of these molecular properties and the nature of the interaction with water and cellular membranes. The key point, though, is that there are distinct sets and sub-sets, which when numerically encoded can jointly describe each amino-acid as a row vector of some dimensionality M. We have found through extensive analysis that M is a very good size. Within each property pair scatterplot where clustering occurs, we can have different patterns such as unstructured clusters, multiple quasi-parallel linear clusters or multiple linear clusters that intersect at some point (often at glycine or alanine). Generally, there is a geometric ordering that allows the assignment of a meaningful ordinal number. To reiterate, a property class scale (column vector) represents a relationship between property scales, and the values assigned within the scale to each amino-acid are their respective sub-set/cluster ordinal numbers. 
Property class scales assembled in this way can be used as one of the basis set vectors to be used to evaluate the performance and reliability of any individual amino-acid property scale. We also need to note that the three property class vectors defined between the 3D ANOPA coordinate system planes represent the correlation based dimensionality reduction from a pattern space of dimension 49 to a pattern space of dimension 3.

"... majority of these property classes were actually ordinal variables ... more actual variables should have been added owing to the introduction of dummy variables ... sample size of these models too small to estimate model parameters precisely ... "

In fact all of the numbers within the 8 property class scales are ordinal numbers representing real facts of clustering and the distinct geometrical ordering of the clusters. The practice of assigning numbers to sets/sub-sets of objects, especially if the numbers can represent some form of ordering, is widely used in multi-variate statistical procedures such as MLR regression, discriminant analysis, Principal Component Analysis (PCA) and procedures cognate to PCA, as is found in fields such as numerical taxonomy, computational biology and other fields that numerically encode state data. The authors have enjoyed great success in using discrete data to represent state data in a number of contexts over a number of years. Since the MLR regressions are not used for predictive purposes, but rather to express the degree of correlation and relatedness as expressed in the correlation coefficient and its T test based P value, the concept of regression coefficient statistical significance is a non sequitur.

"... R² quite inflated ... over fitted model ... large number of independent variables that the authors did not adjust for compared to simple linear binary regression models ..."

The authors would point out that additional variables do not inflate R² values unless there is a very serious inter-variable correlation between a large sub-set of the X variables. To the contrary, we find, in working with the amino-acid property scales that we have assembled, that there is generally an artificial deflation of R² relationships. Therefore, we do not agree that the introduction of variables into an MLR artificially inflates an MLR correlation versus a binary regression correlation. We adjust for these concerns in three ways: use of the coefficient of determination statistic is very conservative and we screen out a number of potentially significant relationships thereby; we use a T test to assess the statistical significance of the whole regression relationship and reduce the number of degrees of freedom of the T test accordingly as described above; we uncover the specific physical meanings reflected in each property scale relationship cluster to ensure that the physical-chemical factors at play are distinct and bring new information to the table.

"... problem of multicollinearity between independent variables ... inflated MLR R² ... table 2 PC2 and PC4 had a Kendall tau coefficient of 0.82, P<0.001"

We note that property classes 1 and 2 devolve from the relationship between the hydrophobicity proclivity scale and specific absolute entropy and that property classes 3 and 4 devolve from the relationship between the hydrophobicity proclivity scale and the specific volume, as seen in figures 2 and 1 respectively. Both the specific absolute entropy and the specific volume are thermodynamic intrinsic property scales formed by the normalization of the amino-acid molecular volumes and molecular absolute entropy by their molecular masses, which creates two secondary or derived scales from fundamental molecular properties that have the molecular mass in common. Furthermore, both the total entropy and the volume are related in some way to the molecular mass, so dividing by the molecular mass in both cases offsets that dependence and leaves an intrinsic average property. Property class 2 partitions into two linear subgroups on the basis of polar and non-polar amino-acids. Property class 4 partitions into 4 linear subgroups on the basis of size, relative amounts of non-polar area and the presence/absence of a relatively strong or weak polar group. While property class 4 does reflect the polar/non-polar distinction to some extent, it does not reflect a strong and clear partition of AA by polarity or non-polarity; therefore new information is introduced by the addition of property class 4. We note that there are 2*4 = 8 possible combinations of states between property classes 2 and 4, but in fact only 5 states are actually formed, giving a dimensionality reduction of 1-5/8 = 37.5%, which we consider to be acceptable. We do not see strong enhancement of MLR correlation coefficients with any statistically significant AA property regression with the 8 property class scales. 
Given the nature of the interrelationship of fundamental molecular properties based upon AA molecular size, molecular geometry, entropy/entropy density, mass/mass density, surface area or electronic configuration/hybridization, some inter-property class correlation is inevitable and must be tolerated to make any progress in understanding amino-acid physico-chemical properties and protein folding. See the additional relevant discussion above.
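
The state-counting argument above (2*4 possible joint states, 5 observed, hence a reduction of 1-5/8 = 37.5%) can be expressed as a short Python sketch. The function name is hypothetical and the two ordinal vectors are toy placeholders constructed to reproduce the 2-level/4-level/5-state situation; they are not the actual property class 2 and 4 assignments.

```python
def state_reduction(class_a, class_b):
    """Fractional reduction in joint states between two ordinal class vectors.

    possible = (levels in a) * (levels in b); observed = distinct (a, b) pairs.
    """
    possible = len(set(class_a)) * len(set(class_b))
    observed = len(set(zip(class_a, class_b)))
    return 1.0 - observed / possible


# Toy ordinal assignments: 2 levels in the first class, 4 in the second,
# with only 5 of the 8 possible combinations actually occurring.
toy_class_2 = [1, 1, 1, 2, 2, 2, 2, 2]
toy_class_4 = [1, 2, 2, 2, 3, 3, 4, 4]

print(state_reduction(toy_class_2, toy_class_4))  # 1 - 5/8 = 0.375
```

Two fully redundant class vectors would occupy only as many joint states as a single vector has levels, giving a much larger reduction, while two independent vectors that fill every combination would give zero.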

"... main result of this paper ... based upon assumption that amino acid property classes forming linear clusters ... within context of folded proteins ... is not a valid assumption"

We agree that the 8 property class scales are one of the major results of this study, but we do not agree that it is the only one. We do not assume that the 8 amino-acid property class scales are a major determinate of protein folding per se, but rather we assume the fundamental molecular properties of each amino-acid in the context of a folded protein environment or the context of contact with a water environment are what determine protein folding. The assemblage of the 8 amino-acid property class scales that we report reflects a wide spread clustering pattern (sometimes in linear series) in 2D scatterplots and 3D ANOPA scatterplots between two or more scales. The 8 amino-acid property class scales we have assembled for our present study reflect the best of the clustering patterns forming unique subsets of amino-acids in different contexts/molecular properties with regard to hydrophobicity scales, but also with a number of other amino-acid physico-chemical property scales, secondary structure statistical propensity scales and molecular property scales. To the extent that any given hydrophobicity scales correlate with, and were determined from folded proteins, reflecting secondary/super-secondary structure of folded proteins, then the 8 property classes that we report are relevant to folded proteins. The 8 property classes that we report have statistically significant (P<0.05) MLR regressions with 124 amino-acid property scales in our assembled database, which implies that the amino-acid clustering behaviors reflected in these numerical scales can largely be reproduced with the clustering behaviors of the amino-acids. 
While the idea of linking of hydrophobicity to protein folding is an important idea, the finding that nature has selected the 20 natural amino-acids on the basis of distinct sub-sets/clustering in several different joint property dimensions is a significant finding in and of itself, perhaps leading to criteria for the engineered selection of un-natural amino-acids that could provide unique properties to active enzymes/proteins.

"... R² is an overused statistic for linear regression analysis and additional metrics are required to get the whole picture ...R² actually represents the square of the Pearson correlation coefficient"

We do not agree with the contention that the R² statistic is trivial and meaningless from overuse. Rather we point out that R² is a more conservative statistic than is the correlation coefficient and rolls off much more quickly, as well as having an important interpretation as the percent linearity that directly indicates the relative scatter of the regression calculated Y’s with respect to the actual Y’s.

"... A reader should know precisely which scatter plots were screened for a 'linear-cluster' pattern ... which hydrophobic scales and other physico-chemical properties of amino-acids were used for the scatter plots ... how many scales used"

There were approximately 175 amino-acid property scales used for our study, of which maybe 15-20 were derivative scales prepared from ratios of fundamental molecular properties such as volume and mass, which would represent intrinsic versus extrinsic scales in the thermodynamical sense. About another 10-15 scales represent averages of literature reported amino-acid property scales. Please note that linear clusters were not the other patterns seen. There were some non-linear patterns and simple clusters for example. Also note that the scales selected represent a very wide range of amino-acid properties including secondary structure statistical propensity scales, fundamental molecular properties like surface area/mass/volume, bulk properties such as index of refraction/melting point/pK-C, HPLC retention times, free energy in water, solvent/water partitioning, average fraction occurring in proteins, average amino-acid burial in proteins/amino-acid exposure wot water in folded proteins, hydrophobicity scales, NMR parameters, Rf parameters, several literature Principal Component Analysis/Factor Analysis parametric scales and other miscellaneous property scales. Of the 175 amino-acid property scales (and about ~200 total scales) 124 scales had statistically significant (P<0.05) MLR regression correlation with the 8 property classes that we have reported.

"... the eight amino-acid property classes most important novelty of this paper ... what is the reasoning that AA property classes represent real AA physico-chemical properties in the context of folded proteins ... rather was the assumption that high R² from MLR justifies assertion of relatedness to physico-chemical properties ... a high R² not a valid basis for assumption ..."

We agree that the 8 amino-acid property scales are novel and an important finding. We disagree with the comments on the MLR regression correlation and assert that high MLR R² values are both statistically and physically valid measures for reasons discussed above. The high and statistically significant coefficients of determinations are spread over many types of amino-acid physico-chemical property scales, including structural statistical propensity scales and hydrophobicity scales. We find that there is a strong relationship between amino-acid HPLC retention times and a number of the higher quality hydrophobicity scales in our assembled database, hence there is a direct link between a measureable amino-acid bulk property and amino-acid partitioning behavior in the structure of folded proteins.

"... the authors should describe the method that they used to identify linear clusters on a plot ... follow up analysis of physico-chemical/biochemical properties of clusters ... regression analysis ..."

We did describe most of the points commented here in the original manuscripts, but agree that the descriptions need to be expanded upon. There were a number of linear clustering patterns, amongst other patterns, that can be described as multiple quasi-parallel series, multi-linear series that intersect at a given amino-acid serving as a quasi or virtual origin (e.g. Karplus 1997), quasi-parallel cross hatched linear patterns (e.g. figures 1 & 2 in the manuscript). We ran, although we did not report a regression correlation coefficient on each series to assess its relative linearity. Where we found what we discerned as significant relationships, we evaluated those putative relationships from the point of view of physical reality as expressed by fundamental molecular properties and potential relationships with water, especially with aqueous clathrate membranes.

" ... how did the authors end up with the final set of scatter plots from which AA property classes were derived ... were they linear clusters ... did authors select scatterplots on basis of plotted variable relevance ... what was the criteria used to select most relevant scatterplots ..."

Our primary and starting premise was to look for single linear or non-linear patterns in all amino-acids in the scatterplot of any two given scales. Along the way we found that multiple linear series occur in scatterplots between many pairs of amino-acid properties, which was a pattern we could not ignore. Many of these multi-family linear series ranged from clearly discernible to very high quality; the latter end of the range is what we concentrated on. We also cross checked to see if sister AA property scales in the same class, such as hydrophobicity, resulted in similar patterns, such as those we see in figures 1 and 2 and in table 2 of the manuscript. Once the linear/non-linear (single and multi-family) patterns were found, a detailed review of these patterns was made to find underlying physico-chemical and biological reasons as well as statistical generalizations. The most relevant scatter plots were selected based upon their quality (visual and linear/non-linear regression), their explanatory power for protein structure and function, and their reliability/specificity for protein alignments.

"... all property classes including #6, #7 & #8 should be precisely defined ... derived from 49 fundamental amino-acid properties/derived scales ... Analysis of Patterns (ANOPA) ... need expanded description ..."

We have defined property classes #1 through #5, but we will expand upon these definitions. We will define property classes #6-#8 with plots and physical explanations. We will describe the ANOPA procedure and explain the specific 49 amino-acid property scale ANOPA analysis used in this study. The 49 amino-acid property scales used in the ANOPA analysis are property scales gleaned from the literature, and no derivative scales were used for this portion of the overall analysis. We will add two new tables to define the 49 amino-acid properties used in the ANOPA analysis and the 124 statistically significant (by regression correlation) amino-acid property scales.

" ... show scatter plots for property classes #5 through #8 ..."

We agree to revise the manuscript accordingly.

"... need expanded introduction ... expand more on AA physico-chemical properties relevant to protein folding ..."

We agree to modify the manuscript according to these comments, except that we note that the scope of this paper is to define our hydrophobicity scale and its application to protein alignments, whereas we take up the challenge of applying this work to protein folding with an extensive analysis in our follow up paper “Hydrophobicity Revisited: A Molecular Story.” This latter manuscript is in an advanced state of preparation and can be provided for review, which is recommended since the treatment of the material in that manuscript is well beyond what we can put into the current manuscript you have reviewed. We will add two new tables to define the 49 amino-acid properties used in the ANOPA analysis and the 124 statistically significant (by regression correlation) amino-acid property scales.

"... there is material in the results section (first paragraph) and discussion section (alignment matrices) that belong in the introduction section ..."

We will revise the manuscript per these comments.

"... hydrophobic scale chosen as optimal ... normalized average of 3 hydrophobicity scales ... most robust in correlation analysis ... robustness vaguely defined in methods section as association to multiple, fundamental AA properties using multivariate statistical procedures, thermodynamics and biophysical chemistry considerations ... later find out that this means R² values derived from MLR models ..."

We have expanded upon what this means in our responses to your review. We will incorporate this material into the manuscript and significantly beef up the methods section to accommodate the spirit of your comments. We also note that the statistical correlations were only part of the rationale for defining a “robust” hydrophobicity scale, which is based upon bringing a coherent theoretical analysis to bear upon this work as well. We will be adding some additional material requested by Dr. Carter that should speak to these comments. We offer to make available for your review the TMATCH theory and application papers to provide additional justification for what constitutes a robust hydrophobicity scale as we use the term.

"The methods section needs to be written more clearly"

We agree and will modify the manuscript accordingly.

"... Need better tag description for table 2 owing to the considerable discussion of table two columns in the manuscript ..."

We agree and will modify the manuscript accordingly.
Competing Interests: No competing interests were disclosed.

Reviewer Report 11 Jul 2016

Charles Carter, Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

Approved with Reservations

https://doi.org/10.5256/f1000research.6806.r14648

Peer Review Oath: I will be an ambassador for open science. I have benefited substantively from open reviews on several previous occasions, so I believe in its value. I will endeavor to be constructive, while at the same time remaining true to my own scientific values.

Review
This manuscript addresses a worthy problem: improving multiple sequence alignment via the use of enhanced amino acid similarity metrics would enhance our ability to draw inferences from sequences of proteins whose structures, were they known, would establish homology, but which owing to divergence have unrecognizably homologous sequences. It seems almost certain that we should be able to do a better job at homology searches if we knew more about how amino acid physical chemistry leads to protein structure. It was for this reason that I agreed to review this manuscript based on the abstract.
The authors allude to work they have done that demonstrates the value of the new scales they describe here, but there is essentially no coverage of this central question in this manuscript, which is disappointing and detracts substantially from the value of the paper.
The device advocated by the authors is a neologism they call a “hydrophobic proclivity index”. This index is the result of statistical modeling from a variety of different scales of what has been called “hydrophobicity” and their derivatives (with respect to which variables is not described) in order to maximize agreement between the scale and calculations of the exposure of each of the twenty amino acids in folded proteins. The resulting presentation is interesting and potentially relevant, but is deficient in its citation of the literature, and in results indicating either their methods or the results to which they allude. I conclude that although the work described is well-motivated, and may lead to better homology searches, it nevertheless suffers from a variety of methodological and conceptual problems that may in the end compromise the work quite seriously. These are summarized below.

The data base:
The quest for a single “predictor” for the degree to which each amino acid is exposed on average in folded proteins has a long history. The authors have cited just about every previous attempt to correlate the two variables, but have excluded the one set of experimental data representing the actual physical chemistry of the twenty amino acid side chains, the vapor to water and water to cyclohexane distributions of side chain mimics measured and re-measured by Wolfenden’s group ¹⁻⁵. Wolfenden has argued persuasively that octanol is a very unsatisfactory reference solvent for a variety of reasons, in part because of the ability of side chains to bring variable amounts of bound water into it from aqueous solution.
Omitting the Wolfenden free energies is a grave oversight, because it means that the regression analyses they describe are looking for signal in a variety of data sets that have already been corrupted by similar unsuccessful attempts by previous investigators who have kludged the extant variety of multiple scales. For that reason, any useful result the present authors may have achieved is likely to be idiosyncratic and only indirectly based on physical chemistry. Moreover, the authors provide no evidence of statistical tests that might suggest significance, and the correlations they describe, some of which are more impressive than others, are very likely to be successful only in proportion to the number of parameters from which their models are built and, I suspect, of somewhat circular logic.

Relating protein structure to amino acid physical chemistry is very probably multi-dimensional.
It is very probable that the inability of previous researchers to arrive at a single scale that predicts the accessible surface area in folded proteins arises because the problem itself is multi-dimensional. The authors describe a variety of classification schemes derived from attempts to rationalize scatterplots of amino acid properties. Indeed, they mention that one useful additional classification is likely related to the size of the side chain.

Recommendation: The authors should read carefully the papers from Wolfenden and Carter ¹,⁶,⁷ in which those authors describe first the correlation between the free energies of vapor to water distribution coefficients and amino acid side chain volume, and second, their success in predicting Moelbert’s accessible surface areas using a two-dimensional coordinate system whose axes are the free energies, respectively, of water to cyclohexane and vapor to cyclohexane partition coefficients.

In conclusion, what might be of interest in this paper is the TMATCH algorithm and the improvements it brings to homology searches. That is not described at all. Instead, there are a variety of descriptions of how a multitude of idiosyncratic hydrophobicity scales describing amino acid physical chemistry, notably excluding the (only) authentic ones, might be combined into one that predicts exposed accessible surface area by an algorithm that essentially produces a linear combination that is correlated with ASA by hidden, but nevertheless circular reasoning.

References

1. Wolfenden R, Lewis CA, Yuan Y, Carter CW: Temperature dependence of amino acid hydrophobicities. Proc Natl Acad Sci U S A. 2015; 112 (24): 7484-8
2. Wolfenden R: Experimental measures of amino acid hydrophobicity and the topology of transmembrane and globular proteins. J Gen Physiol. 2007; 129 (5): 357-62
3. Gibbs P, Radzicka A, Wolfenden R: The anomalous hydrophilic character of proline. Journal of the American Chemical Society. 1991; 113 (12): 4714-4715
4. Radzicka A, Wolfenden R: Comparing the polarities of the amino acids: side-chain distribution coefficients between the vapor phase, cyclohexane, 1-octanol, and neutral aqueous solution. Biochemistry. 1988; 27 (5): 1664-1670
5. Wolfenden RV, Cullis PM, Southgate CC: Water, protein folding, and the genetic code. Science. 1979; 206 (4418): 575-7
6. Carter CW, Wolfenden R: tRNA acceptor-stem and anticodon bases embed separate features of amino acid chemistry. RNA Biol. 2016; 13 (2): 145-51
7. Carter CW, Wolfenden R: tRNA acceptor stem and anticodon bases form independent codes related to protein folding. Proc Natl Acad Sci U S A. 2015; 112 (24): 7489-94

Competing Interests: No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Author Response 15 Oct 2020

David Cavanaugh, Benchmark Electronics, Huntsville, Alabama, USA

Charlie Carter

"... We should be able to do a better job at homology searches if more about how amino-acid physical chemistry leads to protein structure ... the authors allude to work they have done that demonstrates the value of the new scales they describe here, but there is essentially no coverage of this central question in this manuscript, which is disappointing and detracts substantially from the value of the paper"

This original work is described in 3 manuscripts, including this present paper. The other two manuscripts (theory and applications/results) describing the TMATCH alignment algorithm are almost ready for publication. These three manuscripts were all part of a single monograph that we had decided to partition owing to length and the fact that the physical chemistry considerations of the current manuscript would be inappropriate for a bioinformatics journal, and we needed to get this manuscript published to support another published paper (also available for review). We will incorporate into the present manuscript an appropriate level description of the TMATCH algorithm and actual results from these two manuscripts.

A fourth manuscript goes significantly more deeply into the biophysics and thermodynamics of hydrophobicity and the broader justification for our hydrophobicity proclivity scale from these two points of view. We would like to provide you a copy of these three other manuscripts, which are in an advanced state of preparation, in answer to a number of your points. We agree with you that more summary material needs to be brought into this manuscript.

We point out the excellent linear relationship between our hydrophobicity proclivity scale and the dpx average residue depth (from water) scale in figure 5 of the manuscript. The dpx scale is derived from a suite of proteins with actual solved folded structures, hence dpx is in effect a protein structural description of hydrophobicity where the deeper on average that a residue is buried, the more hydrophobic it is. Contrast the dpx concept with the ideas of average solvent exposed area, average buried area, percent buried, etc. When the relationships in figures 3-5 (with supporting literature) are taken together, the hydrophobicity potential driving protein folding is suggested to be the traditional solvent-solvent partitioning model of protein folding where amino-acids partition between aqueous exposure and burial within the hydrophobic core of folded proteins. We believe that this traditional model of the primary driving force for globular protein folding is apt, but needs some updates and modifications that we develop in detail in the “Hydrophobicity Revisited: A Molecular Story” manuscript. We suggest in the present manuscript that aqueous clathrates form about the hydrophobic area of amino-acids in an unfolded state (a structural feature) and that hydrophobic surface area on folded proteins possesses a surface tension like a gas bubble in water. We are suggesting that the physics driving the initial stages of protein folding is essentially the same physics as the coalescing/condensing of bubbles of non-polar species or gas in water, which causes the mutual attraction of residues based upon the residue hydrophobic area with associated aqueous clathrate. We expand upon this hypothesis in “Hydrophobicity Revisited: A Molecular Story.”

Our aim was to try to develop a scale which reliably represented the central hydrophobic tendency (average) as a central/first order effect that allowed a simple but meaningful paired comparison of amino-acids for homology relationships. Adding additional variables/properties would in our opinion have detracted from our goals of simplicity of calculation and utilization of what we believe is the primary first order effect driving protein folding.

"... this index is the result of statistical modeling from a variety of different scales of what has been called 'hydrophobicity' and derivatives ... derivative variables not described ... presentation is interesting and potentially relevant, but is deficient in its citation of the literature and in results indicating their methods or results to which they allude ... well motivated ... suffers from a variety of methodological and conceptual problems ..."

We agree that an expansion of what we meant by derivative variables should be included in the manuscript.

The derivative variables referred to in the manuscript are primarily ratios of molecular properties or of directly measurable bulk properties of amino-acids. In this way we derive normalized intrinsic thermodynamical properties versus extrinsic/bulk properties. For example, we find that the ratio of amino-acid surface area to volume has a relationship to hydrophobicity.

Other derivative variables we used could be exemplified by estimating the water cavity volume/surface area using amino-acid volume and surface areas and some assumptions about the aqueous cavity geometry.

We will be expanding our reference list of amino-acid scales utilized to better define the amino-acid property scales we utilized in addition to hydrophobicity scales.

We will also incorporate the work of Wolfenden et al. into the manuscript.

"... the authors have cited just about every effort to correlate a single hydrophobicity predictor with amino-acid surface exposure in folded proteins ... but have excluded an important hydrophobicity metric derived from classical physical chemistry methods by Wolfenden and co-workers using water/vapor and water/cyclohexane partition distributions of amino-acid side chain mimics ... Wolfenden has persuasively argued that n-octanol is a very unsatisfactory reference solvent for multiple reasons such as the ability of side chains to drag non-repeatable amounts of water into the n-Octanol solvent ... "

Since this experimental approach is unique, we will add material to the manuscript to incorporate these considerations, the work of Wolfenden et al. and the implications of that work.

We agree that the standardization of water concentration within a polar solvent (immiscible with water) capable of hydrogen bonding is difficult, in addition to the uncontrolled/variable amount of water dragged into the organic solvent phase in solvent/water partition experiments. We cover these issues at length in “Hydrophobicity Revisited: A Molecular Story,” but we agree that these considerations should be covered to some extent in the present manuscript. We also point out that solvent/water concentration ratios used to calculate free energies of transfer from water to the non-polar phase as a hydrophobicity measure relevant to protein folding suffer a systematic error if the organic solvent is incapable of hydrogen bonding as is the hydrophobic “solvent like” core of folded globular proteins. Our hydrophobicity proclivity scale has no units, but rather is a normalized proclivity. We do show in figure 4 of the manuscript that our hydrophobicity proclivity scale is directly relatable to free energy of solvent-solvent transfer using a popular dGow (water to n-octanol) scale.

We also point out that characterizing the amino-acid secondary groups alone does not treat the important contribution of peptide bonds to the free energetics of folding, such as is accounted for by some researchers by treating the 20 amino-acids as guests in the center of tri-peptides in octanol to water free energy studies.

"... omitting the Wolfenden free energies is a grave oversight ... are looking for signal in a variety of data sets that have already been corrupted by similar unsuccessful attempts by previous investigators who have kludged the extant variety of multiple scales ... any useful result the present authors may have achieved is likely to be idiosyncratic and only indirectly based upon physical chemistry ..."

We will include and discuss the work of Wolfenden et al. due to its novelty and the additional insight that it provides.

We largely agree with these statements by Dr. Carter regarding the large profusion of hydrophobicity scales in the literature. Our criticisms are a bit more muted in that we recognize and discuss some of the difficulties in trying to reflect the wide range of amino-acid behaviors and the difficulties inherent in trying to define hydrophobicity as an experimentally measurable concept related to the driving forces of protein folding.

We have relied upon other experimentally measured amino-acid properties to cross check our hydrophobicity proclivity scale using methods such as Multiple Linear Regression (MLR) as can be seen in table 3. For example, we can draw a good linear correlation between certain hydrophobicity scales, such as our hydrophobicity proclivity scale, and amino-acid reverse phase (C₁₈ column) HPLC retention times (other researchers have drawn this conclusion as well).

"... The authors provide no evidence of statistical tests that might suggest significance ..."

We will provide both an F test from the MLR software we used and a Student’s t test of the MLR (8 amino-acid property class scales) correlation coefficient R. We will show these results in a new table with amino-acid properties versus the R², t and F significances. The MLR is significant where both the t test and the F test are significant at an alpha of 0.05 or better (actual alpha = 0.05 × 0.05 = 0.0025 or better). These statistical significance tests are on the MLR results and not the individual MLR coefficients.
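The two test statistics named above can be sketched as follows. This is an illustration under stated assumptions, not the output of the authors' MLR software: the function names are hypothetical, 20 observations (one per amino acid) and 8 predictors are assumed, the R² of 0.90 is a placeholder, and the t statistic uses the simple-correlation form with n − 2 degrees of freedom because the response does not state which form was used. The p-values themselves would come from F and t distribution tables or a statistics package, and are taken as inputs here.

```python
import math

def mlr_f_statistic(r_squared, n_obs, n_predictors):
    """Overall F statistic of a multiple linear regression, computed from R^2."""
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    if df_resid <= 0:
        raise ValueError("need more observations than predictors + 1")
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must lie in [0, 1)")
    f_value = (r_squared / df_model) / ((1.0 - r_squared) / df_resid)
    return f_value, df_model, df_resid

def correlation_t_statistic(r, n_obs):
    """t statistic for a correlation coefficient r with n - 2 degrees of freedom.

    Assumption: the simple-correlation form is used for illustration.
    """
    df = n_obs - 2
    if df <= 0 or not -1.0 < r < 1.0:
        raise ValueError("need n_obs > 2 and |r| < 1")
    return r * math.sqrt(df / (1.0 - r * r)), df

def both_significant(p_f, p_t, alpha=0.05):
    """Decision rule described in the response: both tests must pass at alpha."""
    return p_f <= alpha and p_t <= alpha

# Placeholder inputs: 20 amino acids, 8 property class scales, illustrative R^2.
f_value, df_model, df_resid = mlr_f_statistic(0.90, n_obs=20, n_predictors=8)
t_value, df_t = correlation_t_statistic(math.sqrt(0.90), n_obs=20)
```

With only 11 residual degrees of freedom in this setting, the F statistic is sensitive to the number of predictors, which is why reporting it alongside R² addresses the reviewer's concern more directly than R² alone.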

"... the correlations they describe ... are very likely to be successful only in proportion to the number of parameters from which their models are built ... suspect some circular logic ..."

The derivative scales as discussed above were derived from fundamental molecular properties or experimentally measured, so there is no concern on that score.

The row scale property relationships with the properties in columns 1, 2 and 4 of table two reflect paired comparisons of independent amino-acid scales.

The eight property class scales in column three of table 3 reflected in MLR pair wise comparisons in column two are independent, although 4 of the amino-acid property class scales are also reflected in columns 1, 2 and 4 of table 3. The other 4 amino-acid property class scales come from separate relationships as described in the manuscript. The last 3 amino-acid property class scales in table 4, derived from an ANOPA analysis, are based upon 49 amino-acid property scales, none of which appear in the row property scales, and thus are independent.

The 8 property class scales in table 4 result in statistically significant MLR relationships with about 50 amino-acid property scales not included in the ANOPA analysis of 49 amino-acid property scales, but the MLR back check of these 49 amino-acid property scales resulted in most of these 49 amino-acid property scales having statistically significant results.

Overall, about 150 amino-acid property scales were evaluated in the work.

"... probable that the inability of previous researchers to arrive at a single scale that predicts the accessible surface area in folded proteins arises because the problem itself is multi-dimensional ... authors describe a variety of classification schemes derived from attempts to rationalize scatterplots of amino acid properties ... authors mention useful additional classification likely related to size of AA side chain ..."

We agree that there are multiple factors involved with the free energetics of protein folding and that the available amino-acid property scales and hydrophobicity scales measure different aspects of the relationship of amino-acids to each other and to water. We believe that we should be able to relate fundamental molecular properties of amino-acids and water into a meaningful larger picture.

In the “Hydrophobicity Revisited: A Molecular Story” manuscript we demonstrate that the controlling factors for our hydrophobicity proclivity energy scale (regression corrected dGow) are: #H-bonds, polar surface area, non-polar surface area, diameter of the side chain, length of the side chain, residue volume, polar area/volume and surface area/entropy. The MLR regression coefficient of determination is 99.995% and is F test significant at 6.4 × 10⁻¹⁵.

However, the point of the present work is to find a single scale that measures a central tendency of “hydrophobicity” with the assumption that residue hydrophobicity (the contrast with water) is the dominant relationship between amino-acids for the purposes of protein alignments.

"... Read Wolfenden and Carter ... AA free energies of vapor to water partition vs. side chain volume ... Wolfenden & Carter accurately predicts Moelbert ASA with 2D plot/linear regression with AA water/vapor and AA water/cyclohexane partition coefficients ..."

Agreed. We will cover this material at an appropriate level of detail given the objective of the present manuscript.

"... need description of TMATCH and homology searches ... use multitude of idiosyncratic hydrophobicity scales describing AA physical chemistry ... notably excluding the only reliable/authentic physical chemistry scale describing AA by Wolfenden and Carter ... concern that the hydrophobicity proclivity scale produces a linear combination that correlates well with the Moelbert ASA scale by hidden, but still circular reasoning ..."

We will provide the TMATCH theory and applications manuscripts to supplement the material available for review.

We will also include an appropriate amount of material from these two TMATCH manuscripts into the current manuscript.

The Juretic average, Rose percent buried and Neumaier X hydrophobicity scales are independent of each other and the Moelbert ASA scale, therefore there can be no circular or self-referential reasoning/logic involved. Our point is that these three hydrophobicity scales are the best performers that we found and that a normalized average of these three scales will be a top performer as a robust, central tendency hydrophobicity scale.
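The "normalized average of three scales" idea can be sketched as follows. This is a minimal illustration under assumptions: the response does not state which normalization was used, so min-max scaling to the interval [0, 1] is assumed; every input scale is assumed to be oriented so that larger values mean more hydrophobic; and the function names and the three-residue values are hypothetical placeholders, not the Juretic, Rose or Neumaier scales.

```python
def min_max_normalize(scale):
    """Rescale one amino-acid scale to [0, 1] (assumed normalization)."""
    lo, hi = min(scale.values()), max(scale.values())
    if hi == lo:
        raise ValueError("a constant scale cannot be min-max normalized")
    return {aa: (value - lo) / (hi - lo) for aa, value in scale.items()}

def normalized_average(scales):
    """Average several scales residue by residue after normalizing each one."""
    normalized = [min_max_normalize(scale) for scale in scales]
    residues = set(normalized[0])
    for scale in normalized[1:]:
        if set(scale) != residues:
            raise ValueError("all scales must cover the same residues")
    return {
        aa: sum(scale[aa] for scale in normalized) / len(normalized)
        for aa in sorted(residues)
    }

# Placeholder values for three residues only (NOT published scale values).
scale_a = {"A": 0.0, "L": 10.0, "K": 5.0}
scale_b = {"A": 1.0, "L": 3.0, "K": 1.5}
scale_c = {"A": -2.0, "L": 2.0, "K": -1.0}
combined = normalized_average([scale_a, scale_b, scale_c])
```

Normalizing before averaging keeps a scale with a wide numeric range from dominating the result, which is one reason a normalized average can serve as a central-tendency scale when the inputs carry different units.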

As seen above from the results in the “Hydrophobicity Revisited: A Molecular Story” manuscript, our hydrophobicity proclivity scale can be partitioned into the sums of distinct amino-acid molecular properties and as such the scale is directly relatable to first principles, which by definition are not circular in nature.

We believe we have demonstrated that our hydrophobicity scale embodies the central tendency (i.e. first order effect) of the hydrophobicity phenomenon.

We believe that the manuscript improvements requested by Dr. Jeroncic will obviate any additional concerns with circular reasoning/analysis.

"Seven references to papers where Wolfenden and/or Carter are one of the authors"

Agreed. Thanks.
Charlie Carter

"... We should be able to do a better job at homology searches if more about how amino-acid physical chemistry leads to protein structure ... the authors allude to work they have done that demonstrates the value of the new scales they describe here, but there is essentially no coverage of this central question in this manuscript, which is disappointing and detracts substantially from the value of the paper"

This original work is described in 3 manuscripts, including this present paper. The other two manuscripts (theory and applications/results) describing the TMATCH alignment algorithm are almost ready for publication. These three manuscripts were all part of a single monograph that we had decided to partition up owing to length and the fact that the physical chemistry considerations of the current manuscript would be inappropriate for a bioinformatics journal and we needed to get this manuscript published to support another published paper (also available for review). We will incorporate into the present manuscript an appropriate level description of the TMATCH algorithm and actual results from these two manuscripts

A fourth manuscripts which goes significantly more deeply into the biophysics and thermodynamics of hydrophobicity and the broader justification for our hydrophobicity proclivity scale from these two points of view. We would like to provide you a copy of these three other manuscripts, which are in an advanced state of preparation, in answer to a number of your points. We agree with you that more summary material needs to be brought into this manuscript.

We point out the excellent linear relationship between our hydrophobicity proclivity scale and the dpx average residue depth (from water) scale in figure 5 of the manuscript. The dpx scale is derived from a suite of proteins with actual solved folded structures, hence dpx is in effect a protein structural description of hydrophobicity where the deeper on average that a residue is buried, the more hydrophobic it is. Contrast the dpx concept with the ideas of average solvent exposed are, average buried area, percent buried, etc. When the relationships in figures 3-5 (with supporting literature) are taken together, the hydrophobicity potential driving protein folding is suggested to be the traditional solvent-solvent partitioning model of protein folding where amino-acids partition between aqueous exposure and burial within the hydrophobic core of folded proteins. We believe that this traditional model of the primary driving force for globular protein folding is apt, but needs some updates and modifications that we develop in detail in the “Hydrophobicity revisited: A Molecular Story” manuscript. We suggest in the present manuscript that aqueous clathrates form about the hydrophobic area of amino-acids in an unfolded state (a structural feature) and hydrophobic surface area on folded proteins possess a surface tension like a gas bubble in water. We are suggesting that the physics driving the initial stages of protein folding is the same physics as the coalescing/condensing of bubbles of non-polar species or gas in water are essentially the same physics which cause the mutual attraction of residues based upon the residue hydrophobic area with associated aqueous clathrate. We expand upon this hypothesis in “Hydrophobicity Revisited: A Molecular Story.”

Our aim was to try to develop a scale which reliably represented the central hydrophobic tendency (average) as a central/first order effect that allowed a simple but meaningful paired comparison of amino-acids for homology relationships. Adding additional variables/properties would in our opinion have detracted from our goals of simplicity of calculation and utilization of what we believe is the primary first order effect driving protein folding.

"... this index is the result of statistical modeling from a variety of different scales of what has been called 'hydrophobicity' and derivatives ... derivative variables not described ... presentation is interesting and potentially relevant, but is deficient in its citation of the literature and in results indicating their methods or results to which they allude ... well motivated ... suffers from a variety of methodological and conceptual problems ..."

We agree that an expansion of what we meant by derivative variables should be included in the manuscript.

The derivative variables referred to in the manuscript are primarily ratios of molecular properties or of directly measurable bulk properties of amino-acids. In this way we derive normalized intrinsic thermodynamic properties rather than extrinsic/bulk properties. For example, we find that the ratio of amino-acid surface area to volume has a relationship to hydrophobicity.

Another example of a derivative variable is the water cavity volume/surface area, estimated from amino-acid volumes and surface areas together with some assumptions about the aqueous cavity geometry.
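As a minimal illustration of how such a derivative variable could be formed, assume (purely for this sketch, not as the geometry used in the manuscript) that the aqueous cavity is a sphere whose radius r is set by the amino-acid volume V. The symbols r, V, A_cav and V_cav are introduced here for illustration only.

```latex
\[
r = \left(\frac{3V}{4\pi}\right)^{1/3}, \qquad
A_{\mathrm{cav}} = 4\pi r^{2}, \qquad
V_{\mathrm{cav}} = \tfrac{4}{3}\pi r^{3}, \qquad
\frac{A_{\mathrm{cav}}}{V_{\mathrm{cav}}} = \frac{3}{r}.
\]
```

Under this assumption the area-to-volume ratio depends only on the cavity radius and decreases as the cavity grows, which is the sense in which a ratio normalizes a size-dependent property.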

We will be expanding our reference list of amino-acid scales utilized to better define the amino-acid property scales we utilized in addition to hydrophobicity scales.

We will also incorporate the work of Wolfenden et al. into the manuscript.

"... the authors have cited just about every effort to correlate a single hydrophobicity predictor with amino-acid surface exposure in folded proteins ... but have excluded an important hydrophobicity metric derived from classical physical chemistry methods by Wolfenden and co-workers using water/vapor and water/cyclohexane partition distributions of amino-acid side chain mimics ... Wolfenden has persuasively argued that n-octanol is a very unsatisfactory reference solvent for multiple reasons such as the ability of side chains to drag non-repeatable amounts of water into the n-Octanol solvent ... "

Since this experimental approach is unique, we will add material to the manuscript to incorporate these considerations, the work of Wolfenden et al. and the implications of that work.

We agree that the standardization of water concentration within a polar solvent (immiscible with water) capable of hydrogen bonding is difficult, in addition to the uncontrolled/variable amount of water dragged into the organic solvent phase in solvent/water partition experiments. We cover these issues at length in “Hydrophobicity Revisited: A Molecular Story,” but we agree that these considerations should be covered to some extent in the present manuscript. We also point out that solvent/water concentration ratios used to calculate free energies of transfer from water to the non-polar phase, as a hydrophobicity measure relevant to protein folding, suffer a systematic error if the organic solvent, unlike the hydrophobic “solvent like” core of folded globular proteins, is incapable of hydrogen bonding. Our hydrophobicity proclivity scale has no units, but rather is a normalized proclivity. We do show in figure 4 of the manuscript that our hydrophobicity proclivity scale is directly relatable to the free energy of solvent-solvent transfer using a popular dGow (water to n-octanol) scale.

We also point out that characterizing the amino-acid secondary groups alone does not treat the important contribution of peptide bonds to the free energetics of folding, which some researchers account for by treating the 20 amino-acids as guests in the center of tri-peptides in octanol to water free energy studies.

"... omitting the Wolfenden free energies is a grave oversight ... are looking for signal in a variety of data sets that have already been corrupted by similar unsuccessful attempts by previous investigators who have kludged the extant variety of multiple scales ... any useful result the present authors may have achieved is likely to be idiosyncratic and only indirectly based upon physical chemistry ..."

We will include and discuss the work of Wolfenden et al. due to its novelty and the additional insight that it provides.

We largely agree with these statements by Dr. Carter regarding the large profusion of hydrophobicity scales in the literature. Our criticisms are a bit more muted in that we recognize and discuss some of the difficulties in trying to reflect the wide range of amino-acid behaviors and the difficulties inherent in trying to define hydrophobicity as an experimentally measurable concept related to the driving forces of protein folding.

We have relied upon other experimentally measured amino-acid properties to cross check our hydrophobicity proclivity scale using methods such as Multiple Linear Regression (MLR) as can be seen in table 3. For example, we can draw a good linear correlation between certain hydrophobicity scales, such as our hydrophobicity proclivity scale, and amino-acid reverse phase (C₁₈ column) HPLC retention times (other researchers have drawn this conclusion as well).

"... The authors provide no evidence of statistical tests that might suggest significance ..."

We will provide both an F test from the MLR software we used and a Student’s t test of the MLR (8 amino-acid property class scales) correlation coefficient R. We will show these results in a new table with amino-acid properties versus the R², t and F significances. The MLR is significant where both the t test and the F test are significant at an alpha of 0.05 or better (actual alpha = 0.05 × 0.05 = 0.0025 or better). These statistical significance tests are on the MLR results and not on the individual MLR coefficients.
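A minimal sketch of the arithmetic behind these tests follows. It uses the textbook overall F statistic for a regression with an intercept; it is not the output of the MLR software referred to above. The function names, and the example values of R² = 0.5 and 20 observations (one per standard amino-acid), are assumptions made only for illustration. The product rule for the combined alpha holds only if the two tests are treated as independent.

```python
def mlr_f_statistic(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Overall F statistic for a multiple linear regression with an intercept.

    F = (R^2 / k) / ((1 - R^2) / (n - k - 1)), where k is the number of predictors.
    """
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must lie in [0, 1)")
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    if df_resid <= 0:
        raise ValueError("need more observations than fitted parameters")
    return (r_squared / df_model) / ((1.0 - r_squared) / df_resid)


def joint_alpha(alpha_t: float, alpha_f: float) -> float:
    """Combined significance level when both tests must pass.

    Assumption: the two tests are treated as independent, so the levels multiply.
    """
    return alpha_t * alpha_f


# Hypothetical example: 8 property class scales fitted to 20 amino-acids.
f_value = mlr_f_statistic(r_squared=0.5, n_obs=20, n_predictors=8)
combined = joint_alpha(0.05, 0.05)
```

The F value would then be compared against the F distribution with 8 and 11 degrees of freedom to obtain a significance level; that lookup is left out here to keep the sketch free of external libraries.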

"... the correlations they describe ... are very likely to be successful only in proportion to the number of parameters from which their models are built ... suspect some circular logic ..."

The derivative scales as discussed above were derived from fundamental molecular properties or were experimentally measured, so there is no concern on that score.

The relationships between the row scale properties and the properties in columns 1, 2 and 4 of table two reflect paired comparisons of independent amino-acid scales.

The eight property class scales in column three of table 3, reflected in the MLR pairwise comparisons in column two, are independent, although 4 of the amino-acid property class scales are also reflected in columns 1, 2 and 4 of table 3. The other 4 amino-acid property class scales come from separate relationships as described in the manuscript. The last 3 amino-acid property class scales in table 4, derived from an ANOPA analysis, are based upon 49 amino-acid property scales, none of which appear in the row property scales, and thus are independent.

The 8 property class scales in table 4 result in statistically significant MLR relationships with about 50 amino-acid property scales not included in the ANOPA analysis of 49 amino-acid property scales, but the MLR back check of these 49 amino-acid property scales resulted in most of these 49 amino-acid property scales having statistically significant results.

Overall, about 150 amino-acid property scales were evaluated in the work.

"... probable that the inability of previous researchers to arrive at a single scale that predicts the accessible surface area in folded proteins arises because the problem itself is multi-dimensional ... authors describe a variety of classification schemes derived from attempts to rationalize scatterplots of amino acid properties ... authors mention useful additional classification likely related to size of AA side chain ..."

We agree that there are multiple factors involved with the free energetics of protein folding and that the available amino-acid property scales and hydrophobicity scales measure different aspects of the relationship of amino-acids to each other and to water. We believe that we should be able to relate fundamental molecular properties of amino-acids and water into a meaningful larger picture.

In the “Hydrophobicity Revisited: A Molecular Story” manuscript we demonstrate that the controlling factors for our hydrophobicity proclivity energy scale (regression corrected dGow) are: number of H-bonds, polar surface area, non-polar surface area, diameter of the side chain, length of the side chain, residue volume, polar area/volume and surface area/entropy. The MLR coefficient of determination is 99.995% and the regression is F test significant at 6.4 × 10^-15.

However, the point of the present work is to find a single scale that measures a central tendency of “hydrophobicity” with the assumption that residue hydrophobicity (the contrast with water) is the dominant relationship between amino-acids for the purposes of protein alignments.

"... Read Wolfenden and Carter ... AA free energies of vapor to water partition vs. side chain volume ... Wolfenden & Carter accurately predicts Moelbert ASA with 2D plot/linear regression with AA water/vapor and AA water/cyclohexane partition coefficients ..."

Agreed. We will cover this material at an appropriate level of detail given the objective of the present manuscript.

"... need description of TMATCH and homology searches ... use multitude of idiosyncratic hydrophobicity scales describing AA physical chemistry ... notably excluding the only reliable/authentic physical chemistry scale describing AA by Wolfenden and Carter ... concern that the hydrophobicity proclivity scale produces a linear combination that correlates well with the Moelbert ASA scale by hidden, but still circular reasoning ..."

We will provide the TMATCH theory and applications manuscripts to supplement the material available for review.

We will also include an appropriate amount of material from these two TMATCH manuscripts into the current manuscript.

The Juretic average, Rose percent buried and Neumaier X hydrophobicity scales are independent of each other and of the Moelbert ASA scale; therefore there can be no circular or self-referential reasoning/logic involved. Our point is that these three hydrophobicity scales are the best performers that we found and that a normalized average of these three scales will be a top performer as a robust, central tendency hydrophobicity scale.
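The idea of a normalized average of several scales can be sketched as follows. This is an illustration only: min-max normalization to the range 0 to 1 is an assumption (the manuscript may normalize differently), the scales are assumed to be oriented so that larger values mean more hydrophobic, and the three toy dictionaries contain placeholder numbers, not values from the Juretic, Rose or Neumaier scales.

```python
from statistics import mean


def min_max_normalize(scale: dict[str, float]) -> dict[str, float]:
    """Rescale one amino-acid scale linearly so its values span 0 to 1."""
    lo, hi = min(scale.values()), max(scale.values())
    if hi == lo:
        raise ValueError("scale has no spread and cannot be normalized")
    return {aa: (value - lo) / (hi - lo) for aa, value in scale.items()}


def normalized_average(scales: list[dict[str, float]]) -> dict[str, float]:
    """Average the normalized scales over the residues present in all of them."""
    normalized = [min_max_normalize(s) for s in scales]
    shared = set.intersection(*(set(s) for s in normalized))
    return {aa: mean(s[aa] for s in normalized) for aa in sorted(shared)}


# Placeholder values for three residues; NOT taken from any published scale.
toy_a = {"A": 1.0, "L": 3.0, "K": 0.0}
toy_b = {"A": 10.0, "L": 20.0, "K": 0.0}
toy_c = {"A": 0.2, "L": 0.4, "K": 0.0}

consensus = normalized_average([toy_a, toy_b, toy_c])
```

Normalizing before averaging keeps a scale with a large numeric range from dominating the consensus, which is one reason an average of independent scales can behave as a central tendency measure.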

As seen above from the results in the “Hydrophobicity Revisited: A Molecular Story” manuscript, our hydrophobicity proclivity scale can be partitioned into the sums of distinct amino-acid molecular properties and as such the scale is directly relatable to first principles, which by definition are not circular in nature.

We believe we have demonstrated that our hydrophobicity scale embodies the central tendency (i.e. first order effect) of the hydrophobicity phenomenon.

We believe that the manuscript improvements requested by Dr. Jeroncic will obviate any additional concerns with circular reasoning/analysis.

"Seven references to papers where Wolfenden and/or Carter are one of the authors"

Agreed. Thanks.
Competing Interests: No competing interests were disclosed.

COMMENTS ON THIS REPORT

Author Response 15 Oct 2020

David Cavanaugh, Benchmark Electronics, Huntsville, Alabama, USA

Charlie Carter

"... We should be able to do a better job at homology searches if more about how amino-acid physical chemistry leads to protein structure ... the authors allude to work they have done that demonstrates the value of the new scales they describe here, but there is essentially no coverage of this central question in this manuscript, which is disappointing and detracts substantially from the value of the paper"

This original work is described in 3 manuscripts, including the present paper. The other two manuscripts (theory and applications/results), which describe the TMATCH alignment algorithm, are almost ready for publication. These three manuscripts were all part of a single monograph that we decided to partition owing to its length and to the fact that the physical chemistry considerations of the current manuscript would be inappropriate for a bioinformatics journal; we also needed to get this manuscript published to support another published paper (also available for review). We will incorporate into the present manuscript an appropriate level of description of the TMATCH algorithm and actual results from these two manuscripts.

A fourth manuscript goes significantly more deeply into the biophysics and thermodynamics of hydrophobicity and the broader justification for our hydrophobicity proclivity scale from these two points of view. We would like to provide you with a copy of these three other manuscripts, which are in an advanced state of preparation, in answer to a number of your points. We agree with you that more summary material needs to be brought into this manuscript.

Charlie Carter

"... We should be able to do a better job at homology searches if more about how amino-acid physical chemistry leads to protein structure ... the authors allude to work they have done that demonstrates the value of the new scales they describe here, but there is essentially no coverage of this central question in this manuscript, which is disappointing and detracts substantially from the value of the paper"

This original work is described in 3 manuscripts, including this present paper. The other two manuscripts (theory and applications/results) describing the TMATCH alignment algorithm are almost ready for publication. These three manuscripts were all part of a single monograph that we had decided to partition up owing to length and the fact that the physical chemistry considerations of the current manuscript would be inappropriate for a bioinformatics journal and we needed to get this manuscript published to support another published paper (also available for review). We will incorporate into the present manuscript an appropriate level description of the TMATCH algorithm and actual results from these two manuscripts

A fourth manuscripts which goes significantly more deeply into the biophysics and thermodynamics of hydrophobicity and the broader justification for our hydrophobicity proclivity scale from these two points of view. We would like to provide you a copy of these three other manuscripts, which are in an advanced state of preparation, in answer to a number of your points. We agree with you that more summary material needs to be brought into this manuscript.

We point out the excellent linear relationship between our hydrophobicity proclivity scale and the dpx average residue depth (from water) scale in figure 5 of the manuscript. The dpx scale is derived from a suite of proteins with actual solved folded structures, hence dpx is in effect a protein structural description of hydrophobicity where the deeper on average that a residue is buried, the more hydrophobic it is. Contrast the dpx concept with the ideas of average solvent exposed are, average buried area, percent buried, etc. When the relationships in figures 3-5 (with supporting literature) are taken together, the hydrophobicity potential driving protein folding is suggested to be the traditional solvent-solvent partitioning model of protein folding where amino-acids partition between aqueous exposure and burial within the hydrophobic core of folded proteins. We believe that this traditional model of the primary driving force for globular protein folding is apt, but needs some updates and modifications that we develop in detail in the “Hydrophobicity revisited: A Molecular Story” manuscript. We suggest in the present manuscript that aqueous clathrates form about the hydrophobic area of amino-acids in an unfolded state (a structural feature) and hydrophobic surface area on folded proteins possess a surface tension like a gas bubble in water. We are suggesting that the physics driving the initial stages of protein folding is the same physics as the coalescing/condensing of bubbles of non-polar species or gas in water are essentially the same physics which cause the mutual attraction of residues based upon the residue hydrophobic area with associated aqueous clathrate. We expand upon this hypothesis in “Hydrophobicity Revisited: A Molecular Story.”

Our aim was to try to develop a scale which reliably represented the central hydrophobic tendency (average) as a central/first order effect that allowed a simple but meaningful paired comparison of amino-acids for homology relationships. Adding additional variables/properties would in our opinion have detracted from our goals of simplicity of calculation and utilization of what we believe is the primary first order effect driving protein folding.

"... this index is the result of statistical modeling from a variety of different scales of what has been called 'hydrophobicity' and derivatives ... derivative variables not described ... presentation is interesting and potentially relevant, but is deficient in its citation of the literature and in results indicating their methods or results to which they allude ... well motivated ... suffers from a variety of methodological and conceptual problems ..."

We agree that an expansion of what we meant by derivative variables should be included in the manuscript

The derivative variables concept referred to in the manuscript are primarily ratios of molecular properties or directly measureable bulk properties of amino-acids. In this way we derive normalized intrinsic thermodynamical properties versus extrinsic/bulk properties. For example, we find that the ratio of amino-acid surface area to their volume has a relationship to hydrophobicity.

Other derivative variables we used could be exemplified by estimating the water cavity volume/surface area using amino-acid volume and surface areas and some assumptions about the aqueous cavity geometry.

We will be expanding our reference list of amino-acid scales utilized to better define the amino-acid property scales we utilized in addition to hydrophobicity scales.

We will also include the work of Wolfenden et. al. into the manuscript.

"... the authors have cited just about every effort to correlate a single hydrophobicity predictor with amino-acid surface exposure in folded proteins ... but have excluded an important hydrophobicity metric derived from classical physical chemistry methods by Wolfenden and co-workers using water/vapor and water/cyclohexane partition distributions of amino-acid side chain mimics ... Wolfenden has persuasively argued that n-octanol is a very unsatisfactory reference solvent for multiple reasons such as the ability of side chains to drag non-repeatable amounts of water into the n-Octanol solvent ... "

Since this experimental approach is unique, we will add additional material to the manuscript to incorporate these considerations and the work of Wolfenden et. al. and the implications of the work.

We agree that the standardization of water concentration within a polar solvent (immiscible with water) capable of hydrogen bonding is difficult in addition to the uncontrolled/variable amount of water dragged into the organic solvent phase in solvent/water partition experiments. We cover these issues at length in “Hydrophobicity Revisited: A Molecular Story,” but we agree that these considerations should be covered to some extent in the present manuscript. We also point out that solvent/water concentration ratios used to calculate free energy of transfers from water to the non-polar phase as a hydrophobicity measure relevant to protein folding suffer a systemic error if the organic solvent is incapable of hydrogen bonding as is the hydrophobic “solvent like” core of folded globular proteins. Our hydrophobicity proclivity scale has no units, but rather is a normalized proclivity. We do show in figure 4 of the manuscript that our hydrophobicity proclivity scale is directly relatable to free energy of solvent-solvent transfer using a popular dGow (water to n-octanol) scale.

We also point out that characterizing the amino-acid secondary groups alone does not treat the important contribution of the free energetics of folding by peptide bonds, such as is accounted for by some researchers by treating the 20 amino-acids as guests in the center of tri-peptides in octanol to water free energy studies.

"... omitting the Wolfenden free energies is a grave oversight ... are looking for signal in a variety of data sets that have already been corrupted by similar unsuccessful attempts by previous investigators who have kludged the extant variety of multiple scales ... any useful result the present authors may have achieved is likely to be idiosyncratic and only indirectly based upon physical chemistry ..."

We will include and discuss the work of Wolfenden et al. because of its novelty and the additional insight that it provides.

We largely agree with these statements by Dr. Carter regarding the large profusion of hydrophobicity scales in the literature. Our criticisms are a bit more muted in that we recognize and discuss some of the difficulties in trying to reflect the wide range of amino-acid behaviors and the difficulties inherent in trying to define hydrophobicity as an experimentally measurable concept related to the driving forces of protein folding.

We have relied upon other experimentally measured amino-acid properties to cross check our hydrophobicity proclivity scale using methods such as Multiple Linear Regression (MLR) as can be seen in table 3. For example, we can draw a good linear correlation between certain hydrophobicity scales, such as our hydrophobicity proclivity scale, and amino-acid reverse phase (C₁₈ column) HPLC retention times (other researchers have drawn this conclusion as well).

"... The authors provide no evidence of statistical tests that might suggest significance ..."

We will provide both an F test from the MLR software we used and a Student’s T test of the MLR (8 amino-acid property class scales) correlation coefficient R. We will show these results in a new table with amino-acid properties versus the R², T and F significances. The MLR is significant where both the T test and the F test are significant at an alpha of 0.05 or better (actual alpha = 0.05*0.05 =0.0025 or better). These statistical significance tests are on the MLR results and not the individual MLR coefficients.

"... the correlations they describe ... are very likely to be successful only in proportion to the number of parameters from which their models are built ... suspect some circular logic ..."

The derivative scales as discussed above were derived from fundamental molecular properties or experimentally measured, so there is no concern on that score.

The row scale property relationships with the properties in columns 1, 2 and 4 of table 2 reflect paired comparisons between independent amino-acid scales.

The eight property class scales in column three of table 3 reflected in MLR pair wise comparisons in column two are independent, although 4 of the amino-acid property class scales are also reflected in columns 1, 2 and 4 of table 3. The other 4 amino-acid property class scales come from separate relationships as described in the manuscript. The last 3 amino-acid property class scales in table 4, derived from an ANOPA analysis, are based upon 49 amino-acid property scales, none of which appear in the row property scales, and are thus independent.

The 8 property class scales in table 4 yield statistically significant MLR relationships with about 50 amino-acid property scales that were not included in the ANOPA analysis of 49 amino-acid property scales. An MLR back check of the 49 scales that were included also gave statistically significant results for most of them.

Overall, about 150 amino-acid property scales were evaluated in the work.

"... probable that the inability of previous researchers to arrive at a single scale that predicts the accessible surface area in folded proteins arises because the problem itself is multi-dimensional ... authors describe a variety of classification schemes derived from attempts to rationalize scatterplots of amino acid properties ... authors mention useful additional classification likely related to size of AA side chain ..."

We agree that there are multiple factors involved with the free energetics of protein folding and that the available amino-acid property scales and hydrophobicity scales measure different aspects of the relationship of amino-acids to each other and to water. We believe that we should be able to relate fundamental molecular properties of amino-acids and water into a meaningful larger picture.

In the “Hydrophobicity Revisited: A Molecular Story” manuscript we demonstrate that the controlling factors for our hydrophobicity proclivity energy scale (regression corrected dGow) are: #H-bonds, polar surface area, non-polar surface area, diameter of the side chain, length of the side chain, residue volume, polar area/volume and surface area/entropy. The MLR regression coefficient of determination is 99.995% and is F test significant at 6.4x10^-15.

However, the point of the present work is to find a single scale that measures a central tendency of “hydrophobicity” with the assumption that residue hydrophobicity (the contrast with water) is the dominant relationship between amino-acids for the purposes of protein alignments.

"... Read Wolfenden and Carter ... AA free energies of vapor to water partition vs. side chain volume ... Wolfenden & Carter accurately predicts Moelbert ASA with 2D plot/linear regression with AA water/vapor and AA water/cyclohexane partition coefficients ..."

Agreed. We will cover this material at an appropriate level of detail given the objective of the present manuscript.

"... need description of TMATCH and homology searches ... use multitude of idiosyncratic hydrophobicity scales describing AA physical chemistry ... notably excluding the only reliable/authentic physical chemistry scale describing AA by Wolfenden and Carter ... concern that the hydrophobicity proclivity scale produces a linear combination that correlates well with the Moelbert ASA scale by hidden, but still circular reasoning ..."

We will provide the TMATCH theory and applications manuscripts to supplement the material available for review.

We will also include an appropriate amount of material from these two TMATCH manuscripts into the current manuscript.

The Juretic average, Rose percent buried and Neumaier X hydrophobicity scales are independent of each other and the Moelbert ASA scale, therefore there can be no circular or self-referential reasoning/logic involved. Our point is that these three hydrophobicity scales are the best performers that we found and that a normalized average of these three scales will be a top performer as a robust, central tendency hydrophobicity scale.

As seen above from the results in the “Hydrophobicity Revisited: A Molecular Story” manuscript, our hydrophobicity proclivity scale can be partitioned into the sums of distinct amino-acid molecular properties and as such the scale is directly relatable to first principles, which by definition are not circular in nature.

We believe we have demonstrated that our hydrophobicity scale embodies the central tendency (i.e. first order effect) of the hydrophobicity phenomenon.

We believe that the manuscript improvements requested by Dr. Jeroncic will obviate any additional concerns with circular reasoning/analysis.

"Seven references to papers where Wolfenden and/or Carter are one of the authors"

Agreed. Thanks.
Competing Interests: No competing interests were disclosed.

Version 2

VERSION 2 PUBLISHED 15 Oct 2020

Version 1

VERSION 1 PUBLISHED 21 Oct 2015

Discussion is closed on this version, please comment on the latest version above.

Author Response 05 Oct 2020

David Cavanaugh, Benchmark Electronics, Huntsville, Alabama, USA

I am one of the authors of this paper. I wanted to update some news. The revision 2 of this paper, which reflects the reviewer comments, is now in press for F1000research. There is another paper on the subject of hydrophobicity published in ChemRxiv that is the follow up paper to this one:

https://chemrxiv.org/authors/David_Cavanaugh/8853095
Competing Interests: No competing interests were disclosed.

Reviewer Reports

Invited reviewers: 1, 2, 3
Version 2 (revision), 15 Oct 2020: report from reviewer 3
Version 1, 21 Oct 2015: reports from reviewers 1 and 2

Charles Carter, University of North Carolina at Chapel Hill, Chapel Hill, USA
Ana Jerončić , University of Split School of Medicine, Split, Croatia
Ruchi Lohia, University of Pennsylvania, Philadelphia, USA

Reviewer Report

21 Oct 2024 | for Version 2

Ruchi Lohia, University of Pennsylvania, Philadelphia, Pennsylvania, USA

Approved With Reservations

Summary
The manuscript addresses a significant challenge in bioinformatics: improving multiple sequence alignment through enhanced amino acid similarity metrics. The authors introduce a "hydrophobic proclivity index," which they argue can enhance our ability to identify homologous sequences, particularly in cases where traditional methods struggle due to sequence divergence.
General Comments

Motivation and Relevance: The problem of aligning distantly related sequences is well-articulated. The authors suggest that understanding amino acid physical chemistry in relation to protein structure can improve homology searches, which is a compelling premise for their work.
Utility of the Hydrophobicity Scale: The authors propose that their hydrophobicity scale will be beneficial for protein alignments. However, the manuscript lacks substantial evidence demonstrating the practical effectiveness of this scale in comparison to established hydrophobicity metrics. Specific examples of proteins where the hydrophobicity scale outperforms existing methods would greatly enhance the manuscript's claims. Additionally, it would be important to consider that the hydrophobicity of amino acids is context-dependent, influenced by nearby residues in the 2D sequence. Therefore, it raises the question of whether a slightly better absolute hydrophobic score would significantly impact alignment scores.

Specific Comments

Text, Figure, and Equation Clarifications:
- In Figure 1, it is unclear from where the authors obtained the amino acid solvent-exposed surface area.
- Most figures, including Table 3, lack appropriate captions that clearly describe what the x and y axes represent. Informative captions are essential for reader comprehension and should detail the significance of the data presented.
- The statement, “This scatter is indicative of the increase in water amino acid interaction and of the difficulty of accurately calculating the contribution of any particular residue,” requires clarification. What exactly does the author mean by this?
Titles and Flow:
- Titles such as "1D ANOPA analysis" do not provide adequate context. These titles should be preceded by a paragraph explaining their relevance and methodology to ensure better narrative flow throughout the manuscript.
Hydrophobicity Proclivities:
- The term "amino-acid hydrophobicity proclivities" used in Figure 2 should be clearly defined. Readers need to understand what this measure entails and how it was derived to fully appreciate its application in the context of the manuscript. The authors never reference Equation 1; is this the definition of amino-acid hydrophobicity proclivities?
Application in TMATCH Algorithm:
- The manuscript indicates that the TMATCH algorithm uses this hydrophobicity scale to improve protein alignments. However, there is a lack of empirical evidence demonstrating how the algorithm performs with this scale in practice. The authors do not provide specific examples or case studies where the TMATCH algorithm, utilizing their hydrophobicity scale, has been applied successfully.
Evidence of Effectiveness:
- While the authors suggest that the new scale will enhance alignment performance, there is no concrete data or comparative results presented in the manuscript to substantiate this claim. For the manuscript to be convincing, it should include results from tests comparing the performance of the TMATCH algorithm with their hydrophobicity scale against established methods.

Conclusion
In conclusion, the manuscript presents an interesting approach to improving protein sequence alignment through the introduction of a hydrophobic proclivity index. However, it suffers from several methodological and conceptual issues that need to be addressed to enhance its impact and credibility. The authors should either provide empirical evidence demonstrating better alignment of proteins using their hydrophobicity scores or consider changing the title and steering the paper towards the development of a new hydrophobicity measure. Additionally, clarifying terminology and improving figure presentations will significantly strengthen the manuscript.

Is the work clearly and accurately presented and does it cite the current literature?

Partly
Is the study design appropriate and is the work technically sound?

Partly
Are sufficient details of methods and analysis provided to allow replication by others?

Yes
If applicable, is the statistical analysis and its interpretation appropriate?

I cannot comment. A qualified statistician is required.
Are all the source data underlying the results available to ensure full reproducibility?

No source data required
Are the conclusions drawn adequately supported by the results?

Partly

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Molecular dynamic simulations, transcriptomics

Reviewer Report

25 Jul 2016 | for Version 1

Ana Jerončić , Department of Research in Biomedicine and Health, University of Split School of Medicine, Split, Croatia

Not Approved

What was the reasoning behind the assumption that the property classes of amino acid residues identified through ‘linear-clusters’ represent “real properties of amino-acids within context of folded proteins”? Or was there no such assumption, and the interpretation was instead justified by the fact that the regression models of all hydrophobicity scales exhibited the highest R² values when these property classes were used as independent variables? If the latter was the case, such reasoning would not be justified (see comments on multiple linear regression analysis).
The authors should describe the method they used to identify linear clusters on a plot (i.e. visual identification, followed by analysis of amino acid physicochemical/biochemical properties in clusters and regression-analysis of clusters that confirmed the cluster status, or something else).
How did the authors end up with the final set of 6 (or 3?) scatter plots from which they have derived their property classes? Were “linear-clusters” identified only in these plots, or did the authors select the final plots based on relevance of plotted variables in folded proteins? If the latter was true, what were the criteria they used to identify the most relevant scatter-plots?
All property classes, including the classes #6, #7, and #8, should be precisely defined. The description “classes #6, #7 and #8 were derived from 49 fundamental amino acid properties and derived scales that are based upon an analysis with Analysis of Patterns (ANOPA)” is unacceptable. Which of the 49 fundamental amino acid properties and their derived scales were used, and how, to identify property classes #6 to #8?
Scatter-plots that were used for generation of classes from 5 to 8 should be shown.

5 - Comments on the reporting style

The Introduction section is quite short – the authors should elaborate more on relevant physico-chemical properties of amino acids and their importance in protein folding in this section.
There are parts of the Introduction in the Results section (the first paragraph) and the Discussion section (alignment matrices).
The hydrophobic scale that was chosen as the optimal one was the normalized average of three published hydrophobicity scales that were found “most robust in correlation analyses”, with robustness vaguely defined in the Methods section as associations to “multiple fundamental properties of the 20 natural amino acids using multi-variate statistical procedures, thermodynamics and biophysical chemistry considerations”. It is only later, at the very end of the Results section, that one can find out that “robust” scales are actually those whose MLR models using property classes as dependent variables exhibited the highest R² values. The Methods section should be written more clearly.
Table 2 – The labelling of Table 2 should be improved as authors keep explaining what is presented in which column of the Table 2 throughout the Results section.

Competing Interests

No competing interests were disclosed.

Responses (1)

Author Response

15 Oct 2020

David Cavanaugh, Benchmark Electronics, Huntsville, Alabama, USA

Ana Jeroncic

"... Ambiguously defined classes #6, #7 and #8 were derived from 49 fundamental amino-acid properties and derived scales that are based upon an analysis with Analysis of Patterns (ANOPA) ..."

The 49 AA properties are represented by:
- 2 sequence frequency scales
- 6 secondary structure propensity scales
- 8 hydrophobicity scales
- 4 free energy scales (in water, protein folding/unfolding)
- 9 HPLC retention time scales
- 4 probability of an AA inside a folded protein core or on the outside
- 7 molecular property scales
- 9 physical (measured) property scales
The ANOPA method is a pattern recognition, pattern projection method. The ANOPA procedure projects n-space pattern points/vectors into a 3-dimensional object, which is a cylinder. The axis of the cylinder, which is termed the relation vector, is formed from the pattern point centroid (averages of each AA property for all points) to an outgroup average point (averages of each AA property for a pair of selected points). The outgroup pair is selected on the basis of a histogram of the pattern point Euclidean distances from the pattern point centroid. Two pattern point distances are calculated with respect to the relation vector from a projection of each point onto the relation vector, yielding the distance along the relation vector and the distance from the relation vector. The angle of rotation of each pattern point projection onto the relation vector is calculated. Thus, a cylindrical coordinate system is formed, which is then converted into rectilinear coordinates. The X’ and Y’ rectilinear points are formed from each pattern point’s pattern projection distance times the cosine and sine respectively of the pattern projection angle of rotation. The Z’ point is simply the pattern projection intersection distance along the relation vector.
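The projection described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' ANOPA implementation: the function name `anopa_project` is hypothetical, the outgroup pair is supplied by the caller rather than chosen from a distance histogram, and, because the response does not say how the angle of rotation is referenced, the sketch assumes it is measured against the perpendicular component of one chosen reference point.

```python
import math

def anopa_project(points, outgroup_pair, reference_index=0):
    """Map n-dimensional pattern points to (X', Y', Z') tuples (sketch)."""
    dim = len(points[0])
    count = len(points)
    # Centroid: average of each property over all pattern points.
    centroid = [sum(p[k] for p in points) / count for k in range(dim)]
    # Outgroup average point from the two caller-selected points.
    i, j = outgroup_pair
    outgroup = [(points[i][k] + points[j][k]) / 2.0 for k in range(dim)]
    # Relation vector: centroid -> outgroup average, scaled to unit length.
    axis = [o - c for o, c in zip(outgroup, centroid)]
    axis_len = math.sqrt(sum(a * a for a in axis))
    if axis_len == 0.0:
        raise ValueError("outgroup average coincides with the centroid")
    unit = [a / axis_len for a in axis]

    def split(point):
        """Return (distance along the axis, perpendicular residual vector)."""
        rel = [x - c for x, c in zip(point, centroid)]
        along = sum(r * u for r, u in zip(rel, unit))
        perp = [r - along * u for r, u in zip(rel, unit)]
        return along, perp

    # ASSUMPTION: the rotation angle is measured against this reference residual.
    _, ref = split(points[reference_index])
    ref_len = math.sqrt(sum(r * r for r in ref))

    coords = []
    for point in points:
        along, perp = split(point)
        dist = math.sqrt(sum(r * r for r in perp))
        if dist == 0.0 or ref_len == 0.0:
            angle = 0.0
        else:
            cos_a = sum(a * b for a, b in zip(perp, ref)) / (dist * ref_len)
            angle = math.acos(max(-1.0, min(1.0, cos_a)))
        # Cylindrical (dist, angle, along) converted to rectilinear coordinates.
        coords.append((dist * math.cos(angle), dist * math.sin(angle), along))
    return coords
```

With this angle convention the angle lies between 0 and 180 degrees, so Y’ is never negative; a different reference convention would change X’ and Y’ but not Z’ or the distance from the relation vector.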
There is a relatively flat and linear structure to the higher order correlation structure in the 49 AA property pattern space. The 3D ANOPA X’ coordinate has a strong linear relationship (R²=95.13%) with the angle of rotation Ar pattern point/vector relation vector projection vectors. Similarly, the 3D ANOPA Y’ coordinate has a strong linear correlation (R²=99.86%) with the d2 distances of the pattern points/vectors from the relation vector.
Each of the 3D ANOPA coordinates has a meaningful interpretation/association with amino-acid properties. The X’ coordinates have 3 good linear series when scatter plotted with AA refractivity [179] and AA mass/(area/volume) [192], and 4 linear series with the Tanford hydrophobicity scale [197]. The AA refractivity scale has a strong linear relationship with the AA mass/(area/volume) derived scale, except that Glycine is off the regression line. The Y’ coordinates have 2 linear series when scatter plotted against the DNA/RNA numerically encoded (0-63) lexical table (UCAG, UCTG) averages and 3 linear series when scatter plotted against the AA property derived scale 2*H-B-DB. In the latter scale H is our hydrophobicity proclivity scale, B is a beta sheet proclivity scale derived from several published statistical scales and DB is the double bend proclivity scale derived from several literature statistical scales. The 2*H-B-DB scale reliably predicts (a number of proteins were sampled) the presence (or propensity) of alpha helices (value>0) and beta sheets (value<0) with sinusoidal trends (running average of 3) as seen with the primary sequence AA ordinal numbers plotted along the X axis. Since the Y’ coordinate has a strong relationship with the 2*H-B-DB scale, which itself has a strong relationship with protein secondary structure, we see that the Y’ coordinate also has a relationship with protein secondary structure. We also can infer that there is a definite relationship between protein secondary structure and the DNA code, which in itself is related to AA hydrophobicity, as we have documented in the hydrophobicity paper, and to the fact that the secondary structure proclivity scale by definition is related to hydrophobicity.
Finally, the 3D ANOPA coordinate Z’ strongly correlates with all of the AA hydrophobicity scales (and some of the AA HPLC scales) in our database, both those used in the 49 property ANOPA analysis and those that were not used in this analysis.
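As a rough illustration of how the 2*H-B-DB scale described above could be applied along a sequence, the sketch below computes the per-residue value, smooths it with a running average of 3, and labels positive stretches as helix-leaning and negative stretches as sheet-leaning. The function names and the label characters are hypothetical, the treatment of the sequence ends (a truncated window) is an assumption, and the scale values must be supplied by the caller because none are given in this response.

```python
def combined_propensity(sequence, h_scale, b_scale, db_scale):
    """Per-residue value of 2*H - B - DB for a one-letter sequence."""
    return [2 * h_scale[aa] - b_scale[aa] - db_scale[aa] for aa in sequence]

def running_average(values, window=3):
    """Centered running average; the window is truncated at the sequence ends
    (an assumption, the response does not specify end handling)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

def label_secondary_structure(smoothed):
    """'H' where the smoothed value is > 0, 'E' where it is < 0, '-' at zero."""
    return ["H" if v > 0 else "E" if v < 0 else "-" for v in smoothed]

# Placeholder scale values for two residues, for illustration only.
H = {"A": 3.0, "V": 4.0}
B = {"A": 1.0, "V": 10.0}
DB = {"A": 1.0, "V": 1.0}
raw = combined_propensity("AAAVVV", H, B, DB)
labels = label_secondary_structure(running_average(raw))
```

The placeholder numbers are not taken from any published scale; they only exercise the sign rule.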

"... MLR R² value represents a rigorous test for a robust, high performance hydrophobicity scale ... The rationale for such an assumption was that for all tested hydrophobicity scales, the R² values of the Multiple Linear Regression (MLR) were higher than the R² values of the simple binary regression models ... I disagree with the authors on the MLR R² rationale as I have concerns about the appropriateness of the data analysis"

The argument presented is not that the MLR regression was superior to that of the binary regression because of higher correlation coefficients, although possibly that may be true, but rather that the AA property class MLR represented more information in that it represented the behaviors of amino-acids in a larger series of contexts and properties. Also, the argument implied by having an MLR with the 8 property class scales is that for each context the amino-acids partition into sub-sets and that these different context sub-sets join together to determine amino-acid behavior in protein folding, interactions with water, interactions with membranes, secondary structure and electronic behaviors associated with AA-AA interactions. Moreover, we argue that within any given regression, the higher the correlation coefficient, the stronger the evidence for the superiority of a given hydrophobicity scale within that regression criterion. Top performance of a hydrophobicity scale within several regression relationship criteria based upon different property information leads to a performance/robustness conclusion based upon the consilience of the evidence.
We use the coefficient of determination (R²) rather than the correlation coefficient because it is a much more conservative statistic than the correlation coefficient (R) and can meaningfully be interpreted as the percent linearity between the dependent variable and the X independent variable(s) in the regression. Concerns might be raised about two points in the property class MLR, which are the possibility of inter-correlation between X variables and the loss of degrees of freedom in the regression statistics (over-fitting). We deal with these concerns in several ways. First, we calculate a T test significance only for the regression itself and not for any individual X variable regression coefficient, since the Y variable regression performance itself is what is being measured. Second, the number of degrees of freedom (20) in the correlation coefficient T test (null hypothesis being that R=0) is reduced by 2 for each X variable coefficient in the regression and by 1 for the Y axis intercept. In the AA property class MLR we calculate the resulting degrees of freedom in the regression and correlation coefficient T test as 20-(2*8+1) = 3. We allow for some dimensionality reduction due to inter-X pair correlation as long as the dimensionality reduction (i.e. percent reduction in the number of states with discrete variables) is not large and new information is being brought forth between each pair of X variables. Where there is no new information adduced by inclusion of an X variable, we find that the magnitudes of the correlation coefficients are reduced, and significantly reduced again by the squaring process used to get the coefficient of determination. Furthermore, we demand that any set of property class vectors produce large, significant coefficients of determination with a large number of disparate AA property scales in our database.
The latter criterion produces a very, very high degree of statistical confidence that the collection of property class scales can serve as a highly robust basis set for MLR vector regressions that can evaluate the robustness/significance of individual AA property scales.
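The degrees-of-freedom arithmetic and the whole-regression significance tests mentioned above can be written out as a short sketch. The function names are hypothetical. The T statistic uses the usual form for testing R=0, t = R·sqrt(df/(1−R²)), evaluated with the authors' count of df = 20 − (2·8 + 1) = 3. The F statistic is the standard overall-regression form, which assumes the conventional residual degrees of freedom n − k − 1 (11 for 20 residues and 8 class scales) rather than the authors' count; whether the authors' MLR software reports F this way is an assumption.

```python
import math

def authors_degrees_of_freedom(n_residues=20, n_class_scales=8):
    """Degrees of freedom as counted in the response: n - (2*k + 1)."""
    return n_residues - (2 * n_class_scales + 1)

def correlation_t_statistic(r_squared, dof):
    """t statistic for the null hypothesis R = 0 with the given dof."""
    r = math.sqrt(r_squared)
    return r * math.sqrt(dof / (1.0 - r_squared))

def regression_f_statistic(r_squared, n_residues=20, n_class_scales=8):
    """Standard overall F statistic of a linear regression with an intercept.

    ASSUMPTION: uses the conventional residual dof, n - k - 1.
    """
    dof_resid = n_residues - n_class_scales - 1
    return (r_squared / n_class_scales) / ((1.0 - r_squared) / dof_resid)

# Two-tailed 5% critical value of Student's t distribution with 3 dof.
T_CRIT_3DOF = 3.182

dof = authors_degrees_of_freedom()
t_value = correlation_t_statistic(0.90, dof)   # illustrative R² of 0.90
f_value = regression_f_statistic(0.90)
significant_by_t = t_value > T_CRIT_3DOF
```

The R² of 0.90 is an arbitrary illustrative input, not a value reported in the manuscript.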

"... the reporting in the manuscript should be substantially improved ..."

We agree and are revising the manuscript accordingly.

"Throughout the paper the description of the MLR models is very confusing ... it is not clear what models were actually run ... what was the dependent and independent variables ... what was the estimated regression coefficients and their statistical significance ..."

We have addressed these comments to some extent above. Additionally we will say that we will move all of the related discussion of regression methodology into the methods section where the discussion will make more sense. The key point is that the MLR X’s are not single variables, but rather are column vectors with each column vector having values for each amino-acid. In this context we equate the concept of a scale with the concept of a column vector. Having made this distinction, the concept of a Multi-Linear Regression (MLR) is still valid and simply represents a broader mathematical context. Our use of the MLR methodology with the 8 property class scales is to evaluate a correlation relationship as whole, so the entire regression correlation is what is used/important and the statistical significance of individual variable regression coefficients is meaningless. If the MLR method was being used to fit a variable for the purpose of extrapolating new Y values with scale values outside of the basis set, such as with new amino-acids, then the statistical significance of variable coefficients would be germane.

"eight property classes ... vector sets of clusters/linear families of curves in multiple linear regression relationships between two (or more) amino-acid physico-chemical properties .. which is very confusing ... seems like more independent variables are being added than the eight property class variables ..."

One must keep in mind that the MLR method that we are using uses vectors (scales) of dimensionality of 20 (one for each AA) and not individual variables. When doing AA property scale pair scatter plots between each property scale selected for our database we often find clustering behavior which indicates that the amino-acids are partitioning into distinct sub-sets and each amino-acid can be assigned a numerical value for the ordinal number of the sub-set into which it partitions. The clustering relationships uncovered for any particular pair of two AA property scales is generally driven by molecular geometry, secondary group geometry, numbers and types of atoms/inter-atomic bonding in the secondary group, molecular size, molecular mass, molecular volume, molecular surface area and entropy distribution about the molecule. No given pair of AA property scale relationships represent the full range of these molecular properties and the nature of the interaction with water and cellular membranes. The key point though is that there are distinct sets and sub-sets, which when numerically encoded can jointly describe each amino-acid as a row vector of some dimensionality M. We have found through extensive analysis that M is a very good size. Within each property pair scatterplot where clustering occurs, we can have different patterns such as unstructured clusters, multiple quasi-parallel linear clusters or multiple linear clusters that intersect at some point (often at glycine or alanine). Generally, there is a geometric ordering that allows the assignment of a meaningful ordinal number. To reiterate, a property class scale (column vector) represents a relationship between property scales and the values assigned within the scale to each amino-acid are their respective sub-set/cluster ordinal number. 
Property class scales assembled in this way can be used as one of the basis set vectors to be used to evaluate the performance and reliability of any individual amino-acid property scale. We also need to note that the three property class vectors defined between the 3D ANOPA coordinate system planes represent the correlation based dimensionality reduction from a pattern space of dimension 49 to a pattern space of dimension 3.
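To make the idea of regressing a property scale on ordinal property class vectors concrete, here is a minimal pure-Python sketch that fits an ordinary least squares model with an intercept and returns the coefficient of determination. It is not the authors' software: the function names are hypothetical, the toy class vectors cover six items rather than the 20 amino-acids, and the numbers are made up solely to exercise the calculation.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system (collinear class vectors)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def mlr_r_squared(class_vectors, scale):
    """R² of an OLS fit of `scale` on the class vectors plus an intercept."""
    rows = [[1.0] + [float(v[i]) for v in class_vectors]
            for i in range(len(scale))]
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * y for r, y in zip(rows, scale)) for a in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in rows]
    mean = sum(scale) / len(scale)
    ss_res = sum((y - f) ** 2 for y, f in zip(scale, fitted))
    ss_tot = sum((y - mean) ** 2 for y in scale)
    return 1.0 - ss_res / ss_tot

# Toy ordinal class vectors (cluster numbers) for six hypothetical items.
class_1 = [1, 1, 2, 2, 3, 3]
class_2 = [1, 2, 1, 2, 1, 2]
exact_scale = [6.0, 9.0, 8.0, 11.0, 10.0, 13.0]   # 1 + 2*class_1 + 3*class_2
noisy_scale = [6.0, 9.0, 8.0, 11.0, 10.0, 20.0]
r2_exact = mlr_r_squared([class_1, class_2], exact_scale)
r2_noisy = mlr_r_squared([class_1, class_2], noisy_scale)
```

Solving the normal equations directly keeps the sketch dependency-free; for real data a numerically more stable least squares routine would be a reasonable substitute.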

"... majority of these property classes were actually ordinal variables ... more actual variables should have been added owing to the introduction of dummy variables ... sample size of these models too small to estimate model parameters precisely ... "

In fact, all of the numbers within the 8 property class scales are ordinal numbers representing real facts of clustering and the distinct geometrical ordering of the clusters. The practice of using assigned numbers to sets/sub-sets of objects, especially if the numbers can represent some form of ordering, is widely used in multi-variate statistical procedures such as MLR regression, discriminant analysis, Principal Component Analysis (PCA) and procedures cognate to PCA, as is found in fields such as numerical taxonomy, computational biology and other fields that numerically encode state data. The authors have enjoyed great success in using discrete data to represent state data in a number of contexts over a number of years. Since the MLR regressions are not used for predictive purposes, but rather to express the degree of correlation and relatedness as expressed in the correlation coefficient and its T test based P value, the concept of regression coefficient statistical significance is a non-sequitur.

"... R² quite inflated ... over fitted model ... large number of independent variables that the authors did not adjust for compared to simple linear binary regression models ..."

The authors would point out that additional variables do not inflate R² values unless there is a very serious inter-variable correlation between a large sub-set of the X variables. To the contrary, we find that working with the amino-acid property scales that we have assembled that there is generally an artificial deflation of R² relationships. Therefore, we do not agree that the introduction of variables into an MLR artificially inflates a MLR correlation versus a binary regression correlation. We adjust for these concerns in three ways: use of the coefficient of determination statistic is very conservative and we screen out a number of potentially significant relationships thereby; we use a T test to assess the statistical significance of the whole regression relationship and reduce the number of degrees of freedom of the T test accordingly as described above; we uncover the specific physical meanings reflected in each property scale relationship cluster to ensure that the physical-chemical factors at play are distinct and bring new information to the table.

"... problem of multicollinearity between independent variables ... inflated MLR R² ... table 2 PC2 and PC4 had a Kendall tau coefficient of 0.82, P<0.001"

We note that property classes 1 and 2 devolve from the relationship between the hydrophobicity proclivity scale and specific absolute entropy and that property classes 3 and 4 devolve from the relationship between the hydrophobicity proclivity scale and the specific volume, as seen in figures 2 and 1 respectively. Both the specific absolute entropy and the specific volume are thermodynamic intrinsic property scales formed by the normalization of the amino-acid molecular volumes and molecular absolute entropy by their molecular masses, which creates two secondary or derived scales from fundamental molecular properties that have the molecular mass in common. Furthermore, both the total entropy and the volume are related in some way to the molecular mass, so dividing by the molecular mass in both cases offsets that dependence and leaves an intrinsic average property. Property class 2 partitions into two linear subgroups on the basis of polar and non-polar amino-acids. Property class 4 partitions into 4 linear subgroups on the basis of size, relative amounts of non-polar area and the presence/absence of a relatively strong or weak polar group. While property class 4 does reflect the polar/non-polar distinction to some extent, it does not reflect a strong and clear partition of AA by polarity or non-polarity; therefore, new information is introduced by the addition of property class 4. We note that there are 2 × 4 = 8 possible combinations of states between property classes 2 and 4, but in fact only 5 states are actually formed, giving a dimensionality reduction of 1 − 5/8 = 37.5%, which we consider to be acceptable. We do not see strong enhancement of MLR correlation coefficients with any statistically significant AA property regression with the 8 property class scales.
Given the nature of the interrelationship of fundamental molecular properties based upon AA molecular size, molecular geometry, entropy/entropy density, mass/mass density, surface area or electronic configuration/hybridization, some inter-property class correlation is inevitable and must be tolerated to make any progress in understanding amino-acid physico-chemical properties and protein folding. See the additional relevant discussion above.
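The state count quoted above (5 observed joint states out of 2 × 4 = 8 possible) can be reproduced with a short sketch. The subgroup labels below are hypothetical placeholders chosen only so that 5 of the 8 combinations occur; they are not the actual class assignments from the manuscript, and the function name `state_reduction` is an illustrative assumption.

```python
def state_reduction(labels_a, labels_b):
    """Return (possible, observed, reduction) for two categorical scales.

    possible  = product of the number of distinct levels in each scale
    observed  = number of distinct joint (a, b) states that actually occur
    reduction = 1 - observed / possible
    """
    possible = len(set(labels_a)) * len(set(labels_b))
    observed = len(set(zip(labels_a, labels_b)))
    return possible, observed, 1 - observed / possible


# Hypothetical labels (NOT the manuscript's assignments): a 2-level scale
# and a 4-level scale whose items fall into only 5 joint states.
class2 = ["polar", "polar", "polar", "nonpolar", "nonpolar", "nonpolar"]
class4 = ["a", "b", "c", "c", "d", "d"]

possible, observed, reduction = state_reduction(class2, class4)
print(possible, observed, f"{reduction:.1%}")
```

With real class assignments, the same function would report how many joint states are actually occupied, which is one simple way to gauge how much two ordinal scales overlap.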

"... main result of this paper ... based upon assumption that amino acid property classes forming linear clusters ... within context of folded proteins ... is not a valid assumption"

We agree that the 8 property class scales are one of the major results of this study, but we do not agree that it is the only one. We do not assume that the 8 amino-acid property class scales are a major determinant of protein folding per se, but rather we assume that the fundamental molecular properties of each amino-acid, in the context of a folded protein environment or in the context of contact with a water environment, are what determine protein folding. The assemblage of the 8 amino-acid property class scales that we report reflects a widespread clustering pattern (sometimes in linear series) in 2D scatterplots and 3D ANOPA scatterplots between two or more scales. The 8 amino-acid property class scales we have assembled for our present study reflect the best of the clustering patterns forming unique subsets of amino-acids in different contexts/molecular properties with regard to hydrophobicity scales, but also with a number of other amino-acid physico-chemical property scales, secondary structure statistical propensity scales and molecular property scales. To the extent that any given hydrophobicity scales correlate with, and were determined from, folded proteins, reflecting secondary/super-secondary structure of folded proteins, the 8 property classes that we report are relevant to folded proteins. The 8 property classes that we report have statistically significant (P<0.05) MLR regressions with 124 amino-acid property scales in our assembled database, which implies that the behaviors reflected in these numerical scales can largely be reproduced from the clustering behaviors of the amino-acids.
While linking hydrophobicity to protein folding is an important idea, the finding that nature has selected the 20 natural amino-acids on the basis of distinct sub-sets/clustering in several different joint property dimensions is a significant finding in and of itself, perhaps leading to criteria for the engineered selection of un-natural amino-acids that could provide unique properties to active enzymes/proteins.

"... R² is an overused statistic for linear regression analysis and additional metrics are required to get the whole picture ...R² actually represents the square of the Pearson correlation coefficient"

We do not agree with the contention that the R² statistic is trivial and meaningless from overuse. Rather we point out that R² is a more conservative statistic than is the correlation coefficient and rolls off much more quickly, as well as having an important interpretation as the percent linearity that directly indicates the relative scatter of the regression calculated Y’s with respect to the actual Y’s.
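The "rolls off much more quickly" remark can be illustrated numerically. The following is a minimal sketch that only tabulates R against R² for a few arbitrary correlation values; the values are illustrative and not taken from the manuscript.

```python
def r_squared(r):
    """Coefficient of determination for a correlation coefficient r."""
    return r * r


# Arbitrary illustrative correlation coefficients
for r in (0.99, 0.95, 0.90, 0.80, 0.70, 0.50):
    print(f"R = {r:.2f}   R^2 = {r_squared(r):.4f}")
```

For any R between 0 and 1, R² is never larger than R, so a threshold applied to R² is stricter than the same threshold applied to R.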

"... A reader should know precisely which scatter plots were screened for a 'linear-cluster' pattern ... which hydrophobic scales and other physico-chemical properties of amino-acids were used for the scatter plots ... how many scales used"

There were approximately 175 amino-acid property scales used for our study, of which maybe 15-20 were derivative scales prepared from ratios of fundamental molecular properties such as volume and mass, which would represent intrinsic versus extrinsic scales in the thermodynamical sense. About another 10-15 scales represent averages of literature reported amino-acid property scales. Please note that linear clusters were not the only patterns seen. There were some non-linear patterns and simple clusters, for example. Also note that the scales selected represent a very wide range of amino-acid properties including secondary structure statistical propensity scales, fundamental molecular properties like surface area/mass/volume, bulk properties such as index of refraction/melting point/pK-C, HPLC retention times, free energy in water, solvent/water partitioning, average fraction occurring in proteins, average amino-acid burial in proteins/amino-acid exposure to water in folded proteins, hydrophobicity scales, NMR parameters, Rf parameters, several literature Principal Component Analysis/Factor Analysis parametric scales and other miscellaneous property scales. Of the 175 amino-acid property scales (and about 200 total scales), 124 scales had statistically significant (P<0.05) MLR regression correlation with the 8 property classes that we have reported.

"... the eight amino-acid property classes most important novelty of this paper ... what is the reasoning that AA property classes represent real AA physico-chemical properties in the context of folded proteins ... rather was the assumption that high R² from MLR justifies assertion of relatedness to physico-chemical properties ... a high R² not a valid basis for assumption ..."

We agree that the 8 amino-acid property scales are novel and an important finding. We disagree with the comments on the MLR regression correlation and assert that high MLR R² values are both statistically and physically valid measures for reasons discussed above. The high and statistically significant coefficients of determination are spread over many types of amino-acid physico-chemical property scales, including structural statistical propensity scales and hydrophobicity scales. We find that there is a strong relationship between amino-acid HPLC retention times and a number of the higher quality hydrophobicity scales in our assembled database; hence there is a direct link between a measurable amino-acid bulk property and amino-acid partitioning behavior in the structure of folded proteins.

"... the authors should describe the method that they used to identify linear clusters on a plot ... follow up analysis of physico-chemical/biochemical properties of clusters ... regression analysis ..."

We did describe most of the points commented on here in the original manuscript, but agree that the descriptions need to be expanded upon. There were a number of linear clustering patterns, amongst other patterns, that can be described as multiple quasi-parallel series, multi-linear series that intersect at a given amino-acid serving as a quasi or virtual origin (e.g. Karplus 1997), and quasi-parallel cross hatched linear patterns (e.g. figures 1 & 2 in the manuscript). We ran, although we did not report, a regression correlation coefficient on each series to assess its relative linearity. Where we found what we discerned as significant relationships, we evaluated those putative relationships from the point of view of physical reality as expressed by fundamental molecular properties and potential relationships with water, especially with aqueous clathrate membranes.

" ... how did the authors end up with the final set of scatter plots from which AA property classes were derived ... were they linear clusters ... did authors select scatterplots on basis of plotted variable relevance ... what was the criteria used to select most relevant scatterplots ..."

Our primary and starting premise was to look for single linear or non-linear patterns in all amino-acids in the scatterplot of any two given scales. Along the way we found that multiple linear series occur in scatterplots between many pairs of amino-acid properties, which was a pattern we could not ignore. Many of these multi-family linear series ranged from clearly discernible to very high quality, the latter end of the range being what we concentrated on. We also cross checked to see if sister AA property scales in the same class, such as hydrophobicity, resulted in similar patterns, such as we see in figures 1 and 2 and in table 2 of the manuscript. Once the linear/non-linear (single and multi-family) patterns were found, a detailed review of these patterns was made to find underlying physico-chemical and biological reasons as well as statistical generalizations. The most relevant scatter plots were selected based upon their quality (visual and linear/non-linear regression), their explanatory power for protein structure and function, and their reliability/specificity for protein alignments.

"... all property classes including #6, #7 & #8 should be precisely defined ... derived from 49 fundamental amino-acid properties/derived scales ... Analysis of Patterns (ANOPA) ... need expanded description ..."

We have defined property classes #1 through #5, but we will expand upon these definitions. We will define property classes #6-#8 with plots and physical explanations. We will describe the ANOPA procedure and explain the specific 49 amino-acid property scale ANOPA analysis used in this study. The 49 amino-acid property scales used in the ANOPA analysis are property scales gleaned from the literature, and no derivative scales were used for this portion of the overall analysis. We will add two new tables to define the 49 amino-acid properties used in the ANOPA analysis and the 124 statistically significant (by regression correlation) amino-acid property scales.

" ... show scatter plots for property classes #5 through #8 ..."

We agree to revise the manuscript accordingly.

"... need expanded introduction ... expand more on AA physico-chemical properties relevant to protein folding ..."

We agree to modify the manuscript according to these comments, except that we note that the scope of this paper is to define our hydrophobicity scale and its application to protein alignments, whereas we take up the challenge of applying this work to protein folding with an extensive analysis in our follow-up paper “Hydrophobicity Revisited: a Molecular Story.” This latter manuscript is in an advanced state of preparation and can be provided for review, which we recommend, since its treatment of the material goes well beyond what we can put into the current manuscript you have reviewed. We will add two new tables to define the 49 amino-acid properties used in the ANOPA analysis and the 124 statistically significant (by regression correlation) amino-acid property scales.

"... there is material in the results section (first paragraph) and discussion section (alignment matrices) that belong in the introduction section ..."

We will revise the manuscript per these comments.

"... hydrophobic scale chosen as optimal ... normalized average of 3 hydrophobicity scales ... most robust in correlation analysis ... robustness vaguely defined in methods section as association to multiple, fundamental AA properties using multivariate statistical procedures, thermodynamics and biophysical chemistry considerations ... later find out that this means R² values derived from MLR models ..."

We have expanded upon what this means in our responses to your review. We will incorporate this material into the manuscript and significantly beef up the methods section to accommodate the spirit of your comments. We also note that the statistical correlations were only part of the rationale for defining a “robust” hydrophobicity scale, which is based upon bringing a coherent theoretical analysis to bear upon this work as well. We will be adding some additional material requested by Dr. Carter that should speak to these comments. We offer to make available for your review the TMATCH theory and application papers to provide additional justification for what constitutes a robust hydrophobicity scale as we use the term.

"The methods section needs to be written more clearly"

We agree and will modify the manuscript accordingly.

"... Need better tag description for table 2 owing to the considerable discussion of table two columns in the manuscript ..."

We agree and will modify the manuscript accordingly.

Competing Interests

No competing interests were disclosed.

Reviewer Report

11 Jul 2016 | for Version 1

Charles Carter, Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

Approved With Reservations

Competing Interests

No competing interests were disclosed.

Responses (1)

Author Response

15 Oct 2020

David Cavanaugh, Benchmark Electronics, Huntsville, Alabama, USA

Charlie Carter

"... We should be able to do a better job at homology searches if more about how amino-acid physical chemistry leads to protein structure ... the authors allude to work they have done that demonstrates the value of the new scales they describe here, but there is essentially no coverage of this central question in this manuscript, which is disappointing and detracts substantially from the value of the paper"

This original work is described in 3 manuscripts, including this present paper. The other two manuscripts (theory and applications/results) describing the TMATCH alignment algorithm are almost ready for publication. These three manuscripts were all part of a single monograph that we had decided to partition owing to length and to the fact that the physical chemistry considerations of the current manuscript would be inappropriate for a bioinformatics journal; we also needed to get this manuscript published to support another published paper (also available for review). We will incorporate into the present manuscript an appropriate level of description of the TMATCH algorithm and actual results from these two manuscripts.
A fourth manuscript goes significantly more deeply into the biophysics and thermodynamics of hydrophobicity and the broader justification for our hydrophobicity proclivity scale from these two points of view. We would like to provide you a copy of these three other manuscripts, which are in an advanced state of preparation, in answer to a number of your points. We agree with you that more summary material needs to be brought into this manuscript.
We point out the excellent linear relationship between our hydrophobicity proclivity scale and the dpx average residue depth (from water) scale in figure 5 of the manuscript. The dpx scale is derived from a suite of proteins with actual solved folded structures; hence dpx is in effect a protein structural description of hydrophobicity where the deeper on average that a residue is buried, the more hydrophobic it is. Contrast the dpx concept with the ideas of average solvent exposed area, average buried area, percent buried, etc. When the relationships in figures 3-5 (with supporting literature) are taken together, the hydrophobicity potential driving protein folding is suggested to be the traditional solvent-solvent partitioning model of protein folding, where amino-acids partition between aqueous exposure and burial within the hydrophobic core of folded proteins. We believe that this traditional model of the primary driving force for globular protein folding is apt, but needs some updates and modifications that we develop in detail in the “Hydrophobicity revisited: A Molecular Story” manuscript. We suggest in the present manuscript that aqueous clathrates form about the hydrophobic area of amino-acids in an unfolded state (a structural feature) and that hydrophobic surface area on folded proteins possesses a surface tension like a gas bubble in water. We are suggesting that the physics driving the initial stages of protein folding is essentially the same physics that drives the coalescing/condensing of bubbles of non-polar species or gas in water, and that this physics causes the mutual attraction of residues based upon the residue hydrophobic area with its associated aqueous clathrate. We expand upon this hypothesis in “Hydrophobicity Revisited: A Molecular Story.”
Our aim was to try to develop a scale which reliably represented the central hydrophobic tendency (average) as a central/first order effect that allowed a simple but meaningful paired comparison of amino-acids for homology relationships. Adding additional variables/properties would in our opinion have detracted from our goals of simplicity of calculation and utilization of what we believe is the primary first order effect driving protein folding.

"... this index is the result of statistical modeling from a variety of different scales of what has been called 'hydrophobicity' and derivatives ... derivative variables not described ... presentation is interesting and potentially relevant, but is deficient in its citation of the literature and in results indicating their methods or results to which they allude ... well motivated ... suffers from a variety of methodological and conceptual problems ..."

We agree that an expansion of what we meant by derivative variables should be included in the manuscript.
The derivative variables referred to in the manuscript are primarily ratios of molecular properties or of directly measurable bulk properties of amino-acids. In this way we derive normalized intrinsic thermodynamical properties versus extrinsic/bulk properties. For example, we find that the ratio of amino-acid surface area to volume has a relationship to hydrophobicity.
Other derivative variables we used could be exemplified by estimating the water cavity volume/surface area using amino-acid volume and surface areas and some assumptions about the aqueous cavity geometry.
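As a rough illustration of the ratio-style derivative scales described above, the sketch below divides one per-residue property by another to obtain an intrinsic (normalized) scale. The residue labels and numbers are placeholders invented for the example, not measured amino-acid data, and the helper name `ratio_scale` is an assumption.

```python
def ratio_scale(numerator, denominator):
    """Divide two per-residue property tables key by key.

    Only residues present in both tables are kept.
    """
    return {
        residue: numerator[residue] / denominator[residue]
        for residue in numerator
        if residue in denominator
    }


# Placeholder values for three hypothetical residues (NOT measured data)
surface_area = {"X1": 100.0, "X2": 180.0, "X3": 120.0}
volume = {"X1": 80.0, "X2": 160.0, "X3": 90.0}

area_per_volume = ratio_scale(surface_area, volume)
print(area_per_volume)
```

The same helper would cover the mass-normalized scales mentioned elsewhere in these responses (for example, volume divided by molecular mass), given the corresponding input tables.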
We will be expanding our reference list of amino-acid scales utilized to better define the amino-acid property scales we utilized in addition to hydrophobicity scales.
We will also incorporate the work of Wolfenden et al. into the manuscript.

"... the authors have cited just about every effort to correlate a single hydrophobicity predictor with amino-acid surface exposure in folded proteins ... but have excluded an important hydrophobicity metric derived from classical physical chemistry methods by Wolfenden and co-workers using water/vapor and water/cyclohexane partition distributions of amino-acid side chain mimics ... Wolfenden has persuasively argued that n-octanol is a very unsatisfactory reference solvent for multiple reasons such as the ability of side chains to drag non-repeatable amounts of water into the n-Octanol solvent ... "

Since this experimental approach is unique, we will add material to the manuscript to incorporate these considerations, the work of Wolfenden et al. and the implications of that work.
We agree that the standardization of water concentration within a polar solvent (immiscible with water) capable of hydrogen bonding is difficult, in addition to the uncontrolled/variable amount of water dragged into the organic solvent phase in solvent/water partition experiments. We cover these issues at length in “Hydrophobicity Revisited: A Molecular Story,” but we agree that these considerations should be covered to some extent in the present manuscript. We also point out that solvent/water concentration ratios used to calculate free energies of transfer from water to the non-polar phase, as a hydrophobicity measure relevant to protein folding, suffer a systematic error if the organic solvent is incapable of hydrogen bonding, as is the hydrophobic “solvent like” core of folded globular proteins. Our hydrophobicity proclivity scale has no units, but rather is a normalized proclivity. We do show in figure 4 of the manuscript that our hydrophobicity proclivity scale is directly relatable to free energy of solvent-solvent transfer using a popular dGow (water to n-octanol) scale.
We also point out that characterizing the amino-acid secondary groups alone does not treat the important contribution of peptide bonds to the free energetics of folding, such as is accounted for by some researchers by treating the 20 amino-acids as guests in the center of tri-peptides in octanol to water free energy studies.

"... omitting the Wolfenden free energies is a grave oversight ... are looking for signal in a variety of data sets that have already been corrupted by similar unsuccessful attempts by previous investigators who have kludged the extant variety of multiple scales ... any useful result the present authors may have achieved is likely to be idiosyncratic and only indirectly based upon physical chemistry ..."

We will include and discuss the work of Wolfenden et al. due to its novelty and the additional insight that it provides.
We largely agree with these statements by Dr. Carter regarding the large profusion of hydrophobicity scales in the literature. Our criticisms are a bit more muted in that we recognize and discuss some of the difficulties in trying to reflect the wide range of amino-acid behaviors and the difficulties inherent in trying to define hydrophobicity as an experimentally measurable concept related to the driving forces of protein folding.
We have relied upon other experimentally measured amino-acid properties to cross check our hydrophobicity proclivity scale using methods such as Multiple Linear Regression (MLR) as can be seen in table 3. For example, we can draw a good linear correlation between certain hydrophobicity scales, such as our hydrophobicity proclivity scale, and amino-acid reverse phase (C₁₈ column) HPLC retention times (other researchers have drawn this conclusion as well).

"... The authors provide no evidence of statistical tests that might suggest significance ..."

We will provide both an F test from the MLR software we used and a Student’s T test of the MLR (8 amino-acid property class scales) correlation coefficient R. We will show these results in a new table with amino-acid properties versus the R², T and F significances. The MLR is significant where both the T test and the F test are significant at an alpha of 0.05 or better (actual alpha = 0.05*0.05 =0.0025 or better). These statistical significance tests are on the MLR results and not the individual MLR coefficients.
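The two whole-model statistics named above can be sketched from R² alone. This is a minimal illustration under stated assumptions: n = 20 amino-acids, k = 8 property class scales, a T test on R that uses n − k − 1 degrees of freedom (our reading of "reduce the number of degrees of freedom"), and a hypothetical R² of 0.90 that is not a result from the manuscript. P values are not computed here; they would come from the F and Student's t distributions in a statistics package.

```python
import math


def mlr_f_statistic(r2, n, k):
    """Overall F statistic for an MLR with k predictors and n observations."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


def correlation_t_statistic(r2, df):
    """t statistic for a correlation coefficient R = sqrt(r2) with df degrees of freedom."""
    return math.sqrt(r2) * math.sqrt(df / (1.0 - r2))


n, k = 20, 8        # 20 amino-acids, 8 property class scales
df = n - k - 1      # assumed reduced degrees of freedom: 11
r2 = 0.90           # hypothetical R², NOT a value from the manuscript

f_stat = mlr_f_statistic(r2, n, k)
t_stat = correlation_t_statistic(r2, df)

# Joint criterion as stated in the response: both tests at alpha = 0.05.
# The product is the response's own arithmetic and treats the tests as independent.
joint_alpha = 0.05 * 0.05

print(f"F({k}, {df}) = {f_stat:.3f}, t({df}) = {t_stat:.3f}, joint alpha = {joint_alpha:.4f}")
```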

"... the correlations they describe ... are very likely to be successful only in proportion to the number of parameters from which their models are built ... suspect some circular logic ..."

The derivative scales as discussed above were derived from fundamental molecular properties or experimentally measured, so there is no concern on that score.
The row scale property relationships with the properties in columns 1, 2 and 4 of table two reflect paired comparisons of independent amino-acid scales.
The eight property class scales in column three of table 3, reflected in MLR pair wise comparisons in column two, are independent, although 4 of the amino-acid property class scales are also reflected in columns 1, 2 and 4 of table 3. The other 4 amino-acid property class scales come from separate relationships as described in the manuscript. The last 3 amino-acid property class scales in table 4, derived from an ANOPA analysis, are based upon 49 amino-acid property scales, none of which appear in the row property scales, and thus are independent.
The 8 property class scales in table 4 result in statistically significant MLR relationships with about 50 amino-acid property scales not included in the ANOPA analysis of 49 amino-acid property scales, but the MLR back check of these 49 amino-acid property scales resulted in most of these 49 amino-acid property scales having statistically significant results.
Overall, about 150 amino-acid property scales were evaluated in the work.

"... probable that the inability of previous researchers to arrive at a single scale that predicts the accessible surface area in folded proteins arises because the problem itself is multi-dimensional ... authors describe a variety of classification schemes derived from attempts to rationalize scatterplots of amino acid properties ... authors mention useful additional classification likely related to size of AA side chain ..."

We agree that there are multiple factors involved with the free energetics of protein folding and that the available amino-acid property scales and hydrophobicity scales measure different aspects of the relationship of amino-acids to each other and to water. We believe that we should be able to relate fundamental molecular properties of amino-acids and water into a meaningful larger picture.
In the “Hydrophobicity Revisited: A Molecular Story” manuscript we demonstrate that the controlling factors for our hydrophobicity proclivity energy scale (regression corrected dGow) are: #H-bonds, polar surface area, non-polar surface area, diameter of the side chain, length of the side chain, residue volume, polar area/volume and surface area/entropy. The MLR regression coefficient of determination is 99.995% and is F test significant at 6.4x10^-15.
However, the point of the present work is to find a single scale that measures a central tendency of “hydrophobicity” with the assumption that residue hydrophobicity (the contrast with water) is the dominant relationship between amino-acids for the purposes of protein alignments.

"... Read Wolfenden and Carter ... AA free energies of vapor to water partition vs. side chain volume ... Wolfenden & Carter accurately predicts Moelbert ASA with 2D plot/linear regression with AA water/vapor and AA water/cyclohexane partition coefficients ..."

Agreed. We will cover this material at an appropriate level of detail given the objective of the present manuscript.

"... need description of TMATCH and homology searches ... use multitude of idiosyncratic hydrophobicity scales describing AA physical chemistry ... notably excluding the only reliable/authentic physical chemistry scale describing AA by Wolfenden and Carter ... concern that the hydrophobicity proclivity scale produces a linear combination that correlates well with the Moelbert ASA scale by hidden, but still circular reasoning ..."

We will provide the TMATCH theory and applications manuscripts to supplement the material available for review.
We will also include an appropriate amount of material from these two TMATCH manuscripts into the current manuscript.
The Juretic average, Rose percent buried and Neumaier X hydrophobicity scales are independent of each other and the Moelbert ASA scale, therefore there can be no circular or self-referential reasoning/logic involved. Our point is that these three hydrophobicity scales are the best performers that we found and that a normalized average of these three scales will be a top performer as a robust, central tendency hydrophobicity scale.
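The idea of a normalized average of three scales can be sketched as follows. The normalization used here (min-max rescaling of each scale to the range 0 to 1 before averaging) is an assumption made for illustration, since the exact normalization is not stated in this response; the residue labels and values are placeholders, not the Juretic, Rose or Neumaier scale values.

```python
def min_max_normalize(scale):
    """Rescale a per-residue table linearly so its values span 0 to 1."""
    lo, hi = min(scale.values()), max(scale.values())
    return {residue: (value - lo) / (hi - lo) for residue, value in scale.items()}


def normalized_average(*scales):
    """Average several scales after min-max normalization (assumed scheme)."""
    normalized = [min_max_normalize(s) for s in scales]
    shared = set.intersection(*(set(s) for s in normalized))
    return {
        residue: sum(s[residue] for s in normalized) / len(normalized)
        for residue in sorted(shared)
    }


# Placeholder scales for three hypothetical residues (NOT literature values)
scale_a = {"X1": 0.0, "X2": 5.0, "X3": 10.0}
scale_b = {"X1": 1.0, "X2": 2.0, "X3": 3.0}
scale_c = {"X1": -4.0, "X2": 2.0, "X3": 4.0}

combined = normalized_average(scale_a, scale_b, scale_c)
print(combined)
```

Normalizing first keeps a scale with a large numeric range from dominating the average; other schemes, such as z-scoring, would serve the same purpose.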
As seen above from the results in the “Hydrophobicity Revisited: A Molecular Story” manuscript, our hydrophobicity proclivity scale can be partitioned into the sums of distinct amino-acid molecular properties and as such the scale is directly relatable to first principles, which by definition are not circular in nature.
We believe we have demonstrated that our hydrophobicity scale embodies the central tendency (i.e. first order effect) of the hydrophobicity phenomenon.
We believe that the manuscript improvements requested by Dr. Jeroncic will obviate any additional concerns with circular reasoning/analysis.

"Seven references to papers where Wolfenden and/or Carter are one of the authors"

Agreed. Thanks.

Competing Interests

No competing interests were disclosed.

Alongside their report, reviewers assign a status to the article:

Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested

Approved with reservations - A number of small changes, sometimes more significant revisions are required to address specific details and improve the paper's academic merit.

Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions

[1] 1. Cornette JL, Cease KB, Margalit H, et al.: Hydrophobicity scales and computational techniques for detecting amphipathic structures in proteins. J Mol Biol. 1987; 195(3): 659–685. PubMed Abstract | Publisher Full Text

[2] 2. Li H, Tang C, Wingreen NS: Nature of driving force for protein folding: A result from analyzing the statistical potential. Phys Rev Lett. 1997; 79: 765–768. Publisher Full Text

[3] 3. Rose GD, Geselowitz AR, Lesser GJ: Hydrophobicity of amino acid residues in globular proteins. Science. 1985; 229(4716): 834–838. PubMed Abstract | Publisher Full Text

[4] 4. Kawashima S, Ogata H, Kanehisa M: AAindex: Amino Acid Index Database. Nucleic Acids Res. 1999; 27(1): 368–369. PubMed Abstract | Publisher Full Text | Free Full Text

[5] 5. Tomii K, Kanehisa M: Analysis of amino acid indices and mutation matrices for sequence comparison and structure prediction of proteins. Protein Eng. 1996; 9(1): 27–36. PubMed Abstract | Publisher Full Text

[6] 6. Creighton TE: Proteins: Structure and Molecular Properties. WH Freeman and Company. 2 edition. 1993. Reference Source

[7] 7. Karplus PA: Hydrophobicity regained. Protein Sci. 1997; 6(6): 1302–1307. PubMed Abstract | Publisher Full Text | Free Full Text

[8] 8. Cavanaugh DP, Sternberg RV: Analysis of morphological groupings using anopa, a pattern recognition and multivariate statistical method: A case study involving centrarchid fishes. J Biol Syst. 2004; 12(2). Publisher Full Text

[9] 9. Neumaier A, Huyer W, Bornberg-Bauer E: Hydrophobicity analysis of amino acids. 1999. Reference Source

[10] 10. Juretic D, Jeroncic A, Zucic D: Sequence analysis of membrane proteins with the web server split. Croat Chem Acta. 1999; 72(4): 975–997. Reference Source

[11] Engelman DM, Steitz TA, Goldman A: Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins. Annu Rev Biophys Biophys Chem. 1986; 15(1): 321–353.

[12] Hopp TP, Woods KR: Prediction of protein antigenic determinants from amino acid sequences. Proc Natl Acad Sci U S A. 1981; 78(6): 3824.

[13] Kyte J, Doolittle R: A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982; 157(1): 105–132.

[14] Eisenberg D, Weiss RM, Terwilliger CT, et al.: Hydrophobic moments and protein structure. Faraday Symp Chem Soc. 1982; 17: 109–120.

[15] Janin J: Surface and inside volumes in globular proteins. Nature. 1979; 277(5696): 491–492.

[16] Chothia C: Hydrophobic bonding and accessible surface area in proteins. Nature. 1974; 248(446): 338–339.

[17] Bordo D, Argos P: Suggestions for "safe" residue substitutions in site-directed mutagenesis. J Mol Biol. 1991; 217(4): 721–729.

[18] Online. Solvent accessibility. [Online Data]. Bordo Table 2: Solvent Exposed Area > 30 square angstroms, calculated from data taken from 55 proteins in the Brookhaven database, coming from 9 molecular families: globins, immunoglobulins, cytochromes c, serine proteases, subtilisins, calcium binding proteins, acid proteases, toxins and virus capsid proteins.

[19] Fauchere JL, Pliska VE: Amino acid scale: Hydrophobicity scale. Eur J Med Chem. 1983; 18: 369–375.

[20] Pintar A, Carugo O, Pongor S: Atom depth in protein structure and function. Trends Biochem Sci. 2003a; 28(11): 593–597.

[21] Pintar A, Carugo O, Pongor S: Atom depth as a descriptor of the protein interior. Biophys J. 2003b; 84(4): 2553–2561.

[22] Moelbert S, Emberly E, Tang C: Correlation between sequence hydrophobicity and surface-exposure pattern of database proteins. Protein Sci. 2004; 13(3): 752–762.

[23] Trinquier G, Sanejouand YH: Which effective property of amino acids is best preserved by the genetic code? Protein Eng. 1998; 11(3): 153–169.

[24] Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure. 1978; 5(3): 345–352.

[25] Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 1992; 89(22): 10915–10919.

[26] Gonnet GH, Cohen MA, Benner SA: Exhaustive matching of the entire protein sequence database. Science. 1992; 256(5062): 1443–1445.

[27] Brick K, Pizzi E: A novel series of compositionally biased substitution matrices for comparing Plasmodium proteins. BMC Bioinformatics. 2008; 9: 236.

[28] Keane TM, Creevey CJ, Pentony MM, et al.: Assessment of methods for amino acid matrix selection and their use on empirical data shows that ad hoc assumptions for choice of matrix are not justified. BMC Evol Biol. 2006; 6: 29.

[29] Kosiol C, Goldman N: Different versions of the Dayhoff rate matrix. Mol Biol Evol. 2005; 22(2): 193–199.

[30] Tseng YY, Liang J: Estimation of amino acid residue substitution rates at local spatial regions and application in protein function inference: a Bayesian Monte Carlo approach. Mol Biol Evol. 2006; 23(2): 421–436.

Table 3. Property Class Index Vectors #1 - #8.

Residue	PC 1	PC 2	PC 3	PC 4	PC 5	PC 6	PC 7	PC 8
A	0	1	1	2	2	1	2	2
C	0	0	2	2	4	1	2	2
D	2	1	1	3	0	1	1	0
E	2	1	1	3	1	0	0	0
F	1	0	0	0	3	1	2	1
G	0	1	2	3	1	1	2	1
H	2	1	0	2	2	1	2	1
I	0	0	1	0	4	0	2	3
K	2	1	0	3	2	1	1	0
L	0	0	2	1	4	1	3	3
M	1	0	1	1	3	1	2	2
N	2	1	1	3	0	1	1	0
P	1	1	1	3	3	0	1	1
Q	2	1	1	3	1	0	0	0
R	3	1	0	3	1	1	1	0
S	1	1	2	3	1	1	2	1
T	1	1	1	3	2	0	1	1
V	0	0	1	1	3	0	2	3
W	2	0	0	1	3	1	2	2
Y	2	0	1	2	2	0	1	1
