Keywords
Chemical space, active compounds, differential evolution, support vector regression, virtual screening, inverse QSAR
This article is included in the Japan Institutional Gateway gateway.
This article is included in the Cheminformatics gateway.
Chemical space, active compounds, differential evolution, support vector regression, virtual screening, inverse QSAR
Inverse quantitative structure-activity relationship (QSAR) analysis aims to identify values of descriptors used to generate a QSAR model that corresponds to high activity, and build structures of active compounds from these values1–4. The inverse QSAR process is challenging since numerical signatures of activity, if they can be determined, must be re-translated into viable chemical structures and active compounds, a task falling into the area of de novo compound design5–7. A predominant approach to inverse QSAR is the use of multiple linear regression (MLR) models to construct chemical graphs that correspond to an MLR equation1–4. Given this equation, a desired y (activity) value constrains relationships between descriptor settings. These constraints make it possible to derive vertex degree or edge sequences, from which chemical graphs might be constructed. For instance, specialized descriptors have been introduced for inverse QSAR on the basis of MLR equations and algorithms for constructing chemical graphs from these descriptors8–11. So far only few inverse QSAR studies have employed methods other than MLR. For example, it was attempted to construct chemical graphs from the centroid of activity of a set of compounds in Hilbert space defined by a kernel function12. In this case, a pre-image approximation algorithm was used to obtain coordinates in descriptor space and construct chemical graphs from these descriptor coordinates. Alternatively, inverse QSAR was divided into a two-stage process by separating the derivation of preferred descriptor values for a desired activity from the chemical graph construction phase13–15. Descriptor information corresponding to a given y value was represented via probability density functions, and regression analysis was performed using Gaussian mixture models in combination with cluster-wise MLR14. Subsequently, chemical graphs satisfying a set of descriptor values, or ranges of descriptor values, were generated by assembling ring systems and atom fragments with monotonically changing descriptors14. Following this approach, descriptor values must increase when adding an atom, ring system, or other structural fragment to a growing chemical graph. Applying Gaussian mixture models and cluster-wise MLR makes it possible to focus on the applicability domain14,15 of the underlying models.
The two-stage inverse QSAR process is conceptually based on an important premise adopted from conventional (forward) QSAR, i.e., the higher a predicted activity value is, the more desirable a chemical structure becomes. In two-stage inverse QSAR, this conjecture challenges the descriptor value generation phase because value combinations are ultimately desired that correspond to higher predicted activity than exhibited by any currently available training or test compound. In other words, descriptor settings should be optimized for predicted activity. For this purpose, the use of Gaussian mixture models and cluster-wise MLR left considerable room for improvement, due to its multi-parametric nature and tendency of overfitting if training data were organized into large number of clusters14.
In this work, the descriptor optimization challenge of two-stage inverse QSAR has been specifically addressed. We emphasize that the chemical graph construction phase of inverse QSAR is not subject of this work and beyond its scope. Rather, our focal point has been the development of a new methodology for optimizing descriptor settings with respect to higher than observed compound activity, as a prerequisite for candidate structure generation. Therefore, an evolutionary approach is introduced to identify descriptor coordinates that correspond to the highest predicted activity within the applicability domain of a given QSAR model. The methodology and results of proof-of-concept studies are presented in the following.
Inverse QSAR depends on the derivation of descriptor coordinates for a given model and data set. The goal of the methodology presented herein is finding desirable coordinates in a pre-defined descriptor space (x space) on the basis of a regression function f(x) representing a QSAR model. Confining the search to the applicability domain (AD) of the model translates this task into a constrained optimization problem (COP). The concept of the optimization is illustrated in Figure 1. Newly derived coordinates should be more desirable with respect to pre-defined evaluation criteria than any other data point used to construct the regression model. Accordingly, COP is formulated as follows:
A regression function f(x) fits training data to determine new coordinates in descriptor space. Optimized coordinates based on f(x) fall inside the training data range but yield a higher y value than any other data point.
where x ∈ Rd, f: Rd → R is the function to be optimized, gj: Rd → R is the j-th inequality function, and hj: Rd → R is the j-th equality function. The i-th component of x: xi falls into the range [li, ui].
For the purpose of our analysis, the following assignments are made:
x: descriptors;
–f(x): QSAR model;
–g1(x): AD model;
gk(x) (k = 2, …, j), h(x): constraints for descriptors;
li, ui: lower and upper bounds of i-th descriptor.
Constraints are applied to descriptors to ensure meaningful value ranges. For example, if the ‘number of heavy atoms’ (xp) and ‘number of hydrogen bond acceptors’ (xq) are selected descriptors, a value of five for the former and six for the latter would be impossible for any given data point (compound). Therefore, in order to prevent such settings, an inequality constraint is required and applied: xq – xp ≤ 0.
For addressing COP, the differential evolution (DE) algorithm originally introduced by Stone and Price16 is investigated herein, which has so far not been considered in inverse QSAR. However, given the conceptual simplicity and computational efficiency of DE, the algorithm has been successfully applied to solve optimization problems in other areas of science and engineering, for example, in scheduling of flow shops17. In addition, for deriving a COP solution efficiently, ε differential evolution (εDE) was introduced by Takahama et al. as an extension of DE18, illustrated in Figure 2. A candidate vector v for next generation (also called mutant vector) is derived on the basis of three randomly selected vectors:
The steps involved in evolutionary optimization are outlined. First, a candidate v is obtained from three randomly selected individuals by a differential operation. Second, a crossover operation is applied to an individual xi and the candidate. Third, the evaluation step involves ε level comparison of v and xi and results in the next individual.
where xr1, xr2, xr3 are different vectors from the current generation and F represents a scale parameter for the difference vector. If the i-th component of v: vi falls outside the range [li, ui], v is updated as follows:4
An exponential crossover operation with probability-based crossover points is applied to v (the probability is called CR). Either xi or v is selected as xi+1, the individual for the next generation, following ε level comparison of the corresponding vectors.
For prioritizing candidates, given constraints and the optimized function are taken into account. The constraint violation Φ(x) is defined as follows:
where Φ(x) represents the degree of violation, with p set to one. The ε level comparison determines the order between two sets of pair (f(x), Φ(x)):
where t represents a generation in DE. As a decreasing function of t, ε determines the tolerance of constraint violation and ε(t) is determined as follows:16
where xθ is the top θ-th individual, Tc determines the generation in which ε(t) becomes zero, and cp the convergence speed. During the optimization, Φ(x) gradually outweighs f(x). In the initial stages of the optimization ε(t) settings enable the selection of diverse candidates but convergence of the algorithm is determined by Φ(x) becoming zero. Accordingly, Tc was set to one herein.
For εDE optimization, any regression function can be employed. In this study, support vector regression (SVR)19 with ν parameter was applied and the AD was defined by one-class support vector machine (OCSVM)20 classification with ν parameter. This parameter ranges from zero to one and defines the upper bound of the fraction of margin error and lower bound of the fraction of support vectors. AD consists of regions where the output of OCSVM is greater than or equal to zero. For both SVR and OCSVM, the radial basis function (RBF) kernel: k(xi, xj) = (−γ||xi − xj||2) was used. A hyper parameter set {C, ν, γ} for ν-SVR was determined by cross validation of training data on the basis of Q2. For OCSVM model construction, γ was set to maximize the variance of the Gram matrix consisting of the kernel function of the training data21 and ν was set to 0.01.
Data points on a (x1, x2) plane were randomly generated for x1: [-2 3], x2: [-1, 4] to yield 50 training and 20 test instances. The corresponding y values were calculated using Mishra’s bird function (https://mpra.ub.uni-muenchen.de/2718/) adding a Gaussian error with a mean of zero and variance of one, defined as:
Three independent trials were carried out with different random number generators. Training and test data sets were plotted on the output domain of the bird function with color-coded true y values (Figure 3).
From ChEMBL22 (version 22), compound data sets were selected using the following criteria: Maximal assay confidence score of ‘9’, interaction relationship type ‘D’(direct), activity standard unit ‘nM’, activity standard type ‘Ki’, and activity standard relation ‘=’. When multiple Ki values were available for a compound, their geometric mean was calculated to yield its final potency value, provided all measurements fell into the same order of magnitude (otherwise, the compound was discarded). In-house implementation of substructure filters for pan assay interference compounds (PAINS)23 and other reactive molecules were applied to eliminate compounds with potential chemical liabilities. Filtering was not critical for modeling, but active compounds with sound chemical structures were desired. From qualifying data sets, nine activity classes were randomly selected, as summarized on Table 1.
Nine compound activity classes were taken from ChEMBL (version 22). For each activity class, the target ID (TID), number of compounds (CPDs) for which descriptor values were obtained, and number of descriptors following variable selection are reported.
For each compound, 45 descriptors were initially calculated using RDKit (http://www.rdkit.org). These descriptors included constitutional descriptors (e.g., MW, number of rings, number of rotatable bonds), topological descriptors (e.g., Chi and Kier indices24) and partial charge descriptors based on chemical graph’s topology (i.e., maximum of Gasteiger/Marsali partial charges25). From correlated pairs of descriptors exceeding a correlation coefficient of 0.9, only one was chosen. For each activity class, the final number of descriptors (variables) is reported in Table 1. Compounds from each class were randomly divided into equally sized training and test data sets.
To test the ability of virtual screening (VS) to identify new active compounds from optimized coordinates, ChEMBL (version 22) was used as a screening source. From 1,414,176 unique compounds passing the substructure filters, training molecules used for modeling of each activity class were removed. All remaining ChEMBL compounds provided a large screening source for VS. For screening compounds, descriptors were calculated as described above and the compounds falling inside the AD of each class-specific model were preselected. Active compounds from each activity class not used for training represented true-positive test instances, regardless of their potency values. The calculation of descriptors for more than 1.4 million screening compounds was computationally demanding and exceeded the requirements for coordinate optimization.
For ChEMBL screening compounds including test instances, two VS ranking were generated. First, Euclidean distances to optimized coordinates were calculated. In this case, compound potency was not considered for ranking. Second, pKi values were predicted for all screening compounds using the class-specific SVR models. The latter calculations were carried out to determine if true positives were highly ranked on the basis of activity predictions. The area under the receiver operating characteristic curves (AUC) was calculated as an evaluation criterion.
Two proof-of-concept studies were carried out, one using simulation data, the other compounds and their activities. For simulation data, AD and regression models were constructed with the training data from each trial. Training data range scaling within the interval [-1,1] was applied prior to model building. For the SVR models, preferred parameter settings were determined using 10-fold cross validation. Coordinate optimization was carried using individual training data points. Optimized coordinates were evaluated on the basis of true y and maximal training data y values.
The same protocol for coordinate optimization was applied to each compound activity class. Furthermore, for hyper-parameter optimization of SVR, five-fold cross validation was carried out. For εDE, predicted pKi values falling into the AD of each model were used to ensure that optimized coordinates were proximal to compound coordinates, as assessed by distance calculations. Furthermore, optimized coordinates were projected on principal component analysis (PCA) maps of the x space formed by the first and second PC.
The following εDE parameter settings were applied: Number of iterations, 1,000 and 10,000 for simulation and compounds data sets, respectively; F, 0.5; Tc,1; p, 1; CR, 0.9 for all data sets. An initial population was obtained using 50 vectors of training instances for simulation data and 511 to 1,193 vectors for training compounds, depending on the size of the compound data sets.
Finally, the ability of distance- and SVR-based VS to predict new active compounds was analyzed. Although de novo structure generation was beyond the scope of our investigation, VS might be considered as an alternative way to identify novel active compounds, which was thus examined in the context of our study.
All SVR models and ADs were constructed with Scikit-learn26 0.18.1 using Python. εDE was implemented in C++. Descriptors were calculated using RDKit interfaced with Python.
All selected compound entries were standardized by removal of ions and solvent molecules and structure regularization, according to the OEChem toolkit (v1.7.7; OpenEye Scientific Software, Inc. Santa Fe, NM, USA).
Optimization of descriptor coordinates for preferred values of a given model is a central aspect of the two-stage inverse QSAR process, for which currently only approximate solutions exist. Therefore, a more accurate methodology for coordinate optimization is highly desirable, as investigated herein. The evaluation of εDE as an optimization method for this critical task was inspired by previous results obtained for other types of optimization problems where this approach displayed better performance than alternative evolutionary methods, such as genetic algorithms or particle swarm optimization27–29. Moreover, ε-based lexicological comparison of individual feature vectors makes this algorithm straightforward to apply to problems where several constraints must be balanced, as is the case in inverse QSAR. Studies on simulation and compound data sets were designed to evaluate whether εDE was capable of effectively optimizing coordinates on the basis of a regression function.
For initial proof-of-concept, εDE-based search for optimized coordinates was carried out using simulation data generated as described above.
For the three simulation data sets, SVR models were built yielding optimized parameters {γ, ν, C} of {4, 0.25, 2}, {2, 0.25, 1}, {1, 0.125, 16}, respectively. For all OCSVM models, γ was 1. As reported in Table 2, these SVR models accounted for the output of the bird function in a statistically meaningful way.
For each trial, model performance was assessed on the basis of R2 and root mean square error (RMSE) values for training and test data sets.
Trial | Training | Test | ||
---|---|---|---|---|
R2 | RMSE | R2 | RMSE | |
1 | 1.00 | 1.10 | 0.87 | 5.17 |
2 | 0.96 | 3.22 | 0.70 | 12.73 |
3 | 0.97 | 3.17 | 0.84 | 11.99 |
Figure 4 shows the different prediction surfaces of the SVR models for the three trial sets. The surfaces of set one and two were overall similar, whereas the surface of set three differed from the others. In each case, however, individual vectors converged at a single point (Table 3) and optimized coordinates were located in regions of highest predicted y values (Figure 4). In set one, for which the SVR model overall best accounted for the bird function, a training data point was found adjacent to the optimized coordinates, which slightly exceeded the largest predicted y value (Table 3).
For each of three independent trials, optimized coordinates (red squares) and training data (black dots) were mapped on the SVR prediction surface.
For each simulation data trial, y values predicted by the SVR model are reported. For training data, the largest measured y value is given in parentheses. “Domain” is defined by x1 and x2 with a resolution of 0.005. AD refers to the applicability domain of the OCSVM model. For optimized coordinates, the result of the bird function is given in parentheses as the true y value. (i.e., the y value without error).
Trial | Training (measured y) | Domain | AD | Optimized coordinates (true y) |
---|---|---|---|---|
1 | 48.92 (50.02) | 50.19 | 50.19 | 50.19 (49.54) |
2 | 26.22 (29.46) | 31.14 | 31.14 | 31.14 (48.09) |
3 | 58.87 (55.18) | 71.94 | 59.40 | 59.41 (55.15) |
For set two, no training data were mapped to local maxima of the bird function, which resulted in a difficult regression scenario. The maximal measured y value in the training data was 29.46 and for optimized coordinates (1.51, 3.16), the predicted y value was 31.14, also slightly exceeding the largest measured y value. However, the predicted value was much smaller than the corresponding true y value of 48.09 (Table 3), due to the inherent regression limitation.
For set three, the maximal true y value within the domain was 56.18 at (-1.59, 0.06). In this case, several training data points were mapped to regions of y values into which optimized coordinates fell (Figure 4), leading to an extrapolative over-prediction of the corresponding y value of 71.94. However, this over-prediction was correctly balanced when the AD of the model was considered instead, leading to a value of 59.40 and a predicted y value for the optimized coordinates of 59.41 (Table 3).
Despite typical regression limitations highlighted by findings for set two and three, the results obtained for simulation data indicated the potential of εDE for predicting optimized coordinates. Importantly, all solutions converged to single vectors representing novel points in simulation data space with large predicted y values falling inside the AD. Taken together, these results indicated the principal potential of εDE for coordinate optimization on the basis of SVR modeling.
Next εDE optimization was applied to different compound activity classes. In each case, SVR models were derived, optimized coordinates determined, and activity values predicted.
For each compound class, optimized coordinates yielded larger predicted pKi values than any training or test set compound (Table 4), consistent with the methodological strategy. Optimized coordinates fell inside the AD of each model and were proximal to several active compounds. Nearest neighbors of optimized coordinates were mostly predicted to be highly active (Table 4), indicating the presence of smooth prediction surfaces in the vicinity of optimized coordinates.
For optimized coordinates, the predicted pKi value and the output of the OCSVM model for the applicability domain (AD) are reported. For training and test instances, the predicted pKi value and scaled distance from optimized coordinates are given for the nearest neighbor (NN).
Prediction surfaces were further characterized graphically by systematically comparing predicted pKi values of compounds and calculated distances to optimized coordinates. Figure 5 shows the results for two exemplary activity classes, and Figure S1 shows the results for all classes. For set 51 (5-HT1a receptor ligands) in Figure 5, many highly active compounds were located proximal to the optimized coordinates, indicating that these coordinates fell into a well-populated region of activity-relevant chemical space. For set 194 (factor X inhibitors), training and test compounds tended to exhibit higher predicted pKi with decreasing distance to the optimized coordinates, hence delineating regions of activity progression, which are relevant for compound optimization and exploitation of optimized coordinates.
For two exemplary activity classes, predicted pKi values are related to the scaled distance of the corresponding compounds to the optimized coordinates. Training data (cyan squares), test data (magenta squares), and optimized coordinates (green circle) are shown.
Data set compounds and optimized coordinates were also projected onto PCA plots of descriptor space (Figure 6,Figure S2). These projections revealed that optimized coordinates were central to activity class regions in feature space. Furthermore, Figure 7 shows structures of the three nearest neighbors of the optimized coordinates for sets 51 and 194. In both cases, these compounds were structural analogs. Hence, similarity in feature space corresponded to close structural relationships. Consequently, this would also apply to structure generation from optimized coordinates, which would result in additional analog(s), consistent with the principles of QSAR and inverse QSAR.
For the two activity classes from Figure 5, training data (cyan squares), test data (magenta squares), and optimized coordinates (green circle) were projected on a principal component (PC) plot derived from training data. For PC1 and PC2, contributions to data set variance are reported in %.
ChEMBL compounds were screened relative to optimized coordinates from the nine activity classes and Euclidian distances were determined. Furthermore, pKi values were predicted for screening compounds falling into the AD of each class-specific SVR model. The VS calculations ultimately led to alternative distance- and potency-based compound rankings. Table 5 reports screening compound statistics and VS results. For each activity class, screening compounds contained large numbers (497–1152) of true positive test instances. Distance-based VS yielded AUC values of at least 0.6 for five of nine activity classes, with a maximum of 0.76. For the four remaining classes, essentially random predictions were observed. Potency-based VS produced AUC values of greater than 0.6 for seven of nine classes, including values above 0.7 for three classes and a maximum of 0.77. Thus, potency-based predictions led to slightly better compound rankings than distance-based VS relative to optimized coordinates, but prediction accuracy was overall moderate at best. Moreover, the true positive ratio among the 30 top-ranked compounds was generally very low for both distance- and potency-based VS. Figure 8 shows exemplary potency prediction landscapes including optimized coordinates as a reference. Figure S3 shows these representations for all activity classes. Highly potent compounds proximal to optimized coordinates were predicted for several activity classes. However, most true positives were not separated from the bulk of ChEMBL screening compounds on the basis of potency predictions. Overall the ability of VS calculations to identify novel active compounds and separate them from false positives was only limited. Thus, although de novo structure generation from optimized coordinates is challenging, it would be difficult to replace the structure generation step in two-stage inverse QSAR with standard VS calculations. However, despite limited prediction accuracy, the VS calculations provided support for the chemical relevance of optimized coordinates because for each activity class, at least few true positives were among top-scoring screening compounds and proximal to optimized coordinates.
For the two activity classes from Figure 5, predicted pKi values are plotted against the scaled distance of corresponding compounds to the optimized coordinates. ChEMBL compounds (gray squares) and test compounds according to Figure 5 (magenta squares) falling inside the applicability domain are shown. Optimized coordinates are displayed as a green circle.
Compound (CPD) statistics and VS results for distance-based compound rankings relative to optimized coordinates and potency-based rankings are reported.
The optimization of coordinates in feature space for high activity values predicted with a regression model is a central task in two-stage inverse QSAR. For this multi-constraint optimization problem, no generally applicable approach is currently available. The evaluation of differential evolution for coordinate optimization, as reported herein, was motivated by the successful application of this algorithm in areas of science other than chemistry. The study has provided proof-of-concept evidence that εDE is suitable for generating optimized coordinates in given feature spaces. For different compound classes, consistent predictions were obtained for εDE in combination with SVR, displaying robust convergence behavior and yielding optimized coordinates that not only met statistical and data set requirements, but were also chemically relevant, as indicated by compound mapping and distance- or potency-based VS calculations. However, due to limited prediction accuracy, distance-based VS relative to optimized coordinates would not be suitable to replace the de novo structure generation step in inverse QSAR, at least not on the basis of our reference calculations. Regardless, encouraging results were obtained for coordinate optimization. Taken together, the findings reported herein indicate that εDE optimization has the potential to further advance inverse QSAR analysis.
The data sets used in this study are freely available in ChEMBL via the identifiers reported in the manuscript.
The project leading to this report has received funding (for TM) from the Japan Society for the Promotion of Science (JSPS) under the JSPS KAKENHI Grant Number 16J05325.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Supplementary Figures S1-S3: Prediction surfaces and PCA projections for different activity classes and prediction surfaces for the ChEMBL screening data set, respectively. Click here to access the data.
Views | Downloads | |
---|---|---|
F1000Research | - | - |
PubMed Central
Data from PMC are received and updated monthly.
|
- | - |
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Yes
Competing Interests: Prof Bajorath and I are Gateway Advisors for the channel in which the article is published.
Reviewer Expertise: computational chemistry, cheminformatics
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
No source data required
Are the conclusions drawn adequately supported by the results?
Yes
References
1. Gómez-Bombarelli R, Duvenaud D, Hernández-Lobato JM, Aguilera-Iparraguirre J, et al.: Automatic chemical design using a data-driven continuous representation of molecules. arXiv. 2017. Reference SourceCompeting Interests: No competing interests were disclosed.
Reviewer Expertise: machine learning, cheminformatics, algorithms
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
No source data required
Are the conclusions drawn adequately supported by the results?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Molecular modelling, drug design
Alongside their report, reviewers assign a status to the article:
Invited Reviewers | |||
---|---|---|---|
1 | 2 | 3 | |
Version 2 (revision) 06 Sep 17 |
|||
Version 1 31 Jul 17 |
read | read | read |
Provide sufficient details of any financial or non-financial competing interests to enable users to assess whether your comments might lead a reasonable person to question your impartiality. Consider the following examples, but note that this is not an exhaustive list:
Sign up for content alerts and receive a weekly or monthly email with all newly published articles
Already registered? Sign in
The email address should be the one you originally registered with F1000.
You registered with F1000 via Google, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Google account password, please click here.
You registered with F1000 via Facebook, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Facebook account password, please click here.
If your email address is registered with us, we will email you instructions to reset your password.
If you think you should have received this email but it has not arrived, please check your spam filters and/or contact for further assistance.
Comments on this article Comments (0)