Exploring differential evolution for inverse QSAR analysis

Inverse quantitative structure-activity relationship (QSAR) modeling encompasses the generation of compound structures from values of descriptors corresponding to high activity predicted with a given QSAR model. Structure generation proceeds from descriptor coordinates optimized for activity prediction. Herein, we concentrate on the first phase of the inverse QSAR process and introduce a new methodology for coordinate optimization, termed differential evolution (DE), that originated from computer science and engineering. Using simulation and compound activity data, we demonstrate that DE in combination with support vector regression (SVR) yields effective and robust predictions of optimized coordinates satisfying model constraints and requirements. For different compound activity classes, optimized coordinates are obtained that exclusively map to regions of high activity in feature space, represent novel positions for structure generation, and are chemically meaningful.


Introduction
Inverse quantitative structure-activity relationship (QSAR) analysis aims to identify values of descriptors used to generate a QSAR model that corresponds to high activity, and build structures of active compounds from these values [1][2][3][4] . The inverse QSAR process is challenging since numerical signatures of activity, if they can be determined, must be re-translated into viable chemical structures and active compounds, a task falling into the area of de novo compound design [5][6][7] . A predominant approach to inverse QSAR is the use of multiple linear regression (MLR) models to construct chemical graphs that correspond to an MLR equation [1][2][3][4] . Given this equation, a desired y (activity) value constrains relationships between descriptor settings. These constraints make it possible to derive vertex degree or edge sequences, from which chemical graphs might be constructed. For instance, specialized descriptors have been introduced for inverse QSAR on the basis of MLR equations and algorithms for constructing chemical graphs from these descriptors [8][9][10][11] . So far only few inverse QSAR studies have employed methods other than MLR. For example, it was attempted to construct chemical graphs from the centroid of activity of a set of compounds in Hilbert space defined by a kernel function 12 . In this case, a pre-image approximation algorithm was used to obtain coordinates in descriptor space and construct chemical graphs from these descriptor coordinates. Alternatively, inverse QSAR was divided into a two-stage process by separating the derivation of preferred descriptor values for a desired activity from the chemical graph construction phase [13][14][15] . Descriptor information corresponding to a given y value was represented via probability density functions, and regression analysis was performed using Gaussian mixture models in combination with cluster-wise MLR 14 . Subsequently, chemical graphs satisfying a set of descriptor values, or ranges of descriptor values, were generated by assembling ring systems and atom fragments with monotonically changing descriptors 14 . Following this approach, descriptor values must increase when adding an atom, ring system, or other structural fragment to a growing chemical graph. Applying Gaussian mixture models and cluster-wise MLR makes it possible to focus on the applicability domain 14,15 of the underlying models.
The two-stage inverse QSAR process is conceptually based on an important premise adopted from conventional (forward) QSAR, i.e., the higher a predicted activity value is, the more desirable a chemical structure becomes. In two-stage inverse QSAR, this conjecture challenges the descriptor value generation phase because value combinations are ultimately desired that correspond to higher predicted activity than exhibited by any currently available training or test compound. In other words, descriptor settings should be optimized for predicted activity. For this purpose, the use of Gaussian mixture models and cluster-wise MLR left considerable room for improvement, due to its multi-parametric nature and tendency of overfitting if training data were organized into large number of clusters 14 . Recently, autoencoder modeling was proposed as an approach for two-stage inverse QSAR 16 . Continuous latent space, corresponding to a descriptor space, is constructed on the basis of encoding a line notation of a molecule by recurrent neural networks (RNNs). Following this methodology, optimized coordinates in latent space can be directly translated into another line notation by the decoder consisting of RNNs. As such, the approach does not depend on chosen descriptors and has the potential to automatically address two-stage inverse QSAR in a single step. However, the generation of new valid line notations (SMILES strings) for chemical structures corresponding to optimized coordinates was difficult in a case study designing organic light-emitting diodes 16 .
In this work, the descriptor optimization challenge of two-stage inverse QSAR has been specifically addressed. We emphasize that the chemical graph construction phase of inverse QSAR is not subject of this work and beyond its scope. Rather, our focal point has been the development of a new methodology for optimizing descriptor settings with respect to higher than observed compound activity, as a prerequisite for candidate structure generation. Therefore, an evolutionary approach is introduced to identify descriptor coordinates that correspond to the highest predicted activity within the applicability domain of a given QSAR model. The methodology and results of proof-of-concept studies are presented in the following.

Methodological concept
Inverse QSAR depends on the derivation of descriptor coordinates for a given model and data set. The goal of the methodology presented herein is finding desirable coordinates in a pre-defined descriptor space (x space) on the basis of a regression function f(x) representing a QSAR model. Confining the search to the applicability domain (AD) of the model translates this task into a constrained optimization problem (COP). The concept of the optimization is illustrated in Figure 1. Newly derived coordinates should be more desirable with respect to pre-defined evaluation criteria than any other data point used to construct the regression model. Accordingly, COP is formulated as follows: g j (x) ≤ 0, j = 1, …, q h j (x) = 0, j = q+1, …, m

Amendments from Version 1
In response to the reviewers, we have addressed all issues raised by Hans Matter, incorporated the autoencoder study suggested by Brian Goldman and discussed how families of structures might be evolved. In addition, as suggested by Pat Walters, we make the data sets and descriptors available as an open access deposition (a reference implementation is in the works). Furthermore, we have added an example for deriving the applicability domain and specified the RMSE in Table 2. Finally, we note that structures in Figure 7 were partly compromised due to a reproduction error from a PDF file (the drawings have been corrected) and that standard distance-and potency-based compound rankings were provided as virtual screening reference calculations. In addition, an MMP-based search for potent analogs of compounds proximal to optimized coordinates has also been carried out. Figure 1. Optimization concept. A regression function f(x) fits training data to determine new coordinates in descriptor space. Optimized coordinates based on f(x) fall inside the training data range but yield a higher y value than any other data point.

REVISED
where x ∈ R d , f: R d → R is the function to be optimized, g j : R d → R is the j-th inequality function, and h j : R d → R is the j-th equality function. The i-th component of x: For the purpose of our analysis, the following assignments are made: x: descriptors; Constraints are applied to descriptors to ensure meaningful value ranges. For example, if the 'number of heavy atoms' (x p ) and 'number of hydrogen bond acceptors' (x q ) are selected descriptors, a value of five for the former and six for the latter would be impossible for any given data point (compound). Therefore, in order to prevent such settings, an inequality constraint is required and applied: x q -x p ≤ 0.
ε Differential evolution For addressing COP, the differential evolution (DE) algorithm originally introduced by Stone and Price 17 is investigated herein, which has so far not been considered in inverse QSAR. However, given the conceptual simplicity and computational efficiency of DE, the algorithm has been successfully applied to solve optimization problems in other areas of science and engineering, for example, in scheduling of flow shops 18 . In addition, for deriving a COP solution efficiently, ε differential evolution (εDE) was introduced by Takahama An exponential crossover operation with probability-based crossover points is applied to v (the probability is called CR). Either x i or v is selected as x i+1 , the individual for the next generation, following ε level comparison of the corresponding vectors.
ε Level comparison For prioritizing candidates, given constraints and the optimized function are taken into account. The constraint violation Φ(x) is defined as follows: where Φ(x) represents the degree of violation, with p set to one. The ε level comparison determines the order between two sets of pair (f(x), Φ(x)): , if , , where t represents a generation in DE. As a decreasing function of t, ε determines the tolerance of constraint violation and ε(t) is determined as follows: 17 where x θ is the top θ-th individual, T c determines the generation in which ε(t) becomes zero, and cp the convergence speed. During the Figure 2. Evolutionary algorithm. The steps involved in evolutionary optimization are outlined. First, a candidate v is obtained from three randomly selected individuals by a differential operation. Second, a crossover operation is applied to an individual x i and the candidate. Third, the evaluation step involves ε level comparison of v and x i and results in the next individual.
optimization, Φ(x) gradually outweighs f(x). In the initial stages of the optimization ε(t) settings enable the selection of diverse candidates but convergence of the algorithm is determined by Φ(x) becoming zero. Accordingly, T c was set to one herein.

Regression and applicability domain models
For εDE optimization, any regression function can be employed. In this study, support vector regression (SVR) 20 with ν parameter was applied and the AD was defined by one-class support vector machine (OCSVM) 21 classification with ν parameter. This parameter ranges from zero to one and defines the upper bound of the fraction of margin error and lower bound of the fraction of support vectors. AD consists of regions where the output of OCSVM is greater than or equal to zero. For both SVR and OCSVM, the radial basis function (RBF) kernel: k(x i , x j ) = exp (−γ||x i − x j || 2 ) was used. A hyper parameter set {C, ν, γ} for ν-SVR was determined by cross validation of training data on the basis of Q 2 . For OCSVM model construction, γ was set to maximize the variance of the Gram matrix consisting of the kernel function of the training data 22 and ν was set to 0.01.

Simulation data
Data points on a (x 1 , x 2 ) plane were randomly generated for x 1 : [-2 3], x 2 : [-1, 4] to yield 50 training and 20 test instances. The corresponding y values were calculated using Mishra's bird function (https://mpra.ub.uni-muenchen.de/2718/) adding a Gaussian error with a mean of zero and variance of one, defined as: Three independent trials were carried out with different random number generators. Training and test data sets were plotted on the output domain of the bird function with color-coded true y values ( Figure 3).

Compound data sets
From ChEMBL 23 (version 22), compound data sets were selected using the following criteria: Maximal assay confidence score of '9', interaction relationship type 'D'(direct), activity standard unit 'nM', activity standard type 'K i ', and activity standard relation '='. When multiple K i values were available for a compound, their geometric mean was calculated to yield its final potency value, provided all measurements fell into the same order of magnitude (otherwise, the compound was discarded). In-house implementation of substructure filters for pan assay interference compounds (PAINS) 24 and other reactive molecules were applied to eliminate compounds with potential chemical liabilities. Filtering was not critical for modeling, but active compounds with sound chemical structures were desired. From qualifying data sets, nine activity classes were randomly selected, as summarized on Table 1.
For each compound, 44 descriptors were initially calculated using RDKit (http://www.rdkit.org). These descriptors included constitutional descriptors (e.g., MW, number of rings, number of rotatable bonds), topological descriptors (e.g., Chi and Kier indices 25 ) and partial charge descriptors based on chemical graph's topology (i.e., maximum of Gasteiger/Marsali partial charges 26 ). The set of 44 descriptors and the nine compound data sets used herein are made available in an open access deposition on the ZENODO platform 27 . From correlated pairs of descriptors exceeding a correlation coefficient of 0.9, only one was chosen. For each activity class, the final number of descriptors (variables) is reported in Table 1. Compounds from each class were randomly divided into equally sized training and test data sets.

Virtual screening
To test the ability of virtual screening (VS) to identify new active compounds from optimized coordinates, ChEMBL (version 22) was used as a screening source. From 1,414,176 unique compounds passing the substructure filters, training molecules used for modeling of each activity class were removed. All remaining ChEMBL compounds provided a large screening source for VS. For screening compounds, descriptors were calculated as described above and the compounds falling inside the AD of each class-specific model were preselected. Active compounds from each activity class not used for training . Simulation data sets. In three independent trials, simulation data sets were generated. For each trial, training (black dots) and test (blue squares) data are shown with true y values produced by the bird function f(x 1 , x 2 ). represented true-positive test instances, regardless of their potency values. The calculation of descriptors for more than 1.4 million screening compounds was computationally demanding and exceeded the requirements for coordinate optimization.
For ChEMBL screening compounds including test instances, two VS ranking were generated. First, Euclidean distances to optimized coordinates were calculated. In this case, compound potency was not considered for ranking. Second, pK i values were predicted for all screening compounds using the class-specific SVR models. The latter calculations were carried out to determine if true positives were highly ranked on the basis of activity predictions. The area under the receiver operating characteristic curves (AUC) was calculated as an evaluation criterion.

Analysis protocol
Two proof-of-concept studies were carried out, one using simulation data, the other compounds and their activities. For simulation data, AD and regression models were constructed with the training data from each trial. Training data range scaling within the interval [-1,1] was applied prior to model building. For the SVR models, preferred parameter settings were determined using 10-fold cross validation. Coordinate optimization was carried using individual training data points. Optimized coordinates were evaluated on the basis of true y and maximal training data y values.
The same protocol for coordinate optimization was applied to each compound activity class. Furthermore, for hyper-parameter optimization of SVR, five-fold cross validation was carried out. For εDE, predicted pK i values falling into the AD of each model were used to ensure that optimized coordinates were proximal to compound coordinates, as assessed by distance calculations. Furthermore, optimized coordinates were projected on principal component analysis (PCA) maps of the x space formed by the first and second PC. A matched molecular pair (MMP)-based 28 analog search for compounds proximal to the optimized coordinates was also carried out to investigate if analogs exhibited lower (higher) potency when they were distant from (close to) the optimized coordinates. Based on the distance to the optimized coordinates, the top 30 compounds from both training and test data sets were selected. MMPs qualified for this analysis if participating compounds displayed a potency dif-ference of at least one order of magnitude.
The following εD E parameter settings were applied: Number of iter-ations, 1,000 and 10,000 for simulation and compounds data sets, respectively; F, 0.5; T c ,1; p, 1; CR, 0.9 for all data sets. An initial population was obtained using 50 vectors of training instances for simulation data and 511 to 1,193 vectors for training compounds, depending on the size of the compound data sets.
Finally, the ability of distance-and SVR-based VS to predict new active compounds was analyzed. Although de novo structure generation was beyond the scope of our investigation, VS might be considered as an alternative way to identify novel active compounds, which was thus examined in the context of our study.

Implementation
All SVR models and ADs were constructed with Scikit-learn 29 0.18.1 using Python. εDE was implemented in C++. Descriptors were calculated using RDKit interfaced with Python.
All selected compound entries were standardized by removal of ions and solvent molecules and structure regularization, according to the OEChem toolkit (v1.7.7; OpenEye Scientific Software, Inc. Santa Fe, NM, USA).

Results and Discussion
Differential evolution for inverse QSAR Optimization of descriptor coordinates for preferred values of a given model is a central aspect of the two-stage inverse QSAR process, for which currently only approximate solutions exist. Therefore, a more accurate methodology for coordinate optimization is highly desirable, as investigated herein. The evaluation of εDE as an optimization method for this critical task was inspired by previous results obtained for other types of optimization problems where this approach displayed better performance than alternative evolutionary methods, such as genetic algorithms or particle swarm optimization [30][31][32] . Moreover, ε-based lexicological comparison of individual feature vectors makes this algorithm straightforward to apply to problems where several constraints must be balanced, as is the case in inverse QSAR. Studies on simulation and compound data sets were designed to evaluate whether εDE was capable of effectively optimizing coordinates on the basis of a regression function.

Investigating simulation data
For initial proof-of-concept, εDE-based search for optimized coordinates was carried out using simulation data generated as described above.
As reported in Table 2, these SVR models accounted for the output of the bird function in a statistically meaningful way.
Standard deviations in the test data sets for trials 1, 2, and 3 were 14.55, 23.21, and 30.42, respectively. Figure 4 shows the  different prediction surfaces of the SVR models for the three trial sets. The surfaces of set one and two were overall similar, whereas the surface of set three differed from the others. In each case, however, individual vectors converged at a single point (Table 3) and optimized coordinates were located in regions of highest predicted y values (Figure 4). In set one, for which the SVR model overall best accounted for the bird function, a training data point was found adjacent to the optimized coordinates, which slightly exceeded the largest predicted y value (Table 3).
For set two, no training data were mapped to local maxima of the bird function, which resulted in a difficult regression scenario. The maximal measured y value in the training data was 29.46 and for optimized coordinates (1.51, 3.16), the predicted y value was 31.14, also slightly exceeding the largest measured y value. However, the predicted value was much smaller than the corresponding true y value of 48.09 (Table 3), due to the inherent regression limitation.
For set three, the maximal true y value within the domain was 56.18 at (-1.59, 0.06). In this case, several training data points were mapped to regions of y values into which optimized coordinates fell (Figure 4), leading to an extrapolative over-prediction of the corresponding y value of 71.94. However, this over-prediction was correctly balanced when the AD of the model was considered instead, leading to a value of 59.40 and a predicted y value for the optimized coordinates of 59.41 (Table 3).
Despite typical regression limitations highlighted by findings for set two and three, the results obtained for simulation data indicated the potential of εDE for predicting optimized coordinates. Importantly, all solutions converged to single vectors representing novel points in simulation data space with large predicted y values falling inside the AD. Taken together, these results indicated the principal potential of εDE for coordinate optimization on the basis of SVR modeling.
Coordinate optimization for compound data sets Next εDE optimization was applied to different compound activity classes. In each case, SVR models were derived, optimized coordinates determined, and activity values predicted.
For each compound class, optimized coordinates yielded larger predicted pK i values than any training or test set compound (Table 4), consistent with the methodological strategy. Optimized coordinates fell inside the AD of each model and were proximal to several active compounds. When setting the ν parameter value to 0.1, hence restricting compound numbers within the AD, optimized coordinates remained unchanged compared to the ν parameter setting of 0.01. Nearest neighbors of optimized coordinates were mostly predicted to be highly active (Table 4), indicating the presence of smooth prediction surfaces in the vicinity of optimized coordinates.
Prediction surfaces were further characterized graphically by systematically comparing predicted pK i values of compounds and calculated distances to optimized coordinates. Figure 5 shows the results for two exemplary activity classes, and Figure S1 shows the results for all classes. For set 51 (5-HT1a receptor ligands) in Figure 5, many highly active compounds were located proximal to the optimized coordinates, indicating that these coordinates fell into a well-populated region of activity-relevant chemical space. For set 194 (factor X inhibitors), training and test compounds tended to exhibit higher predicted pK i with decreasing distance to the optimized coordinates, hence delineating regions of activity progression, which are relevant for compound optimization and exploitation of optimized coordinates.
Data set compounds and optimized coordinates were also projected onto PCA plots of descriptor space ( Figure 6, Figure S2). These projections revealed that optimized coordinates were central to   activity class regions in feature space. Furthermore, Figure 7 shows structures of the three nearest neighbors of the optimized coordinates for sets 51 and 194. In both cases, these compounds were structural analogs. Hence, similarity in feature space corresponded to close structural relationships. Consequently, this would also apply to structure generation from optimized coordinates, which would result in additional analog(s), consistent with the principles of QSAR and inverse QSAR.
A search was carried out for analogs of the top 30 compounds based on the distance to the optimized coordinates and the consistency ratio was calculated for each target ( Table 5). The number of qualifying MMPs ranged from 14 to 464. The consistency ratios were greater than 0.5 for eight of nine activity classes, implying that structures approaching optimized coordinates may have increasing potency.
The εDE protocol can also be modified to generate a family of solutions. One possibility is splitting a data set into subsets of structurally related compounds, followed by the εDE optimization for finding the optimized coordinates on a per subset basis. Furthermore, since εDE is a methodology for solving a COP, it is also possible to incorporate multiple constraints as solubility and toxicity during the optimization process as long as these constraints are formulated as inequality equations using the same descriptors as for the objective function.  Virtual screening ChEMBL compounds were screened relative to optimized coordinates from the nine activity classes and Euclidian distances were determined. Furthermore, pK i values were predicted for screening compounds falling into the AD of each class-specific SVR model. The AD was simply defined as the region where the OCSVM output was greater than or equal to zero. Since the nu value of the OCSVM models was set to 0.01, most of the training and test compounds were classified inside the AD. For example, for data set 51, 97% of the training and 96% of the test compounds fell inside the AD. The VS calculations ultimately led to alternative distanceand potency-based compound rankings. Table 6 reports screening compound statistics and VS results. For each activity class, screening compounds contained large numbers (497-1152) of true positive test instances. Distance-based VS yielded AUC values of at least 0.6 for five of nine activity classes, with a maximum of 0.76. For the four remaining classes, essentially random predictions were observed. Potency-based VS produced AUC values of greater than 0.6 for seven of nine classes, including values above 0.7 for three classes and a maximum of 0.77. Thus, potency-based predictions led to slightly better compound rankings than distance-based VS relative to optimized coordinates, but prediction accuracy was overall moderate at best. Moreover, the true positive ratio among the 30 top-ranked compounds was generally very low for both distance-and potency-based VS. Figure 8 shows exemplary potency prediction landscapes including optimized coordinates as a reference. Figure S3 shows these representations for all activity classes. Highly potent compounds proximal to optimized coordinates were predicted for several activity classes. However, most true positives were not separated from the bulk of ChEMBL screening compounds on the basis of potency predictions. Overall the ability of VS calculations to identify novel active compounds and separate them from false positives was only limited. Thus, although de novo structure generation from optimized coordinates is challenging, it would be difficult to replace the structure generation step in two-stage inverse QSAR with standard VS calculations. However, despite limited prediction accuracy, the VS calculations provided support for the chemical relevance of optimized coordinates because for each activity class, at least few true positives were among top-scoring screening compounds and proximal to optimized coordinates.

Conclusions
The optimization of coordinates in feature space for high activity values predicted with a regression model is a central task in twostage inverse QSAR. For this multi-constraint optimization problem, no generally applicable approach is currently available. The evaluation of differential evolution for coordinate optimization, as reported herein, was motivated by the successful application of this algorithm in areas of science other than chemistry. The study has provided proof-of-concept evidence that εDE is suitable for generating optimized coordinates in given feature spaces. For different compound classes, consistent predictions were obtained for εDE in combination with SVR, displaying robust convergence behavior and yielding optimized coordinates that not only met statistical and data set requirements, but were also chemically relevant, as indicated by compound mapping and distance-or 1. This paper describes a new method for approaching one aspect of the inverse QSAR problem. As the authors point out, the inverse QSAR problem can be divided into two steps.

Open Peer Review
The generation of a set of coordinates, in some multidimensional space, that corresponds to one or more optimal compounds. The generation of molecular graphs that would produce chemical structures with descriptor values corresponding to these coordinates. This paper focuses on the first problem and does not address the second.
The paper is well written and the topic will be of interest to computational chemists working in both industry and academia. The methodology is described well and an individual with some QSAR expertise should be able to reproduce this work. However, in the interest of making it easier to reproduce the work described here, and making the method more widely available, it would be useful for the authors to make a reference implementation available. On a similar note, the authors mention that the ChEMBL datasets are available from the original source. As a service to those readers who would like to reproduce the work, it would be useful if the authors provided the specific datasets used in this paper as a download. As part of this download, the authors could also include a list of specific descriptors used and annotate which compounds were rejected due to PAINS filters.
The definition of the applicability domain used in this paper could be expanded. It would be useful for the authors to provide a specific worked example demonstrating how the applicability domain is defined for one of the ChEMBL examples described in the paper. This example could be expanded to demonstrate how a set of optimal coordinates is defined.
One other beneficial addition to this paper would be a comparison with established methods. The authors provide the results of virtual screening based on their method but do not provide a comparison with commonly used techniques. One way to do this would be to provide a simple comparison with an SVM model for activity calculated based on the same descriptions. A comparison with activity calculated based on nearest neighbors in a simple principal component space could also be provided.
A few specific comments: It is unclear what the RMSE value is in Table 2. This is on an arbitrary dataset, how should RMSE be interpreted? interpreted?
There appear to be a number of errors in the chemical structures in Figure 7.

Is the work clearly and accurately presented and does it cite the current literature? Yes
Is the study design appropriate and is the work technically sound? Yes

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Yes
Prof Bajorath and I are Gateway Advisors for the channel in which the article is Competing Interests: published.
Referee Expertise: computational chemistry, cheminformatics I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard. 25 August 2017 Referee Report doi:10.5256/f1000research.13238.r24700

Brian Goldman
Modeling & Informatics, Vertex Pharmaceuticals, Boston, MA, USA This is an extremely well written submission by Bajorath detailing an important aspect of the inverse et. al QSAR problem, namely the generation of a feature vector optimizing the output of a QSAR equation. An interesting and novel component of the work is that constrained optimization is utilized ensuring generated solutions lie within the domain of applicability of the QSAR model. This paper is primarily of theoretical interest in that due to inherent limitations with inverse-QSAR the technique is rarely adopted as a means of drug discovery. The primary reason for the lack of adoption revolving around the de-novo design of compounds matching optimized descriptor values.
The introduction to the paper covers the relevant literature but could be made stronger by discussing the recent work of Gómez-Bombarelli who use generative models to approach the inverse-QSAR et. al. problem. Their work concerns building and optimizing QSAR equations in the latent space of an autoencoder and subsequently decoding optimized points into molecular structures. Although the work 1 autoencoder and subsequently decoding optimized points into molecular structures. Although the work by Gómez-Bombarelli is preliminary, it addresses both QSAR optimization and structure generation and would most likely be interesting to readers of the current paper.
An aspect of the current technique that could be discussed in the paper is the generation of a family of solutions. Currently, the presented technique produces one optimized feature vector. It would strengthen the paper if the authors discussed how a family of diverse feature vectors (structures) could be evolved using the presented methodology.
Overall, the paper in its current form is more than acceptable. The methodology is clearly delineated and sufficiently supported by the included results.
Minor points: Figure 3: it would be informative to highlight the optimal point with a particular glyph (perhaps a red star?) Reference Source

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility?
No source data required This interesting contribution by Bajorath addresses an important part of the inverse QSAR problem et al. towards automated generation of structures with high activity from QSAR models. Inverse-QSAR, while intellectually appealing, does not find significant applications in modelling in the pharmaceutical industry. Main limitations are linked to de-novo structure generation due to issues with synthetic accessibility. Another hurdle is the challenge to identify the optimal descriptor setting for a model. The authors focus on this latter point, namely a novel and accurate approach for coordinate optimization. They demonstrate how to generate optimal descriptor coordinates under certain constraints like model applicability domain and meaningful descriptor values. A convincing validation study using virtual screening in ChEMBL 22 is also presented.
The report title and abstract cover the content well. The chemoinformatics approach is well conducted and clearly described. The results are presented in a clear and interesting way and capture the interest of F1000 readers. The authors might also want to mention, whether software tools and subroutines from their study are available. Therefore this contribution is an essential view on interpretation of QSAR models and should be indexed in its present form.
Some points could be addressed to highlight further aspects of their work.
How does such an optimization approach handle typical types of variables from real-life models, e.g. two-level variables, variables with small or no SD?
Often model analysis should not result in a single solution, but multiple related structures. Could the optimization approach find multiple descriptor regions to offer options for monitoring secondary properties (e.g. solubility)?
It might be illustrative for one ChEMBL target to systematically generate analogs for potent leads close to the descriptor optimum by applying simple MedChem transformations and check, whether some analogs come closer to the optimum in descriptor space. Chemical changes are minimal here and one could access their impact to the descriptor optimum.
The moderate VS success using QSAR models might suggest a non-optimal approach to define the applicability domain. Some details on the AD definition and the descriptor space might be useful. Do the authors expect that a more strict AD definition might produce reliable results? What does this mean for de-novo structure generation as second step?
Is it possible to apply such a concept for multi-parameter optimization, e.g. multiple QSAR models combined for predicting compound profiles / selectivity / druglikeness?
Minor point: Drawings of chemical structures in figure 7 need to be checked.

Is the work clearly and accurately presented and does it cite the current literature? Yes
Is the study design appropriate and is the work technically sound?

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility?
No source data required

Are the conclusions drawn adequately supported by the results? Yes
No competing interests were disclosed.

Competing Interests:
Referee Expertise: Molecular modelling, drug design I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.