ChemMaps: Towards an approach for visualizing the chemical space based on adaptive satellite compounds

We present a novel approach, ChemMaps, for visualizing chemical space based on similarity matrices of compound datasets computed with molecular fingerprints. The method uses a 'satellites' approach, where satellites are, in principle, molecules whose similarity to the rest of the molecules in the database provides sufficient information for generating a visualization of the chemical space. Such an approach could make chemical space visualizations more efficient. We describe here a proof-of-principle application of the method to several databases with different diversity measures. Unsurprisingly, we found that the method works better with databases of low 2D diversity. 3D diversity played a secondary role, although it seems to become more relevant as 2D diversity increases. For less diverse datasets, taking as few as 25% of the compounds as satellites seems to be sufficient for a fair depiction of the chemical space. We propose to increase the number of satellites iteratively, in steps of 5% of the whole database, stopping when the new and the prior chemical space correlate highly. This Research Note is a first exploratory step prior to the full application of the method to several datasets.


Introduction
Visual representation of chemical space has multiple applications in drug discovery, including virtual screening, library design and comparison of compound collections1. Among the many methods to explore chemical space, principal component analysis (PCA) of pairwise similarity matrices computed with structural fingerprints has been used to analyze compound datasets2,3. A drawback of this approach is that it becomes impractical for large libraries due to the large dimension of the similarity matrix4. Other approaches use molecular representations different from structural fingerprints, such as physicochemical properties or complexity descriptors, or methods different from PCA, such as multidimensional scaling and neural networks5,6.
Among PCA-based representations of chemical space there have been "chemical satellite" approaches, such as ChemGPS, which select satellite molecules that might not be included in the database to be visualized, but have extreme features that place them as outliers, with the intention of covering as much of the chemical space as possible7-10. A related and more recent approach, Similarity Mapplet, makes the visualization of very large chemical libraries possible by applying PCA to different molecular features, including structural ones11.
Although we agree that not all compounds in a dataset should be necessary to generate a meaningful chemical space, there are still obvious limitations in using a fixed set of satellites to which the user is blinded. Moreover, until now no such method based on structural similarity had been proposed.
We therefore suggest a hybrid approach, ChemMaps, in which a portion of the database to be represented is used as satellites, thereby decreasing the computational effort required to compute the similarity matrix without losing the adaptability of the method to any particular database. Since more diverse sets are expected to require more satellites, a second goal of this study was to qualitatively explore the relationship between the internal diversity of compound datasets and the fraction of compounds required as satellites to generate a good approximation of the chemical space. Table 1 summarizes the six compound datasets considered in this study. Note that small median similarity values imply higher diversity. The datasets were selected from a large-scale profiling study of epigenetic datasets (unpublished study, Naveja JJ and Medina-Franco JL) with relevance in epigenetic drug discovery. We also included DrugBank as a diverse control dataset12. Briefly, we selected focused libraries of inhibitors of DNMT1 (a DNA methyltransferase; diverse in 2D and 3D), L3MBTL3 (a histone methylation reader; diverse in 3D, less diverse in 2D), SMARCA2 (a chromatin remodeller; diverse in 2D, less diverse in 3D), and CREBBP (a histone acetyltransferase; less diverse in both 2D and 3D). Datasets were selected based on their different internal diversity (as measured with the Tanimoto index/MACCS keys for 2D and Tanimoto combo/OMEGA-ROCS for 3D; see Figure S1 in Supplementary File 1). The datasets in this work have approximately the same number of compounds, except for HDAC1 and DrugBank, which were selected to benchmark the method on larger databases (Table 2). We evaluated 2D diversity using the median of Tanimoto/MACCS similarity values in KNIME, version 3.3.2, and 3D diversity using the median of the Combo Score from ROCS, version 3.2.2, and OMEGA, version 2.5.1 (OpenEye software)13-16.
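The 2D diversity measure used above (the median of all pairwise Tanimoto similarities) is straightforward to compute from binary fingerprints. The following is only an illustrative NumPy sketch, not the authors' KNIME workflow: `median_tanimoto` is a hypothetical helper name, and random 166-bit vectors stand in for real MACCS keys.

```python
import numpy as np

def median_tanimoto(fps):
    """Median pairwise Tanimoto similarity of binary fingerprints.

    fps: (n_molecules, n_bits) 0/1 matrix; a lower median similarity
    implies a more diverse dataset.
    """
    fps = fps.astype(np.int64)
    intersection = fps @ fps.T              # shared "on" bits per pair
    counts = fps.sum(axis=1)                # "on" bits per molecule
    union = counts[:, None] + counts[None, :] - intersection
    sim = intersection / np.maximum(union, 1)   # guard against empty unions
    upper = np.triu_indices_from(sim, k=1)      # each pair once, no self-pairs
    return float(np.median(sim[upper]))

# Toy run on random 166-bit "MACCS-like" fingerprints (stand-in data).
rng = np.random.default_rng(0)
fps = (rng.random((50, 166)) < 0.3).astype(np.int64)
print(round(median_tanimoto(fps), 3))
```

In practice the fingerprints would come from a cheminformatics toolkit rather than a random generator; only the median calculation is the point here.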

Amendments from Version 1
In the Introduction, we further discuss the differences between ChemMaps and other similar approaches.
We updated Figure 1-Figure 3 for better visibility. Dataset 1 has been updated to also contain the HDAC1 compounds used in the study.
We have expanded the perspectives of the work in the Conclusion.
The Supplementary File has been updated with Supplementary Methods, Supplementary Results and Table S1, containing the curation of the database and PCA details. Supplementary Figure S1-Supplementary Figure S4 have been revised, and we added a new Supplementary Figure S5 comparing the percentage of variance contributed by the PCs for each studied database.

Methods
To assess the hypothesis of this work we followed two main approaches: A) a backwards approach, which starts by computing the full similarity matrix of each dataset and then removes compounds systematically; and B) a forward approach, which adds compounds to the similarity matrix until a reduced number of compounds (the 'satellites') yields a visualization of the chemical space very similar to that obtained from the full similarity matrix. The second approach is the usual and realistic one from a user standpoint. Each method is detailed in the next two subsections.

Backwards approach
The following steps were implemented in an automated workflow in KNIME, version 3.3.2 (ref. 17):
1. Generate the N × N similarity matrix for the dataset of N compounds, using Tanimoto similarity on extended-connectivity fingerprints of radius 4 (ECFP4) computed with the CDK KNIME nodes.
2. Perform PCA of the similarity matrix generated in step 1 and select the first 2 or 3 principal components (PCs).
3. Compute all pairwise Euclidean distances based on the scores of the 2 or 3 PCs generated in step 2. This set of distances is later used as the reference or 'gold standard'. It should be noted that the "real" distances, or true gold standard, would consider the whole distance matrix. However, for visualization purposes it is unfeasible to render more than 3 dimensions; we therefore selected as reference the best 2D or 3D visualization attainable by means of PCA.
4. Repeat steps 1 to 3 with one compound as satellite, generating an N × 1 similarity matrix. The first compound is selected randomly. In this case only one PC can be calculated, but as the number of satellites increases, 2 or 3 PCs can again be computed.
5. Calculate the correlation between the pairwise distances generated in step 3 using the whole matrix (i.e., the gold standard) and those obtained in step 4.
6. Iterate over steps 4 and 5, increasing the number of satellites one by one until N − 1 satellites are reached. To select the second, third, and subsequent compounds, two approaches were followed: selecting compounds at random, and selecting the compound most diverse from those previously selected (i.e., the MaxMin approach).
7. Estimate the proportion of satellite compounds required to preserve a 'high' correlation (of at least 0.9).
8. Repeat steps 1 to 7 five times for each dataset in order to assess the stability of the method.
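The core of the backwards procedure (steps 1-5) can be sketched as follows. This is an illustrative NumPy implementation, not the authors' KNIME workflow: the similarity matrix is assumed precomputed, PCA is done via SVD of the column-centered matrix, and `backwards_correlation` is a hypothetical helper name.

```python
import numpy as np

def pc_scores(mat, n_pcs=2):
    """Scores on the first n_pcs principal components of `mat`
    (columns centered, PCA via singular value decomposition)."""
    centered = mat - mat.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]

def pairwise_distances(scores):
    """All pairwise Euclidean distances between rows of `scores`."""
    diff = scores[:, None, :] - scores[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def backwards_correlation(sim_full, satellite_idx, n_pcs=2):
    """Correlation between gold-standard distances (PCA on the full
    N x N similarity matrix) and distances from the N x k matrix of
    similarities to the chosen satellite columns."""
    ref = pairwise_distances(pc_scores(sim_full, n_pcs))
    approx = pairwise_distances(pc_scores(sim_full[:, satellite_idx], n_pcs))
    upper = np.triu_indices_from(ref, k=1)
    return float(np.corrcoef(ref[upper], approx[upper])[0, 1])

# Toy run: a random symmetric "similarity" matrix, half the compounds
# as satellites (the printed value is illustrative only).
rng = np.random.default_rng(0)
a = rng.random((60, 60))
sim = (a + a.T) / 2
print(backwards_correlation(sim, np.arange(30)))
```

Because only pairwise distances are compared, sign and rotation ambiguities of the PCA scores do not affect the correlation.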

Forward approach
The backwards approach is useful only for validating the methodology as a proof-of-principle. The obvious objective of a satellite approach, however, is to avoid calculating the complete similarity matrix (i.e., step 1 of the backwards approach). To this end, we developed a satellite-adding, or forward, approach. We started with 25% of the database as satellites and added 5% at each iteration until the correlation of the pairwise Euclidean distances remained high (at least 0.9). Further descriptions of the methods for standardizing the chemical data and integrating the dataset, as well as of the PCA analysis used, can be found in the Supplementary material.
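The forward loop can be sketched as follows, again as an illustrative NumPy version rather than the authors' workflow. `forward_satellites` and `sim_to` are hypothetical names: `sim_to(idx)` is assumed to compute only the N × k columns of similarities to the current satellites, which is the whole point of avoiding the full matrix.

```python
import numpy as np

def map_scores(sim_cols, n_pcs=2):
    """PC scores of the N x k matrix of similarities to k satellites."""
    centered = sim_cols - sim_cols.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]

def distance_correlation(a, b):
    """Correlation of the pairwise-distance sets of two maps."""
    def dists(x):
        diff = x[:, None, :] - x[None, :, :]
        m = np.sqrt((diff ** 2).sum(axis=-1))
        return m[np.triu_indices_from(m, k=1)]
    return float(np.corrcoef(dists(a), dists(b))[0, 1])

def forward_satellites(sim_to, n_mols, start=0.25, step=0.05,
                       threshold=0.9, seed=0):
    """Grow a random satellite set from `start` of the database in
    `step`-sized increments until two consecutive maps correlate at
    >= `threshold`; returns the satellite indices used."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_mols)
    k = max(2, int(start * n_mols))
    prev = map_scores(sim_to(order[:k]))
    while k < n_mols:
        k = min(n_mols, k + max(1, int(step * n_mols)))
        current = map_scores(sim_to(order[:k]))
        if distance_correlation(prev, current) >= threshold:
            return order[:k]
        prev = current
    return order  # fell through: the whole database became satellites

# Quick test with a precomputed random symmetric matrix wrapped as sim_to.
rng = np.random.default_rng(1)
a = rng.random((40, 40))
sim = (a + a.T) / 2
satellites = forward_satellites(lambda idx: sim[:, idx], 40)
print(len(satellites))
```

With a real dataset, `sim_to` would compute fingerprint similarities on demand, so only the columns for the current satellites are ever evaluated.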

Results

Backwards approach
In this pilot study we assessed a few variables to tune the method, such as the number of PCs used (2 or 3) and whether satellites were selected at random or by diversity. We found that random selection is more stable, especially in less diverse datasets (Figure 1 and Figure 2; Figure S2 and Figure S3). Likewise, with 2 PCs the performance is slightly better and more stable (compare Figure 1 and Figure 2 against Figure S2 and Figure S3).
Therefore, from this point onwards we focus on the results of random satellite selection using 2 PCs (Figure 2). From the four datasets, we conclude that for datasets with lower 2D diversity (CREBBP and L3MBTL3, see Table 1), around 25% of the compounds as satellites are enough to obtain a high correlation (≥ 0.9) with the gold standard (i.e., PCA on the whole matrix), whereas for 2D-diverse datasets (i.e., DNMT1 and SMARCA2) up to 75% of the compounds could be needed to ensure a high correlation. Nonetheless, even for these datasets, with 25% of the compounds as satellites the correlation with the gold standard is already between 0.6 and 0.8, and with 50% it is between 0.7 and 0.9. Hence, the higher the diversity of a dataset (especially in 2D), the larger the number of satellites required.

Forward approach
Evidently, a method intended to reduce computing time and disk space usage should not rely on PCA of the whole similarity matrix to determine an adequate number of satellites for each dataset. With that in mind, we designed a method that starts with a given percentage of the database as satellites and then keeps adding a fixed proportion until the correlation between the former and the updated maps is at least 0.9. Figure 3 depicts this approach on the same databases in Table 1 for step sizes of 5%, starting from zero. As in the backwards method, around 5 steps (25% of the database) are usually necessary to reach a stable, high correlation between steps. Figure S4 shows that step sizes of 10% bring no further improvement. We therefore suggest that the method should, by default, start with 25% of the compounds as satellites and then keep adding 5% until a correlation between steps of at least 0.9 is reached.

Application
In this pilot study we applied the ChemMaps method to visualize the chemical space of two larger datasets (HDAC1 and DrugBank, with 3,257 and 1,900 compounds, respectively; Table 1). As shown in Table 2, a significant reduction in computation time was achieved compared to the gold standard, and the correlation between the gold standard and the satellites approach was in both cases higher than 0.9. Figure 4 depicts the chemical spaces generated in both instances. Although the orientation of the map changed for HDAC1, the shape and distances remain quite similar, which is the main objective. This preliminary work supports the hypothesis that a reduced number of compounds is sufficient to generate a visual representation of the chemical space (based on PCA of the similarity matrix) that is very similar to that obtained from PCA of the full similarity matrix.

Conclusion and future directions
This proof-of-concept study suggests that ChemMaps, using adaptive satellite compounds, is a plausible approach for generating a reliable visual representation of the chemical space based on PCA of similarity matrices. The approach works better for relatively less diverse datasets, although it seems to remain robust when applied to more diverse ones. For datasets with low diversity, fewer satellites seem to be enough to produce a representative visualization of the chemical space. The greater relevance of 2D diversity over 3D diversity in this study is likely related to the fact that the chemical space depiction is based on 2D fingerprints; the performance of methods depicting the chemical space based on 3D fingerprints could therefore also be assessed.
A major next step is to conduct a full benchmark study assessing the general applicability of the proposed approach, including larger databases, where we anticipate the method would be even more useful. A second step is to propose a metric that determines the number of compounds required as satellites for a PCA representation of the chemical space based on similarity matrices. The development of quantitative metrics for assessing the stability of the satellite selection, and thus conclusively establishing the superiority of random satellite selection, is also pending. Finally, a more comprehensive and in-depth study of this new methodology should be conducted to further characterize its applicability domain, including a dataset diversity threshold above which the reliability of the approach decreases. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Supplementary material
Supplementary File 1: File with supporting methods, results and five figures. Figure S1: 3D-Consensus Diversity Plot depicting the diversity of the datasets used for the backwards approach; Figure S2: Backwards analysis with 3 PCs picking satellites by diversity; Figure S3: Backwards analysis with 3 PCs picking satellites at random; Figure S4: Forward analysis with 2 PCs picking satellites at random with step sizes of 10%; Figure S5: Plot of the percentage of variance explained by each principal component in the studied datasets.

Although the main issue I raised in my earlier review regarding the 'gold standard' was not addressed, the authors did mention the issue, and they did address some of it in Figure S5. While I am not totally satisfied with this, I feel that their work is ready for publication, but with the reservation regarding the issue just mentioned. This issue should be considered in more detail in their future work.

Open Peer Review
The authors have made some helpful improvements in their manuscript, but there are still some issues they may want to consider to further improve it, although it is not necessary for them to do so.
Step 1 in the 'Backwards approach' is unclear. Perhaps an example would help clarify exactly what is being done. I could not reproduce Step 1 as it is currently written.
In step 3 of the 'Backwards approach' the authors state: "Compute all pair-wise Euclidean distances based on the scores of the 2 or 3 PCs generated in step 2. The set of distances are later used as reference or 'gold standard'. It should be noted that the 'real' distances or true gold standard would consider the whole distance matrix. However, for visualization purposes it is unfeasible to render more than 3 dimensions. Therefore, we selected as reference the best 2D or 3D visualization possible by means of PCA."
With regard to the underlined text in the authors' statement above, I believe they are missing the point. The reason for considering the full space, or a significant subspace, of a given compound collection rather than the 2- or 3-dimensional subspaces derived from a PCA has nothing to do with the impossibility of rendering the data in more than three dimensions, because this 'gold standard' is not going to be graphically depicted. The important point is that the distances calculated in this space are exact within the limitations of the methodology and means of data collection used. Hence, they form the best 'gold standard' that can be achieved within these limitations. The difficulty with this approach is the need to compute Euclidean distances in the full higher-dimensional space, which may require some additional work. However, this distance matrix need only be computed once. I feel the authors should consider this in future work. Alternatively, the authors need not necessarily involve the complete space, but they should choose a subspace of sufficient dimension to ensure that a significant percentage of the variance, say greater than 90%, is accounted for as a basis for the gold standard.
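The subspace-dimension criterion suggested here is simple to compute from the singular values of the centered data matrix. The following is an illustrative NumPy sketch, not part of the authors' workflow; `n_pcs_for_variance` is a hypothetical helper name.

```python
import numpy as np

def n_pcs_for_variance(data, target=0.90):
    """Smallest number of PCs whose cumulative explained variance
    reaches `target`, e.g. to size a reference subspace for a
    higher-dimensional 'gold standard'."""
    centered = data - data.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)
    explained = s ** 2                      # variance along each PC
    cumulative = np.cumsum(explained) / explained.sum()
    return int(np.searchsorted(cumulative, target) + 1)

# Illustrative run on a random data matrix (no expected value claimed).
rng = np.random.default_rng(0)
print(n_pcs_for_variance(rng.random((100, 10))))
```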

In Step 8 the terminology 'prior steps' is used. Does this include all prior steps? I think it would be clearer to name the actual steps, as is done earlier in this section, e.g. step 6 states "Iterate over steps 4 and 5 increasing…". Although unnecessary, might the authors comment on the less well-behaved character of the iterations for the DNMT1 inhibitors dataset compared to the other datasets depicted in Figure 1?
Line 2 in the 'Forward approach' should read: "… e.g., step 1 in the backwards approach."

If I understand the %variance plots in Figure S5 correctly, the %variance for two PCs is greater than or equal to 50% for three of the databases, but less than 50% for the other three, with the SMARCA2 database only reaching about 20%. Is the %variance in the latter three cases, but especially for the SMARCA2 database, sufficient to provide a basis for a reasonably faithful representation of the respective datasets?
No competing interests were disclosed.

Competing Interests:
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard; however, I have significant reservations, as outlined above.

Regarding Figure S2, it seems to me that 3 PCs are at least no worse than 2 PCs, and maybe even slightly better on the basis of the smaller difference between runs. Thus, without a quantitative measure it cannot be said that the performance of 2 PCs is better; it may only be said that 3 PCs do not provide an improvement over 2 PCs (though this is not obvious from the plots), and if the 3 PCs scheme has higher computational demands, this may also be mentioned.
The compound standardisation procedure (Supplementary Methods) looks as if it were taken from a different manuscript. For example, compound activity is mentioned, although it is not used in this study. It is also not stated that the standardisation procedure was not applied to DrugBank (at least to the SDF file available in the supplementary dataset).
No competing interests were disclosed.

Thanks for citing our paper on similarity maps. However, the key point is missing: in our 2015 paper (Awale & Reymond 2015) we computed similarity values for all ChEMBL molecules (> 1 million) against only 100 molecules used as references (the same as "satellites" here), and not the full similarity matrix. We then performed a PCA of the resulting similarity fingerprint to obtain a 2D map (see the interactive "similarity-mapplets" web portal at www.gdb.unibe.ch). We have also published another implementation of similarity maps to visualize the Protein Data Bank, using the same principle of choosing a limited set of satellites only (Jin et al. 2015).

standard compounds or satellites. The set of satellites would be "dynamic" and dependent on the data sets.
No competing interests were disclosed.

The authors compare their satellites to the satellite compounds used by T. Oprea in his 2001 approach to mapping chemical space. Obviously either they did not read Oprea's paper or they misunderstood it: Oprea's satellites are artificial molecules with extreme properties, designed to orient the PCA projection and stretch its dimensions in reproducible directions. However, that projection is simply PCA and does not involve similarity mapping. In similarity mapping the satellites are molecules from within the database to which similarities are calculated.
In the abstract, the authors mention that "3D diversity played a secondary role, although it becomes increasingly relevant as 2D diversity increases". However, I did not find a relevant explanation in the main text supporting this statement.

In the case of the forward selection approach: "…With that in mind, we decided to design a method that starts with a given percentage of the database as satellites, and then keeps adding a proportion of them until the correlation between the former and the updated data is of at least 0.9." The correlation between projections obtained from the current set of satellites and projections obtained from the former set of satellites might well be high, while the correlation to the projection obtained from the complete similarity matrix is still low. How can one assure the quality of the projection in this case?
For all plots, axis labels are too small to read.

Regarding the modifications we have made considering your comments: in the Introduction we briefly discuss other similarity approaches to visualize the chemical space. We have expanded that discussion with a reference to the Similarity Mapplet approach.

We did not intend to imply that Oprea and Gottfries' ChemGPS approach is based on structural similarity. To clarify this point we rephrased the corresponding text in the Introduction.
In the corresponding figure legends we changed "random sets" to "iterations".
We added to the Supplementary Information a discussion of the correlation between the Euclidean distances from the complete similarity matrix and those obtained using only 2 and 3 PCs. However, we would like to highlight that our approach is intended to approximate the best possible chemical space visualization using PCA, which is given by the first 3 PCs at most.
We increased the font size in all figures.
No competing interests were disclosed.

The paper under consideration presents an elegant approach to efficient mapping of chemical space using principal component analysis. Being technically sound in general, well written and easily understandable, the paper nonetheless lacks several technical details without which it is not complete. In particular: the concept of 'chemical satellites' is discussed in a rather concise manner; a few more details could be added, and the seminal paper by Oprea & Gottfries [1] needs to be cited. The approach suggested here is rather different from Oprea's, because satellites are defined there as intentional outliers, whereas in the current work they are simply extracted from the mapped dataset. This difference should be stated more clearly.
The dataset processing routine is not presented. Although the suggested technique would work on totally random datasets (incidentally, the addition of such a dataset to the list of examples would be beneficial and illustrative), standardization of structures should be performed for consistency and for more informative application of similarity measures. The targeted datasets in the supplement look standardized, but DrugBank contains metal ions, unconnected molecules, and macromolecules, all of which may significantly distort the comparison. For the HDAC1 inhibitors, the procedure used to obtain this dataset from ChEMBL should be provided, because a simple target keyword search for 'hdac1' gives 9 different datasets.
The diversity of the datasets could be additionally illustrated with any of the currently available visualization methods; a method that clearly shows compound clustering or dataset diversity would be preferred.
Visual comparison of figures is not sufficient to draw conclusions about the preference for random selection over diversity-based selection (Figures 1, 2, S2, S3). Differences are visible, but their importance and significance are not obvious (maybe just to me), so use of a quantitative measure would be highly appreciated. Random selection sometimes shows lower stability in the backwards analysis (a larger difference between iterations), and this observation could be discussed.
We added a citation to the first publication on ChemGPS by Oprea and Gottfries. In the Introduction we further explained, although briefly (given the length limit of a Research Note), the differences between the two approaches.
We added a Supplementary Information file describing the data curation methodology used.
We also added the HDAC1 dataset to the supplementary files.
Supplementary Figure S1 now addresses the visualization of the diversity of the datasets.
We find your observation about quantifying the stability of the iterations quite interesting, as well as that about determining the applicability domain of the approach (including defining a diversity threshold). Based on this Research Note, we are planning an extensive study fully addressing these concerns.
No competing interests were disclosed.

Graphically representing coordinate-based chemical spaces requires some type of dimensionality reduction. One method involves the use of similarity matrices treated as data matrices that are subsequently subjected to principal component analysis (PCA). The first two or three PCs are then used as a basis to graphically depict the chemical space. Although this approach works reasonably well, the size of the chemical spaces that can be treated is somewhat limited, since the PCA transformation requires diagonalizing a matrix whose dimension equals the number of molecules in the chemical space of interest. The work of Naveja and Medina-Franco seeks to overcome this limitation by building a lower-dimensional representation of chemical space in a stepwise manner using "backwards" or "forward" procedures. While the method has the potential to accomplish their goals, in my estimation it does not provide a sufficiently rigorous test of the approximations that are the foundation of their approach. For this reason, additional work needs to be done before the method can be applied with confidence.
My objection is based on the authors' use of the first 2 or 3 PCs as the 'gold standard' for representing the entire chemical space, and as a basis for all subsequent comparisons of the approximate chemical spaces. I would at least like to see what percentage of the total sample variance is accounted for by these PCs. If it is an insignificant amount, then approximating these PCs by whatever method will not produce a sufficiently accurate model of the chemical space, and their model will have to be improved. The true 'gold standard' is the original set of column vectors in their data matrix from which the PCs are obtained. This produces the 'true' distances between 'molecular points' in the full-dimensional chemical space, but because of its very high dimension, computing distances in the original chemical space can be a problem. An alternative is to carry out the PCA and choose a larger subset of PCs (say 6 or 8) that does account for most of the sample variance, and then use these in the correlation or error analysis.