Keywords
Consensus Diversity Plot, compound databases, data mining, diversity, natural products, functional groups, in silico
This article is included in the Cheminformatics gateway.
This article is included in the Mathematical, Physical, and Computational Sciences collection.
We thank all three reviewers for their constructive comments and suggestions to improve the study and the manuscript. In the revised version, we repeated the analysis after standardizing the chemical structures with MolVS. The standardization procedure is described in the revised “Methods, Databases and curation” section, and the code was made available on GitHub. To visualize the chemical space of BIOFACQUIM and the reference compound databases (entire sets and drug-like subsets), we employed the recently published method TMAP (Tree Manifold Approximation and Projection) using Morgan fingerprints with radius 2. An analysis of the distribution of the fraction of sp3-hybridized carbon atoms was added as a new subsection, “Molecular complexity”. Further details of the BIOFACQUIM database were added (source of the compounds and names of the peer-reviewed journals from which the compound information was retrieved). The English language of the entire manuscript was revised.
See the authors' detailed response to the review by Trong D. Tran
See the authors' detailed response to the review by Johannes Kirchmair
See the authors' detailed response to the review by W. Patrick Walters
Natural product-based drug discovery continues to be an important part of drug discovery. Recently, the synergy of natural product research with molecular modeling and chemoinformatics has been gaining importance and is speeding up the drug discovery process1,2. As part of these synergistic efforts, curated databases of natural products play an important role because they are major tools for data mining, hypothesis generation, and starting points for virtual screening. Several databases of natural products are available in the public domain, as reviewed recently3. Our research group has reported initial efforts to assemble a database of natural products from Mexico called BIOFACQUIM4. As part of that work, scaffold content and chemical space diversity were examined. However, a detailed functional group (FG) content analysis, which has proven valuable for characterizing compound databases5, in particular those from natural sources6, has not been reported for BIOFACQUIM. One of the main reasons is that most of the currently available software for identifying functional groups relies on a predefined set of substructures, even though it has been established that one of the major features that discriminate natural products from synthetic compounds is their unique functional groups.
Herein, we report a functional group content analysis of an updated version of BIOFACQUIM. We employed a validated and novel algorithm that identifies all functional groups in a molecule. To put the results for BIOFACQUIM in context, we also discuss the functional group content of other large and related databases in the public domain, namely ChEMBL 25 (ref. 7) and a herein assembled collection of natural products (NPs) with 169,839 compounds.
As described elsewhere, the first version of BIOFACQUIM was developed as a proof-of-concept database applying several filters to include compounds4. Briefly, the database was focused on natural products published between 2000 and 2018 by research groups in a major Mexican institution in eight indexed journals: Journal of Ethnopharmacology, Natural Products Research, Journal of Agricultural and Food Chemistry, Journal of Natural Products, Planta Medica, Phytochemistry, Natural Product Letters, and Molecules. As additional criteria for inclusion of compounds and to increase the quality and reliability of the contents of the database, the procedure for the isolation, purification, and characterization of the natural product should have been described in the article. In this work, we expanded the contents of the BIOFACQUIM database to further explore the diversity of natural products from Mexico.
The second version of BIOFACQUIM was assembled using the same methodology used to develop the first version4, extending the date of publication to 2019. To make the database representative of Mexico, one additional criterion was applied: only compounds collected in Mexico, at any of its institutions (universities, research laboratories and research centers), were included. For the new version of the database, the same curation procedure was performed4 using the Molecular Operating Environment (MOE) software, although this procedure can also be performed with open source software such as MolVS and RDKit. The updated and curated version of BIOFACQUIM contains 531 compounds.
Table 1 summarizes the information on BIOFACQUIM and the other major compound databases used as reference in this work: ChEMBL 25, as a representative example of the biologically tested chemical space, with 1,667,509 unique compounds; and a collection of known natural products with a total of 168,030 molecules. The reference natural product collection was assembled from three general and publicly available natural product databases: the Universal Natural Products Database (UNPD)8, the Natural Products Atlas9 and Natural Products in the PubChem Substance Database10. All data sets were curated using the same procedure. Briefly, compounds were standardized; those consisting of multiple components were split and the largest component was retained. Compounds containing any element other than H, B, C, N, O, F, Si, P, S, Cl, Se, Br and I, as well as compounds with valence errors, were removed from the data set. The remaining compounds were neutralized and reionized, and a canonical tautomer was then generated. Finally, canonical simplified molecular-input line-entry system (SMILES) strings (ignoring stereochemistry information) were generated as the molecular representation, and duplicate structures within each database were removed. The entire process was performed using the functions Standardizer, LargestFragmentChooser, Uncharger, Reionizer and TautomerCanonicalizer implemented in the molecule validation and standardization tool MolVS, which is built on the open source cheminformatics toolkit RDKit. The code is available on GitHub (https://github.com/DIFACQUIM/IFG_General).
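The element filter and the duplicate removal at the end of this curation procedure can be sketched in plain Python. This is only an illustration, not the authors' code: it assumes that standardization has already been done and that each record is a hypothetical (canonical SMILES, element symbols) tuple produced by a cheminformatics toolkit.

```python
# Elements allowed by the curation rule quoted in the text.
ALLOWED_ELEMENTS = frozenset(
    ["H", "B", "C", "N", "O", "F", "Si", "P", "S", "Cl", "Se", "Br", "I"]
)


def filter_and_deduplicate(records):
    """Keep records made only of allowed elements and drop duplicate SMILES.

    `records` is a list of (canonical_smiles, element_symbols) tuples.
    This input shape is hypothetical and assumes prior standardization.
    """
    seen = set()
    kept = []
    for smiles, elements in records:
        if not set(elements) <= ALLOWED_ELEMENTS:
            continue  # contains an element outside the allowed list
        if smiles in seen:
            continue  # duplicate structure within the same database
        seen.add(smiles)
        kept.append(smiles)
    return kept


records = [
    ("CCO", {"C", "H", "O"}),
    ("CCO", {"C", "H", "O"}),            # duplicate of the first record
    ("C[Hg]C", {"C", "H", "Hg"}),        # mercury is not in the allowed list
    ("Clc1ccccc1", {"C", "H", "Cl"}),
]
curated = filter_and_deduplicate(records)
print(curated)
```

Deduplication is done per database, so the function would be called once for each data set rather than on their union.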
The overlap of BIOFACQUIM with the databases selected as reference was assessed at three structural levels: compounds, scaffolds, and functional groups. Compound overlap was determined in terms of canonical SMILES. For scaffold comparison we used the definition proposed by Bemis and Murcko11 as implemented in RDKit, while for functional group overlap we selected the recently published definition and implementation suggested by Ertl5. For each structural level, we identified the structures unique to each dataset as well as those shared by two or all three of them.
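Once each dataset is reduced to a set of strings (canonical SMILES of compounds, scaffolds or functional groups), the overlap analysis comes down to set operations. The sketch below uses made-up identifiers and a hypothetical helper name, `venn_regions`; it shows the seven regions of a three-set Venn diagram and is not the authors' implementation.

```python
def venn_regions(biofacquim, nps, chembl):
    """Return the seven regions of a three-set Venn diagram as sets."""
    return {
        "BIOFACQUIM only": biofacquim - nps - chembl,
        "NPs only": nps - biofacquim - chembl,
        "ChEMBL only": chembl - biofacquim - nps,
        "BIOFACQUIM & NPs": (biofacquim & nps) - chembl,
        "BIOFACQUIM & ChEMBL": (biofacquim & chembl) - nps,
        "NPs & ChEMBL": (nps & chembl) - biofacquim,
        "all three": biofacquim & nps & chembl,
    }


# Toy identifiers standing in for canonical SMILES strings.
biofacquim = {"s1", "s2", "s3", "s4"}
nps = {"s2", "s3", "s5"}
chembl = {"s3", "s4", "s6", "s7"}

regions = venn_regions(biofacquim, nps, chembl)
for name, members in regions.items():
    print(f"{name}: {len(members)}")

# Share of BIOFACQUIM entries not found in either reference set.
unique_fraction = len(regions["BIOFACQUIM only"]) / len(biofacquim)
```

The same function can be applied unchanged at each structural level, since only the strings placed in the sets differ.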
For the functional group content analysis we selected the algorithm recently described by Ertl5, which identifies all functional groups in a molecule by iteratively marching through its atoms. In short, the algorithm marks all heteroatoms in a molecule, all atoms connected by multiple bonds, and the atoms in oxirane, aziridine, and thiirane rings. Afterwards, all connected marked atoms are joined together to form a functional group. Single aromatic heteroatoms are retained only if they are connected to an additional aliphatic functionality. Finally, a generalization scheme is applied in which, for a defined list of common FGs, information about the parent carbon is retained (e.g. to differentiate between alcohols and phenols), as are hydrogen atoms (e.g. to differentiate between aldehydes and ketones). The method is fully described in ref. 5. An open source version of this algorithm is available for Python (https://github.com/rdkit/rdkit/tree/master/Contrib/IFG); however, it does not cover the generalization scheme proposed originally. Therefore, based on the available code, we implemented with RDKit a fragmentation approach that keeps the parent carbon and hydrogen atoms as originally proposed, where the remaining carbon atoms are replaced by dummy atoms. This implementation takes a SMILES string and returns a list with the canonical SMILES of the FGs identified in the molecule. The code is freely available on GitHub (https://github.com/DIFACQUIM/IFG_General). After determining the FG content of the different datasets, we compared the proportions of the most frequent FGs in each library.
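The marking and merging steps described above can be illustrated on a toy molecular graph. The following is a deliberately simplified sketch and not the implementation available on GitHub: the input format (a list of element symbols plus a list of bonds) is hypothetical, the molecule is assumed to be non-aromatic, and the three-membered ring rule, the aromatic heteroatom rule and the generalization scheme are all omitted.

```python
from collections import defaultdict


def identify_functional_groups(elements, bonds):
    """Group marked atoms into functional groups (simplified sketch).

    elements: element symbol per heavy-atom index, e.g. ["O", "C", ...]
    bonds:    (i, j, order) tuples for a non-aromatic molecule
    Returns a list of sorted atom-index lists, one per functional group.
    """
    # Step 1: mark every heteroatom.
    marked = {i for i, symbol in enumerate(elements) if symbol != "C"}
    # Step 2: mark both atoms of every double or triple bond.
    for i, j, order in bonds:
        if order > 1:
            marked.update((i, j))
    # Step 3: join marked atoms that are bonded to each other.
    neighbours = defaultdict(set)
    for i, j, _order in bonds:
        if i in marked and j in marked:
            neighbours[i].add(j)
            neighbours[j].add(i)
    groups, visited = [], set()
    for start in sorted(marked):
        if start in visited:
            continue
        component, stack = set(), [start]
        while stack:
            atom = stack.pop()
            if atom in component:
                continue
            component.add(atom)
            stack.extend(neighbours[atom] - component)
        visited |= component
        groups.append(sorted(component))
    return groups


# 3-hydroxypropanoic acid, HO-CH2-CH2-C(=O)-OH, heavy atoms only.
elements = ["O", "C", "C", "C", "O", "O"]
bonds = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 4, 2), (3, 5, 1)]
groups = identify_functional_groups(elements, bonds)
print(groups)
```

In this toy case the hydroxyl oxygen forms one group on its own, while the carbonyl carbon and both carboxyl oxygens are merged into a second group, which mirrors how an alcohol and a carboxylic acid would be separated.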
The fraction of carbon atoms that are sp3 hybridized (F-sp3) is a common metric to quantify molecular complexity12. Higher values of this descriptor have been associated with improved binding selectivity of compounds13. To compare the complexity of BIOFACQUIM with that of the data sets selected as reference, we computed F-sp3, as implemented in RDKit, for all compounds in the three data sets and compared its distribution among libraries.
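As a small worked example of the descriptor, F-sp3 is the number of sp3 carbons divided by the total number of carbons. The sketch below takes hybridization labels as a hypothetical input rather than parsing structures; RDKit exposes this descriptor (for example as `CalcFractionCSP3`), but the source does not state which call was used. Returning 0.0 for a molecule without carbon atoms is an assumption made here.

```python
from statistics import median


def fraction_sp3(carbon_hybridizations):
    """Fraction of carbon atoms labelled 'sp3'; one label per carbon atom."""
    if not carbon_hybridizations:
        return 0.0  # assumption: no carbon atoms means a value of 0
    n_sp3 = sum(1 for label in carbon_hybridizations if label == "sp3")
    return n_sp3 / len(carbon_hybridizations)


# Hand-assigned hybridization labels for three simple molecules.
library = {
    "cyclohexane": ["sp3"] * 6,
    "ethylbenzene": ["sp3"] * 2 + ["sp2"] * 6,
    "benzene": ["sp2"] * 6,
}
values = {name: fraction_sp3(labels) for name, labels in library.items()}
library_median = median(values.values())
print(values, library_median)
```

Comparing libraries then means comparing the distributions of these per-compound values, for instance through their medians.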
To generate a visual representation of the chemical space covered by the analyzed databases, we selected a recently proposed method named TMAP14 (Tree Manifold Approximation and Projection). This method enables the visualization of up to millions of high-dimensional data points as a two-dimensional tree and has been shown to be better suited than t-distributed Stochastic Neighbor Embedding15 (t-SNE) and Uniform Manifold Approximation and Projection16 (UMAP) for the exploration of large datasets. TMAP consists of four phases: (I) the input data are indexed in a locality-sensitive hashing (LSH) forest data structure, using l prefix trees and d hash functions to encode the data; (II) an undirected weighted c-approximate k-nearest neighbor graph (c-k-NNG) is constructed from the indexed data points, with the Jaccard distances between vertices used as edge weights; (III) a minimum spanning tree (MST) is constructed for the weighted c-k-NNG using Kruskal’s algorithm; and (IV) a layout for the resulting MST is constructed using a spring-electrical model layout algorithm with multilevel multipole-based force approximation, as provided by the modular C++ library Open Graph Drawing Framework17 (OGDF).
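Phases II and III can be illustrated with a short, self-contained sketch. It is not the TMAP package: for simplicity it builds an exact k-nearest neighbor graph by brute force instead of querying an LSH forest, represents fingerprints as sets of on-bit indices, and skips the layout phase entirely. All function names are hypothetical.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two fingerprints given as sets of on-bits."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def knn_edges(fingerprints, k):
    """Exact k-nearest-neighbour graph as {(i, j): distance} with i < j."""
    edges = {}
    n = len(fingerprints)
    for i in range(n):
        ranked = sorted(
            (jaccard_distance(fingerprints[i], fingerprints[j]), j)
            for j in range(n) if j != i
        )
        for distance, j in ranked[:k]:
            edges[(min(i, j), max(i, j))] = distance
    return edges


def kruskal(n_nodes, edges):
    """Minimum spanning tree (or forest) with Kruskal's algorithm."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for (i, j), distance in sorted(edges.items(), key=lambda e: (e[1], e[0])):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge joins two separate components
            parent[root_i] = root_j
            tree.append((i, j, distance))
    return tree


fingerprints = [{1, 2, 3}, {1, 2, 4}, {1, 2, 3, 4}, {7, 8}, {7, 8, 9}]
graph = knn_edges(fingerprints, k=2)
mst = kruskal(len(fingerprints), graph)
print(mst)
```

If the nearest neighbor graph is disconnected, Kruskal's algorithm returns a spanning forest rather than a single tree, which is one reason the number of neighbors matters in practice.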
For the description of compounds, we selected Morgan fingerprints with radius 2 (Morgan2, 1024-bits) as implemented in RDKit. For the generation of the TMAP, the input data was encoded by 1024 hash functions and indexed using 64 prefix trees. The weighted c-k-NNG was built using the 5 nearest neighbors and a factor of 20 for the LSH forest query algorithm. For the layout generation a node size of 0.01 was selected while all the remaining parameters were set to default. These calculations were done using the TMAP python package and all the charts were generated using the matplotlib library18.
The “global” or total diversity of the datasets was analyzed through the Consensus Diversity (CD) Plot19. A CD Plot is a two-dimensional representation of compound datasets based on four different and complementary diversity criteria: molecular fingerprints, molecular scaffolds, physicochemical properties (PCP), and size. Fingerprint-based diversity of each dataset is represented in the X-axis, while scaffold-based diversity is represented in the Y-axis, PCP-based diversity is represented as the filling of the data points using a continuous color scale and the size of the data set is represented with the size of the data points.
For this work, scaffold diversity was assessed as the area under the cyclic system retrieval curve and as the fraction of chemotypes that covers 50% of the dataset (F50). Fingerprint-based diversity was measured as the median of the lower triangle of the pairwise similarity matrix, computed as the Tanimoto coefficient of both MACCS keys (166-bits) and Morgan2 (1024-bits) fingerprints. For PCP-based diversity, six molecular properties of pharmaceutical interest were computed for each unique compound: average molecular weight (AMW), octanol/water partition coefficient (SlogP), number of hydrogen bond donors (HBD), number of hydrogen bond acceptors (HBA), number of rotatable bonds (RB), and topological polar surface area (TPSA). PCP-based diversity was measured as the mean of the lower triangle of the pairwise distance matrix, computed as the Euclidean distance of the scaled PCPs (mean 0 and unit variance). The number of compounds in each dataset was used as the measure of size-based diversity. PCPs were calculated as implemented in RDKit and the CD Plot was constructed using R.
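The three diversity measures can be written compactly in plain Python. The sketch below is illustrative only: fingerprints are sets of on-bit indices, F50 is taken as the first point of the retrieval curve that reaches 50% of the compounds (no interpolation), and scaling uses the population standard deviation. These are assumptions, since the source does not specify such details, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt
from statistics import mean, median, pstdev


def csr_points(scaffolds):
    """Cyclic system retrieval curve as (fraction of scaffolds, fraction of compounds)."""
    counts = sorted(Counter(scaffolds).values(), reverse=True)
    n_compounds, n_scaffolds = len(scaffolds), len(counts)
    points, retrieved = [(0.0, 0.0)], 0
    for rank, count in enumerate(counts, start=1):
        retrieved += count
        points.append((rank / n_scaffolds, retrieved / n_compounds))
    return points


def csr_auc(points):
    """Area under the retrieval curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


def f50(points):
    """Smallest fraction of scaffolds that retrieves at least 50% of compounds."""
    return next(x for x, y in points if y >= 0.5)


def median_tanimoto(fingerprints):
    """Median over all pairs, i.e. the lower triangle of the similarity matrix."""
    return median(len(a & b) / len(a | b)
                  for a, b in combinations(fingerprints, 2))


def pcp_diversity(rows):
    """Mean pairwise Euclidean distance after scaling columns to mean 0, variance 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) or 1.0 for col in columns]  # guard constant columns
    scaled = [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]
    return mean(sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
                for a, b in combinations(scaled, 2))


scaffolds = ["A", "A", "A", "A", "B", "B", "C", "D"]  # one scaffold label per compound
points = csr_points(scaffolds)
print(csr_auc(points), f50(points))
print(median_tanimoto([{1, 2, 3}, {2, 3, 4}, {1, 2, 3, 4}]))
print(pcp_diversity([[1, 10], [2, 20], [3, 30]]))
```

A lower area under the curve and a higher F50 both indicate that compounds are spread more evenly over scaffolds, while a lower median Tanimoto similarity indicates higher fingerprint diversity.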
As described in the Methods section, the updated BIOFACQUIM database contains the chemical structures of 531 compounds, all collected in Mexico. As with the first version4, each molecule is annotated with its chemical structure, the original source of the information (Digital Object Identifier, DOI, of the reference paper), the kingdom, genus, and species of the organism from which the natural product was isolated, the place of collection (city and state), and the activity value of the reported biological activity. Of the 423 compounds in the original dataset, 40 were discarded because they were not collected in Mexico; the current release therefore adds 148 unique compounds to the 383 retained from the previous one. The sources of the 531 compounds are distributed as follows: 406 from plants, 97 from fungi, 15 from propolis and 13 from marine animals.
To assess the chemical space covered by BIOFACQUIM but not by ChEMBL and NPs, we characterized the structural content of the three datasets in terms of unique compounds, scaffolds and functional groups and determined the overlap among them. Figure 1 depicts Venn diagrams showing the overlap among those datasets. It should be noted that, despite the small size of BIOFACQUIM in comparison with ChEMBL and NPs, 15.7% of its compounds and 11.6% of its scaffolds have not been reported in those other major datasets. Figure 2 shows the structures of representative scaffolds identified in at least 2 compounds from BIOFACQUIM, all of them associated with compounds isolated from different plants: scaffold a is associated with hexasaccharides of convolvulinic and jalapinolic acids isolated from Ipomoea purga20; scaffold b corresponds to glycosides of 4-phenylcoumarin isolated from Exostema caribaeum21; scaffold c represents karwinaphthopyranones A2 and B3 isolated from Karwinskia parvifolia22; and scaffold d is associated with batatins X and XI isolated from Ipomoea batatas23. Another remarkable observation is that most of the overlap of BIOFACQUIM with the other datasets involves NPs, either alone or in combination with ChEMBL: 79.9%, 85.3%, and 100% for compounds, scaffolds and FGs, respectively.
Datasets content was analyzed in terms of (a) Compounds, (b) Scaffolds and (c) Functional Groups.
A systematic analysis of functional groups was carried out on BIOFACQUIM and the two datasets selected as reference. In total, 62, 3664, and 11446 functional groups were identified in the BIOFACQUIM, NPs and ChEMBL datasets, respectively (the overlap among them is shown in Figure 1). Of the functional groups present in each dataset, only 12, 15, and 22 were present in at least 1% of the corresponding library (19.4%, 0.4% and 0.2% of the identified functional groups, respectively), while 30, 1879, and 5212 (48.4%, 51.3% and 45.5%, respectively) were singletons. This result is consistent with the typical power law observed in other databases6. The most frequent FGs in BIOFACQUIM are mainly oxygen-containing: the phenolic hydroxyl group (46.1%), followed by ether (41.4%), alcohol hydroxyl group (38.4%), alkene (28.6%) and ester (26.8%). Although in a different order, these are also the most frequent FGs in the herein assembled NPs collection and in other natural product libraries6. This is in contrast to ChEMBL, in which only ether is among the most frequent FGs, while the rest are nitrogen-containing FGs and halogens. The complete results for the FGs found in the datasets are included as Extended data (Supplementary File 1).
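The counts reported above (distinct FGs, FGs present in at least 1% of a library, and singletons) can be derived from per-compound FG lists with a simple tally. The sketch below uses toy data and makes two assumptions that the source does not spell out: each FG is counted once per compound, and a singleton is an FG found in exactly one compound.

```python
from collections import Counter


def fg_frequency_summary(fgs_per_compound, threshold=0.01):
    """Summarize functional group frequencies across a library.

    fgs_per_compound: one list of FG labels (or SMILES) per compound.
    threshold: minimum fraction of compounds for an FG to count as frequent.
    """
    n_compounds = len(fgs_per_compound)
    compound_counts = Counter()
    for fgs in fgs_per_compound:
        compound_counts.update(set(fgs))  # count each FG once per compound
    frequent = sorted(fg for fg, count in compound_counts.items()
                      if count / n_compounds >= threshold)
    singletons = sorted(fg for fg, count in compound_counts.items()
                        if count == 1)
    return {
        "n_unique": len(compound_counts),
        "frequent": frequent,
        "singletons": singletons,
        "proportions": {fg: count / n_compounds
                        for fg, count in compound_counts.items()},
    }


# Toy library of four compounds; labels stand in for canonical FG SMILES.
toy_library = [
    ["alcohol", "alcohol", "alkene"],
    ["alcohol"],
    ["alkene", "carboxylic acid"],
    ["alcohol"],
]
summary = fg_frequency_summary(toy_library, threshold=0.5)
print(summary)
```

The threshold is raised to 0.5 here only because the toy library has four compounds; the text uses 1%.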
As described in the Methods section, the molecular complexity of the BIOFACQUIM, NPs and ChEMBL data sets was assessed through the F-sp3 of their compounds. Figure 3 depicts letter plots of the distribution of this descriptor among the data sets. It shows that NPs is the most complex data set overall according to this metric, with a median of 0.60 and a long tail towards low values. In contrast, ChEMBL is the least complex data set, with a median F-sp3 of 0.31 and a long tail towards high values. BIOFACQUIM is in an intermediate position, with a median of 0.42 and a more symmetrical distribution, which is consistent with its small size in comparison with the other sets and its major overlap with NPs.
To visualize the chemical space of the datasets compared in this work, we built a TMAP as described in the Methods section. This method allows the representation of large datasets as a two-dimensional tree. TMAP shows the relationships among data points and among subsets of data points as branches and sub-branches, so similar compounds and clusters tend to be close in the final representation even if the tree edges are not drawn; for that reason, none of the charts in this work show the tree edges. Figure 4 shows a visual representation of the chemical space of the three datasets analyzed in this work, including the whole datasets and their drug-like subsets. Drug-like compounds in each dataset were defined as those complying with the Lipinski “rule of 5”24 and the Veber criteria25 (AMW ≤ 500, -1 ≤ SlogP ≤ 5, HBA ≤ 10, HBD ≤ 5, RB ≤ 10 and TPSA ≤ 140). For this plot, to better illustrate the unique compounds present in BIOFACQUIM, every compound belonging to more than one library was assigned to a single one: ChEMBL if it belongs to that dataset, NPs if it belongs to that dataset but not to ChEMBL, and BIOFACQUIM only if it is unique to that library. Figure 4 shows that the chemical space covered by the analyzed datasets is practically defined by ChEMBL and highly concentrated in the upper right section of the plot, meaning that the biologically relevant space does not evenly cover the available chemical space. It also shows that NPs cover, in a sparser manner, the same space as ChEMBL. Unique compounds from BIOFACQUIM, on the other hand, are distributed only in the less populated regions of the space, and even in the outer region of the plot, which implies that few similar compounds are present in the other datasets. All these observations apply equally to the drug-like subsets of the original data sets, which represent 44.3%, 48.4% and 69.1% of BIOFACQUIM, NPs, and ChEMBL, respectively.
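The drug-likeness filter and the rule for assigning shared compounds to a single library are both simple enough to state as code. The sketch below uses the thresholds given in the text; the property dictionary keys, the function names and the example values are illustrative assumptions rather than the authors' code.

```python
def is_drug_like(props):
    """Apply the Lipinski and Veber thresholds quoted in the text."""
    return (
        props["AMW"] <= 500
        and -1 <= props["SlogP"] <= 5
        and props["HBA"] <= 10
        and props["HBD"] <= 5
        and props["RB"] <= 10
        and props["TPSA"] <= 140
    )


def assign_library(smiles, chembl, nps):
    """Assign a compound to one library: ChEMBL first, then NPs, else BIOFACQUIM."""
    if smiles in chembl:
        return "ChEMBL"
    if smiles in nps:
        return "NPs"
    return "BIOFACQUIM"


# Illustrative property values, not taken from any of the databases.
small_phenol = {"AMW": 180.2, "SlogP": 1.3, "HBA": 3, "HBD": 2, "RB": 2, "TPSA": 57.5}
large_glycoside = {"AMW": 1150.3, "SlogP": -2.4, "HBA": 25, "HBD": 12, "RB": 18, "TPSA": 380.0}
print(is_drug_like(small_phenol), is_drug_like(large_glycoside))

chembl = {"s3", "s4"}
nps = {"s2", "s3"}
labels = {s: assign_library(s, chembl, nps) for s in ["s1", "s2", "s3", "s4"]}
print(labels)
```

Because ChEMBL is checked first, a compound is labeled BIOFACQUIM only when it appears in neither reference set, which is what makes the unique compounds stand out in the plot.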
Comparison of BIOFACQUIM with two reference datasets. Panels a–d show all compounds in the datasets, panels e–h show drug-like compounds only. Panels a and f show the distribution of compounds from the three data sets among the chemical space in a continuous color scale.
To compare the chemical diversity of the current version of BIOFACQUIM with that of the previous one and of the two datasets selected as reference, we employed a CD Plot. Figure 5 shows the plot comparing the diversity of all datasets according to four criteria: scaffolds on the y-axis, molecular fingerprints on the x-axis, physicochemical properties as the filling of the data points on a continuous color scale, and number of compounds as the size of the data points. This comparison shows the relatively small size of BIOFACQUIM in comparison with the reference datasets. Compared with the previous release of BIOFACQUIM, the current version is more diverse in terms of scaffolds and fingerprints but less diverse in terms of physicochemical properties. In addition, its diversity in terms of molecular fingerprints and physicochemical properties, although not the greatest of the three datasets, is closer to that of NPs; in terms of scaffolds, in contrast, it is the most diverse.
The molecular fingerprint diversity of each data set is represented on the x-axis and was defined as the median Tanimoto coefficient of MACCS keys (166-bits) fingerprint. The scaffold diversity of each database is represented on the y-axis and was defined as the area under the corresponding cyclic system retrieval curve. The diversity based on physicochemical properties (PCP) was defined as the mean euclidean distance of six scaled physicochemical properties (SlogP, TPSA, AMW, RB, HBD, and HBA) and is shown as the filling of the data points using a continuous color scale. The number of compounds is represented by the size of the data points.
The current version of BIOFACQUIM involved the addition of 148 natural products. This was reflected in an increase in diversity based on both scaffolds and molecular fingerprints. In terms of diversity, structural content overlap and complexity, BIOFACQUIM is more similar to the assembled set of natural products than to the set of biologically tested compounds. The chemoinformatic study reported herein revealed that 44.3% of the unique compounds contained in BIOFACQUIM fall within the drug-like space in terms of physicochemical properties. Interestingly, despite its relatively small size, a significant number of compounds and scaffolds (79 and 29, respectively) were identified that are not present in the two large sets used as reference, showing that curated databases of natural products, such as BIOFACQUIM, can serve as a starting point for the study and expansion of the biologically relevant chemical space.
Figshare: BIOFACQUIM_V2. http://doi.org/10.6084/m9.figshare.11312702
This file contains the chemical structures of 531 compounds in SDF format, alongside ID number, compound name, simplified molecular input line entry system, literature reference, kingdom, genus, species, geographical location and biological activity.
Underlying data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).
Figshare: Supporting information for "Functional group and diversity analysis of BIOFACQUIM: A Mexican natural product database". http://doi.org/10.6084/m9.figshare.11312735
This project contains the following extended data:
Extended data are available under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
We thank Roberto Hernández for his valuable contribution toward the update of BIOFACQUIM.
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: cheminformatics, natural products research
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Natural product chemistry, Drug Discovery.
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Computational Chemistry, Cheminformatics, Drug Design
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Partly
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: cheminformatics, natural products research
Version | Date | Invited Reviewer 1 | Invited Reviewer 2 | Invited Reviewer 3 |
---|---|---|---|---|
Version 2 (revision) | 08 Jun 20 | read | | |
Version 1 | 10 Dec 19 | read | read | read |