Functional group and diversity analysis of BIOFACQUIM: A Mexican natural product database

Norberto Sánchez-Cruz; B. Angélica Pilón-Jiménez; José L. Medina-Franco

doi:10.12688/f1000research.21540.1


Research Article

[version 1; peer review: 2 approved, 1 approved with reservations]

Norberto Sánchez-Cruz¹, B. Angélica Pilón-Jiménez¹, José L. Medina-Franco¹

PUBLISHED 10 Dec 2019

¹ Department of Pharmacy, School of Chemistry, National Autonomous University of Mexico, Mexico City, Mexico City, 04510, Mexico

Norberto Sánchez-Cruz
Roles: Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Software, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

B. Angélica Pilón-Jiménez
Roles: Investigation, Methodology, Writing – Original Draft Preparation, Writing – Review & Editing

José L. Medina-Franco
Roles: Conceptualization, Funding Acquisition, Supervision, Writing – Original Draft Preparation, Writing – Review & Editing

This article is included in the Cheminformatics gateway.

This article is included in the Mathematical, Physical, and Computational Sciences collection.

Abstract

Background: Natural product databases are important in drug discovery and other research areas. Analyzing their structural content and functional groups helps characterize their chemical diversity and chemical space coverage. BIOFACQUIM is an emerging database of natural products isolated and characterized in Mexico. Herein, we discuss the results of the first systematic analysis of the functional groups and global diversity of an updated version of BIOFACQUIM.
Methods: BIOFACQUIM was augmented through a literature search and data curation, and the structural content of the dataset was analyzed. This involved a functional group analysis, using a novel algorithm that automatically identifies all functional groups in a molecule, and an assessment of global diversity using consensus diversity plots. To this end, BIOFACQUIM was compared with two large reference databases: ChEMBL 25 and a collection of natural products assembled herein, with 169,839 unique compounds.
Results: The structural content analysis showed that 16.1% of the compounds, 11.3% of the scaffolds, and 6.3% of the functional groups present in the current version of BIOFACQUIM have not been reported in the two large reference datasets. It also showed an increase in diversity in terms of scaffolds and molecular fingerprints relative to the previous version of the dataset, as well as a greater similarity to the assembled collection of natural products than to ChEMBL 25 in terms of diversity and frequent functional groups.
Conclusions: A total of 148 natural products were added to BIOFACQUIM, which increased its diversity in terms of scaffolds and fingerprints. Despite its relatively small size, the database contains a significant number of compounds, scaffolds, and functional groups that are not present in the reference datasets, showing that curated databases of natural products, such as BIOFACQUIM, can serve as a starting point for expanding the biologically relevant chemical space.

Keywords

Consensus Diversity Plot, compound databases, data mining, diversity, natural products, functional groups, in silico

Corresponding authors: Norberto Sánchez-Cruz, José L. Medina-Franco

Competing interests: No competing interests were disclosed.

Grant information: This work was supported by a Consejo Nacional de Ciencia y Tecnología (CONACyT) scholarship, number 335997 (NS-C). The work was also supported by the program NUATEI (Nuevas Alternativas para el Tratamiento de Enfermedades Infecciosas), Instituto de Ciencias Biomédicas, UNAM. The authors are grateful for the computational resources granted by Dirección General de Cómputo y de Tecnologías de Información y Comunicación (DGTIC), project grant LANCAD-UNAM-DGTIC-335, which allowed the use of the supercomputer Miztli at UNAM.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2019 Sánchez-Cruz N et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Sánchez-Cruz N, Pilón-Jiménez BA and Medina-Franco JL. Functional group and diversity analysis of BIOFACQUIM: A Mexican natural product database [version 1; peer review: 2 approved, 1 approved with reservations]. F1000Research 2019, 8(Chem Inf Sci):2071 (https://doi.org/10.12688/f1000research.21540.1) First published: 10 Dec 2019, 8(Chem Inf Sci):2071 (https://doi.org/10.12688/f1000research.21540.1) Latest published: 08 Jun 2020, 8(Chem Inf Sci):2071 (https://doi.org/10.12688/f1000research.21540.2)

Introduction

Natural product-based drug discovery continues to be an important part of drug discovery. Recently, the synergy of natural product research with molecular modeling and chemoinformatics has been gaining importance, speeding up the drug discovery process¹,². As part of these synergistic efforts, curated databases of natural products play an important role, as they are major tools for data mining and hypothesis generation and serve as starting points for virtual screening. Several natural product databases are available in the public domain, as reviewed recently³. Our research group has reported initial efforts to assemble a database of natural products from Mexico called BIOFACQUIM⁴. That work reported an initial analysis of scaffold content, chemical space diversity, and coverage. However, a detailed functional group (FG) content analysis, which has proven valuable for characterizing compound databases⁵, in particular those from natural sources⁶, has not been reported for BIOFACQUIM. One of the main reasons is that most currently available software for identifying functional groups relies on a predefined set of substructures, even though it has been established that one of the major features discriminating natural products from synthetic compounds is their unique functional groups.

Herein, we report a functional group content analysis of an updated version of BIOFACQUIM. We employed a validated and novel algorithm that identifies all functional groups in a molecule. As part of the analysis and to compare the results of BIOFACQUIM we also discuss the functional group contents of other large and related databases in the public domain, namely ChEMBL 25⁷ and a herein assembled collection of natural products (NPs) with 169,839 compounds.

Methods

Databases and data curation

As described elsewhere, the first version of BIOFACQUIM was developed as a proof-of-concept database, with several filters applied to the inclusion of compounds⁴. Briefly, the database focused on natural products published between 2000 and 2018 in eight indexed journals by research groups at a major Mexican institution. As an additional inclusion criterion, intended to increase the quality and reliability of the database, articles had to describe the procedure for the isolation, purification, and characterization of the natural product. In this work, we expanded the contents of BIOFACQUIM to further explore the diversity of natural products from Mexico.

The second version of BIOFACQUIM was assembled using the same methodology as the first version⁴. To make the database more representative of Mexico, we increased the number of Mexican institutions (universities, research laboratories, and research centers) whose publications on novel natural products were covered. The same curation procedure was applied⁴, using the Molecular Operating Environment (MOE) software; a similar procedure can be carried out with RDKit. In addition, only natural products collected in Mexico were retained in the database. The updated and curated version of BIOFACQUIM contains 531 compounds.

Table 1 summarizes the information of BIOFACQUIM and other major compound databases used in this work as reference: ChEMBL 25 as a representative example of the biologically tested chemical space with 1,667,893 unique compounds; and a collection of known natural products with a total of 169,839 molecules. The reference natural product collection was assembled from three general and publicly available natural products databases: the Universal Natural Products Database (UNPD)⁸, the Natural Products Atlas⁹ and Natural Products in PubChem Substance Database¹⁰. The data sets were curated using the same procedure. Briefly, compounds consisting of multiple components (e.g. salts) were split and the largest component (defined by the number of heavy atoms) was retained, as long as the number of heavy atoms of the second largest component was less than 70% of the largest component. If this was not the case, the compounds were removed from the dataset, unless the two largest components were identical. Single component compounds as well as the retained component of multiple component compounds were neutralized and canonical simplified molecular-input line-entry system (SMILES) (ignoring stereochemistry information) were generated as molecular representation. Compounds consisting of any element other than H, B, C, N, O, F, Si, P, S, Cl, Se, Br and I, as well as compounds with valence errors, were removed from the dataset. Duplicate structures in the context of each database were also removed. The entire process was performed by using the open source cheminformatics toolkit RDKit.
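The component-selection and filtering rules described above can be illustrated with a minimal sketch. It is not the authors' RDKit workflow: it assumes each component has already been reduced to a hypothetical (SMILES, heavy-atom count, element set) tuple, and it omits neutralization and valence checking. The function names `select_component` and `curate` are illustrative only.

```python
# Illustrative sketch of the curation rules; not the authors' RDKit code.
# Each compound is a list of components, and each component is a
# hypothetical tuple: (smiles, heavy_atom_count, set_of_element_symbols).

ALLOWED_ELEMENTS = {"H", "B", "C", "N", "O", "F", "Si", "P", "S",
                    "Cl", "Se", "Br", "I"}


def select_component(components, ratio=0.70):
    """Return the component to keep, or None if the compound is discarded."""
    if not components:
        return None
    ranked = sorted(components, key=lambda c: c[1], reverse=True)
    largest = ranked[0]
    if len(ranked) > 1:
        second = ranked[1]
        # Keep the largest component only if the runner-up has fewer than
        # 70% of its heavy atoms, or if the two largest are identical.
        clearly_smaller = second[1] < ratio * largest[1]
        identical = second[0] == largest[0]
        if not (clearly_smaller or identical):
            return None
    # Discard structures containing any element outside the allowed list.
    if not set(largest[2]) <= ALLOWED_ELEMENTS:
        return None
    return largest


def curate(compounds):
    """Apply the rules to every compound and drop duplicate structures."""
    seen, kept = set(), []
    for components in compounds:
        chosen = select_component(components)
        if chosen is not None and chosen[0] not in seen:
            seen.add(chosen[0])
            kept.append(chosen[0])
    return kept
```

In this sketch a salt keeps its organic component, an equimolar mixture of two different molecules is discarded, and a record listing the same molecule twice is kept once.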

Table 1. Compound databases analyzed in this work and summary of the scaffold diversity.

Database	Size (compounds)	Median similarity (MACCS keys - 166 bits)	Median similarity (Morgan2 - 1024 bits)	Mean distance (PCP)	Scaffold diversity (AUC)	Scaffold diversity (F₅₀)
BIOFACQUIM V1	403	0.453	0.123	3.643	0.725	0.167
BIOFACQUIM V2	503	0.442	0.119	3.313	0.710	0.170
Natural products	169,839	0.422	0.112	3.780	0.823	0.041
ChEMBL 25	1,667,893	0.382	0.119	2.183	0.811	0.056

PCP: physicochemical properties; AUC: area under the cyclic system retrieval curve.

Databases overlap

The overlap of BIOFACQUIM with the reference databases was assessed at three structural levels: compounds, scaffolds, and functional groups. Compound overlap was determined in terms of canonical SMILES. For scaffold comparison we used the definition proposed by Bemis and Murcko¹¹ as implemented in RDKit, while for functional group overlap we selected the recently published definition and implementation suggested by Ertl⁵. For each structural level, we identified the unique structures belonging to each dataset as well as those belonging to two or three of them.
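Once every structure is represented by a canonical string (a SMILES for compounds, scaffolds, or functional groups), the overlap reduces to set arithmetic. The sketch below is one way to split the BIOFACQUIM entries into the regions of a three-set Venn diagram; the function and key names are hypothetical and the inputs are assumed to be already canonicalized.

```python
def overlap_regions(biofacquim, nps, chembl):
    """Assign each BIOFACQUIM structure to the Venn region it falls in."""
    b, n, c = set(biofacquim), set(nps), set(chembl)
    return {
        "unique": b - n - c,          # found only in BIOFACQUIM
        "nps_only": (b & n) - c,      # shared with NPs but not ChEMBL
        "chembl_only": (b & c) - n,   # shared with ChEMBL but not NPs
        "all_three": b & n & c,       # shared with both reference sets
    }


def percentages(regions):
    """Express the size of each region as a percentage of all structures."""
    total = sum(len(members) for members in regions.values())
    if total == 0:
        return {name: 0.0 for name in regions}
    return {name: round(100 * len(members) / total, 1)
            for name, members in regions.items()}
```

The same two functions can be applied unchanged at each of the three structural levels, since only the strings being compared differ.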

Functional group analysis

For the functional group content analysis we selected the algorithm recently described by Ertl⁵, which identifies all functional groups in a molecule by iteratively marching through its atoms. In short, the algorithm marks all heteroatoms in a molecule, all atoms connected by multiple bonds, and the atoms in oxirane, aziridine, and thiirane rings. Afterwards, all connected marked atoms are joined together to form a functional group. Single aromatic heteroatoms are retained only if they are connected to an additional aliphatic functionality. Finally, a generalization scheme is applied in which, for a defined list of common FGs, information about the parent carbon is retained (e.g. to differentiate between alcohols and phenols), as well as hydrogen atoms (e.g. to differentiate between aldehydes and ketones). The method is fully described in reference 5. An open source version of this algorithm is available for Python (https://github.com/rdkit/rdkit/tree/master/Contrib/IFG); however, it does not cover the generalization scheme proposed originally. To address this, and based on the available code, we implemented with RDKit a fragmentation approach that retains parent carbons and hydrogens as originally proposed, where the remaining carbon atoms are replaced by dummy atoms. This implementation takes a SMILES string and returns a list with the canonical SMILES of the FGs identified in the molecule. The code is freely available at GitHub (https://github.com/DIFACQUIM/IFG_General). After determining the FG content of the different datasets, we compared the proportions of the most frequent FGs in each library.
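The mark-and-merge idea at the core of the algorithm can be shown on a toy molecular graph. The sketch below is a deliberately simplified illustration, not Ertl's algorithm or the authors' implementation: it marks heteroatoms and atoms joined by multiple bonds and then merges connected marked atoms, but it ignores aromaticity, three-membered heterocycles, and the generalization scheme. The input format (a list of element symbols plus a list of `(i, j, bond_order)` tuples) is an assumption made for the example.

```python
from collections import defaultdict


def find_functional_groups(atoms, bonds):
    """Return functional groups as sorted lists of atom indices.

    atoms: element symbols of the heavy atoms, e.g. ["C", "C", "O"]
    bonds: (i, j, order) tuples over those atom indices
    """
    # Step 1: mark every heteroatom (anything that is not carbon).
    marked = {i for i, element in enumerate(atoms) if element not in ("C", "H")}
    # Step 2: mark both atoms of every double or triple bond.
    for i, j, order in bonds:
        if order > 1:
            marked.update((i, j))
    # Step 3: merge marked atoms that are bonded to each other.
    neighbours = defaultdict(set)
    for i, j, _ in bonds:
        if i in marked and j in marked:
            neighbours[i].add(j)
            neighbours[j].add(i)
    groups, seen = [], set()
    for start in sorted(marked):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            atom = stack.pop()
            group.append(atom)
            for nxt in neighbours[atom]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(sorted(group))
    return groups
```

For acetic acid encoded as `["C", "C", "O", "O"]` with bonds `[(0, 1, 1), (1, 2, 2), (1, 3, 1)]`, the carbonyl carbon and both oxygens end up in a single group, while the methyl carbon is left out.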

Chemical space visualization

To generate a visual representation of the chemical space covered by the analyzed databases, six molecular properties of pharmaceutical interest were computed for each unique compound: averaged molecular weight (AMW), octanol/water partition coefficient (SlogP), number of hydrogen bond donors (HBD), number of hydrogen bond acceptors (HBA), number of rotatable bonds (RB), and topological polar surface area (TPSA). The correlation matrix was computed, and a visual representation of the chemical space was generated using the two properties with the lowest correlation: AMW and SlogP. All calculations were done with RDKit.
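Selecting the least correlated pair of properties can be sketched as follows. The example assumes the Pearson correlation coefficient and compares absolute values; the text does not state which coefficient was used, so both choices are assumptions, as are the function names. Columns with zero variance are not handled.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def least_correlated_pair(table):
    """table maps a property name to its values (one value per compound).

    Returns ((name_a, name_b), r) for the pair with the smallest |r|.
    """
    pairs = {(p, q): pearson(table[p], table[q])
             for p, q in combinations(table, 2)}
    return min(pairs.items(), key=lambda item: abs(item[1]))
```

With the six properties as keys, the returned pair would supply the two axes of the chemical space plot.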

Global diversity

The “global” or total diversity of the datasets was analyzed with the Consensus Diversity (CD) Plot¹², which was designed to capture, in a two-dimensional graph, the diversity of chemical libraries according to different and complementary criteria, including molecular fingerprints, molecular scaffolds, physicochemical properties, and size. For this work, scaffold diversity was assessed as the area under the cyclic system retrieval curve and as the fraction of chemotypes that covers 50% of the dataset (F₅₀). Fingerprint-based diversity was measured as the median of the lower triangle of the pairwise similarity matrix, computed as the Tanimoto coefficient for both the MACCS keys (166 bits) fingerprint and the Morgan fingerprint with radius 2 (Morgan2). Diversity based on physicochemical properties was measured as the mean of the lower triangle of the pairwise distance matrix, computed as the Euclidean distance between the six molecular properties after scaling (mean 0 and unit variance); size-based diversity was measured as the number of compounds in each dataset. The CD Plot was constructed using R.
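The fingerprint- and property-based measures can be written compactly. The sketch below assumes fingerprints are given as sets of on-bit indices and properties as one row of numbers per compound; it scales with the population standard deviation, which is an assumption (the text only states mean 0 and unit variance). Iterating over all unordered pairs is equivalent to reading the lower triangle of the pairwise matrix.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, median, pstdev


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def median_similarity(fingerprints):
    """Median Tanimoto similarity over all unordered pairs of compounds."""
    return median([tanimoto(a, b) for a, b in combinations(fingerprints, 2)])


def mean_scaled_distance(rows):
    """Mean pairwise Euclidean distance after z-scaling each property."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column carries no information; avoid dividing by zero.
    spreads = [pstdev(col) or 1.0 for col in columns]
    scaled = [[(v - m) / s for v, m, s in zip(row, means, spreads)]
              for row in rows]
    return mean([sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))
                 for p, q in combinations(scaled, 2)])
```

A lower median similarity and a larger mean distance both indicate a more diverse dataset. For libraries the size of ChEMBL, the full pairwise computation would in practice need sampling or a vectorized implementation.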

Results and discussion

Update of BIOFACQUIM

As described in the Methods section, the updated BIOFACQUIM database contains the chemical structures of 531 compounds, all collected in Mexico. As with the first version⁴, each molecule is annotated with its chemical structure, the original source of the information (Digital Object Identifier, DOI, of the reference paper), the kingdom, genus, and species of the source organism from which the natural product was isolated, the place of collection (city and state), and the value of the reported biological activity. Of the 423 compounds in the original dataset, 40 were discarded because they were not collected in Mexico, leaving 383; the 531 compounds of the updated database therefore represent an increase of 148 unique compounds over the previous release.

Database overlap

To assess the chemical space covered by BIOFACQUIM but not by ChEMBL or NPs, we characterized the structural content of the three datasets in terms of unique compounds, scaffolds, and functional groups and determined the overlap among them. Figure 1 depicts Venn diagrams showing the overlap among the datasets. Despite the small size of BIOFACQUIM in comparison with ChEMBL and NPs, 16.1% of its compounds, 11.3% of its scaffolds, and 6.3% of its FGs have not been reported in those two major datasets. It is also notable that most of the overlap of BIOFACQUIM with the other datasets involves NPs, either alone or in combination with ChEMBL: 79.7%, 85.5%, and 93.8% for compounds, scaffolds, and FGs, respectively.

Figure 1. Overlap between BIOFACQUIM, reference natural products (NPs) and ChEMBL.

Datasets content was analyzed in terms of (a) Compounds, (b) Scaffolds and (c) Functional Groups.

Functional group analysis

A systematic analysis of functional groups was carried out on BIOFACQUIM and the two reference datasets. A total of 80, 4,300, and 12,677 functional groups were identified in the BIOFACQUIM, NPs, and ChEMBL datasets, respectively (the overlap between them is shown in Figure 1). Of the functional groups present in each dataset, only 12, 14, and 23 were present in at least 1% of the corresponding library (15.0%, 0.3%, and 0.2% of the FGs identified, respectively), while 47, 2,207, and 5,853 (58.7%, 51.3%, and 46.2%, respectively) were singletons. This result is consistent with the typical power law observed in other databases⁶. The most frequent FGs in BIOFACQUIM are predominantly oxygen-containing: the phenolic hydroxyl group (41.4%), followed by the alcohol hydroxyl group (40.0%), ether (39.8%), alkene (26.6%), and ester (25.8%). Although in a different order, these are also the most frequent FGs in the NPs collection assembled here and in other natural product libraries⁶. In contrast, in ChEMBL only ether is among the most frequent FGs, and the rest are nitrogen-containing FGs. The complete list of the FGs found in the datasets is included as Extended data (Supplementary File 1). Remarkably, even with the relatively small size of BIOFACQUIM, five compounds with unique FGs were identified. Three of them were isolated from Mexican plants: two from Salvia ballotiflora (FQNP204 and FQNP211)¹³ and one from Heterotheca inuloides (FQNP480)¹⁴. The other two are fungal metabolites, one from Malbranchea aurantiaca (FQNP435)¹⁵, isolated from bat guano, and one from Sporormiella minimoides (FQNP587)¹⁶, an endophytic fungus isolated from the Mexican tree Hintonia latiflora. Figure 2 shows the chemical structures of those compounds.

Figure 2. Compounds with unique functional groups from BIOFACQUIM.

Identified functional groups are shown in red.
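The frequency statistics reported in this section (share of compounds containing each FG, FGs above the 1% threshold, and singletons) can be derived from per-compound FG lists. The sketch below assumes that a singleton is an FG occurring in exactly one compound and that frequencies are counted per compound rather than per occurrence; both are assumptions, and the function name is hypothetical.

```python
from collections import Counter


def fg_statistics(fgs_per_compound, threshold=0.01):
    """Summarize functional group (FG) occurrence across a library.

    fgs_per_compound: one collection of FG identifiers per compound.
    Returns (frequency, frequent, singletons).
    """
    n_compounds = len(fgs_per_compound)
    # Count each FG once per compound, however often it occurs there.
    presence = Counter(fg for fgs in fgs_per_compound for fg in set(fgs))
    frequency = {fg: n / n_compounds for fg, n in presence.items()}
    frequent = sorted(fg for fg, f in frequency.items() if f >= threshold)
    singletons = sorted(fg for fg, n in presence.items() if n == 1)
    return frequency, frequent, singletons
```

Applying the function to each library separately and comparing the `frequency` dictionaries gives the kind of side-by-side comparison of frequent FGs discussed above.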

Chemical space visualization

To visualize the chemical space of the datasets compared in this work, we computed six physicochemical properties of pharmaceutical relevance. As described in the Methods section, we computed the correlation matrix between them (Extended data, Table S1 in Supplementary File 2). Since all descriptors except SlogP were highly correlated with one another (correlation coefficient > 0.60), we selected SlogP and the property least correlated with it, AMW, as the axes on which to project the libraries. We also carried out a principal component analysis of the scaled properties. As expected from the correlation matrix, the first principal component was associated with all properties except SlogP, whereas the second was associated mainly with SlogP; the explained variances were 69% and 19%, respectively. The projection of the data onto those two components therefore looks quite similar to the projection onto the selected properties (SlogP and AMW) (Extended data, Figure S1 in Supplementary File 2), with the difference that principal components lack the direct physical interpretation of the properties, which can be related, for example, to the orally available drug space described by Lipinski¹⁷. Figure 3 shows a visual representation of the chemical space of the three datasets analyzed in this work. For this plot, in order to better illustrate the unique compounds present in BIOFACQUIM, compounds belonging to two libraries were assigned to a single one: ChEMBL if they belong to that dataset, and NPs if they belong to NPs but not to ChEMBL. Compounds shared among the three libraries were assigned to a separate category. The figure shows that ChEMBL is the most widespread dataset in terms of physicochemical properties and that NPs, BIOFACQUIM, and the compounds shared among the three datasets are mostly contained within this space. A few BIOFACQUIM compounds cover a small part of the space not currently explored by ChEMBL, and, in contrast to ChEMBL and NPs, most BIOFACQUIM compounds lie within the orally available drug space. Panels showing all the overlaps between datasets can be found in Extended data (Figure S2 in Supplementary File 2).

Figure 3. Visual representation of the chemical space covered by BIOFACQUIM.

Comparison of BIOFACQUIM with two reference datasets. Axes were truncated to omit the sparsely populated area.
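The share of a library that falls inside the orally available drug space can be estimated with a simple filter. The sketch below assumes the region is delimited by the Lipinski thresholds on the two plotted properties (molecular weight of at most 500 and logP of at most 5); the text does not specify how the region was drawn, so these bounds and the function names are assumptions. The donor and acceptor limits of the rule of five are applied only when those values are supplied.

```python
def in_oral_space(mw, logp, hbd=None, hba=None):
    """Rule-of-five style check; HBD and HBA are tested only if given."""
    inside = mw <= 500 and logp <= 5
    if hbd is not None:
        inside = inside and hbd <= 5
    if hba is not None:
        inside = inside and hba <= 10
    return inside


def fraction_in_oral_space(compounds):
    """compounds: iterable of (molecular_weight, logp) tuples."""
    compounds = list(compounds)
    if not compounds:
        return 0.0
    inside = sum(in_oral_space(mw, logp) for mw, logp in compounds)
    return inside / len(compounds)
```

Computing this fraction for each dataset gives a numerical counterpart to the visual comparison in Figure 3.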

Global diversity

To compare the chemical diversity of the current version of BIOFACQUIM with that of the previous version and of the two reference datasets, we employed a CD plot. Figure 4 shows the plot comparing the diversity of all datasets according to four criteria: scaffolds on the y-axis, molecular fingerprints on the x-axis, physicochemical properties as the fill of the data points on a continuous color scale, and number of compounds as the size of the data points. The comparison highlights the relatively small size of BIOFACQUIM compared with the reference datasets. Relative to the previous release, the current version of BIOFACQUIM is more diverse in terms of scaffolds and fingerprints but less diverse in terms of physicochemical properties. Its diversity in terms of molecular fingerprints and physicochemical properties, although not the highest among the datasets, is on the order of that of NPs; in terms of scaffolds, by contrast, it is the most diverse dataset.

Figure 4. Consensus diversity plot of BIOFACQUIM.

The molecular fingerprint diversity of each dataset is represented on the x-axis and was defined as the median Tanimoto coefficient of the MACCS keys (166 bits) fingerprint. The scaffold diversity of each database is represented on the y-axis and was defined as the area under the corresponding cyclic system retrieval curve. The diversity based on physicochemical properties (PCP) was defined as the mean Euclidean distance of six scaled physicochemical properties (SlogP, TPSA, AMW, RB, HBD, and HBA) and is shown as the fill of the data points using a continuous color scale. The number of compounds is represented by the size of the data points.
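The two scaffold diversity measures used in Table 1 and on the y-axis of the plot can be computed from a list of scaffold identifiers. The sketch below builds the cyclic system retrieval curve by ranking scaffolds from most to least populated, integrates it with the trapezoid rule starting at the origin, and reports F₅₀ as the smallest fraction of scaffolds covering at least half of the compounds without interpolation. The integration scheme and the handling of F₅₀ are assumptions, and the function name is hypothetical.

```python
from collections import Counter


def csr_metrics(scaffolds):
    """Return (AUC, F50) of the cyclic system retrieval curve.

    scaffolds: one scaffold identifier (e.g. a SMILES) per compound.
    """
    counts = sorted(Counter(scaffolds).values(), reverse=True)
    n_scaffolds, n_compounds = len(counts), sum(counts)
    auc, f50, covered, previous = 0.0, None, 0, 0.0
    for rank, count in enumerate(counts, start=1):
        covered += count
        current = covered / n_compounds
        # Trapezoid of width 1 / n_scaffolds between consecutive points.
        auc += (previous + current) / 2 / n_scaffolds
        if f50 is None and current >= 0.5:
            f50 = rank / n_scaffolds
        previous = current
    return auc, f50
```

Under this construction, a library in which every compound has its own scaffold gives an AUC of 0.5, so AUC values closer to 0.5 and larger F₅₀ values indicate higher scaffold diversity.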

Conclusions

The current version of BIOFACQUIM involved the addition of 148 natural products. This was reflected in an increase in diversity based on both scaffolds and molecular fingerprints. In terms of diversity and structural content overlap, BIOFACQUIM was shown to be more similar to the assembled set of natural products than to the set of biologically tested compounds. The chemoinformatic study reported herein revealed that, in terms of physicochemical properties, most of the compounds in BIOFACQUIM lie within the orally active drug space. Interestingly, despite its relatively small size, a significant number of compounds, scaffolds, and functional groups (81, 28, and 5, respectively) were identified that are not present in the two large reference sets, showing that curated databases of natural products, such as BIOFACQUIM, can serve as a starting point for studying and expanding the biologically relevant chemical space.

Data availability

Underlying data

Figshare: BIOFACQUIM_V2. http://doi.org/10.6084/m9.figshare.11312702

This file contains the chemical structures of 531 compounds in SDF format, alongside ID number, compound name, simplified molecular input line entry system, literature reference, kingdom, genus, species, geographical location and biological activity.

Underlying data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).

Extended data

Figshare: Supporting information for "Functional group and diversity analysis of BIOFACQUIM: A Mexican natural product database". http://doi.org/10.6084/m9.figshare.11312735

This project contains the following extended data:

Supplementary File 1. File with summary results of the functional group analysis.
Supplementary File 2. This file contains the following supporting tables and figures:
- Table S1. Correlation matrix for the six PCP computed for BIOFACQUIM and the two reference datasets.
- Figure S1. Visual representation of the chemical space covered by the three compound libraries generated by principal component analysis.
- Figure S2. Visual representation of the chemical space covered by the possible overlaps between libraries.

Extended data are available under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).

Acknowledgments

We thank Roberto Hernández for his valuable contribution toward the update of BIOFACQUIM.

References

1. Kinghorn AD, Falk H, Gibbons S, et al.: eds. Progress in the Chemistry of Organic Natural Products 110: Cheminformatics in Natural Product Research. Cham: Springer International Publishing. 2019; 110.
2. Medina-Franco JL: New Approaches for the Discovery of Pharmacologically-Active Natural Compounds. Biomolecules. 2019; 9(3): pii: E115.
3. Chen Y, Garcia de Lomana M, Friedrich NO, et al.: Characterization of the Chemical Space of Known and Readily Obtainable Natural Products. J Chem Inf Model. 2018; 58(8): 1518–1532.
4. Pilón-Jiménez BA, Saldívar-González FI, Díaz-Eufracio BI, et al.: BIOFACQUIM: A Mexican compound database of natural products. Biomolecules. 2019; 9(1): pii: E31.
5. Ertl P: An algorithm to identify functional groups in organic molecules. J Cheminform. 2017; 9(1): 36.
6. Ertl P, Schuhmann T: A Systematic Cheminformatics Analysis of Functional Groups Occurring in Natural Products. J Nat Prod. 2019; 82(5): 1258–1263.
7. Gaulton A, Hersey A, Nowotka M, et al.: The ChEMBL database in 2017. Nucleic Acids Res. 2017; 45(D1): D945–D954.
8. Gu J, Gui Y, Chen L, et al.: Use of natural products as chemical library for drug discovery and network pharmacology. PLoS One. 2013; 8(4): e62839.
9. van Santen JA, Jacob G, Singh AL, et al.: The natural products atlas: an open access knowledge base for microbial natural products discovery. ACS Cent Sci. 2019; 5(11): 1824–1833.
10. Ming H, Tiejun C, Yanli W, et al.: Web search and data mining of natural products and their bioactivities in PubChem. Sci China Chem. 2013; 56(10): 1424–1435.
11. Bemis GW, Murcko MA: The properties of known drugs. 1. Molecular frameworks. J Med Chem. 1996; 39(15): 2887–2893.
12. González-Medina M, Prieto-Martínez FD, Owen JR, et al.: Consensus Diversity Plots: a global diversity analysis of chemical libraries. J Cheminform. 2016; 8: 63.
13. Esquivel B, Bustos-Brito C, Sánchez-Castellanos M, et al.: Structure, Absolute Configuration, and Antiproliferative Activity of Abietane and Icetexane Diterpenoids from Salvia ballotiflora. Molecules. 2017; 22(10): pii: E1690.
14. Delgado G, del Socorro Olivares M, Chávez MI, et al.: Antiinflammatory constituents from Heterotheca inuloides. J Nat Prod. 2001; 64(7): 861–864.
15. Martínez-Luis S, González MC, Ulloa M, et al.: Phytotoxins from the fungus Malbranchea aurantiaca. Phytochemistry. 2005; 66(9): 1012–1016.
16. Leyte-Lugo M, Figueroa M, González Mdel C, et al.: Metabolites from the endophytic [corrected] fungus Sporormiella minimoides isolated from Hintonia latiflora. Phytochemistry. 2013; 96: 273–278.
17. Lipinski CA, Lombardo F, Dominy BW, et al.: Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev. 2001; 46(1–3): 3–26.

Open Peer Review

Reviewer Report 30 Jan 2020

Trong D. Tran, GeneCology Research Centre, University of the Sunshine Coast, Maroochydore, Qld, Australia

Approved

https://doi.org/10.5256/f1000research.23733.r58743

Is the work clearly and accurately presented and does it cite the current literature? Yes
Is the study design appropriate and is the work technically sound? Yes
Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions drawn adequately supported by the results? Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Natural product chemistry, Drug Discovery.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Author Response 16 Jun 2020

José L. Medina-Franco, Department of Pharmacy, School of Chemistry, National Autonomous University of Mexico, Mexico City, 04510, Mexico

The manuscript by Sánchez-Cruz et al. describes the cheminformatic analysis of the updated BIOFACQUIM database consisting of 531 Mexican natural products. Despite its relatively small size, the results supported the potential of using BIOFACQUIM as a good source for virtual screening in drug discovery.
Generally, the study was well designed, and the paper was well written. However, since stereochemistry plays a significant role in the interactions between small molecules and protein targets, I suggest the authors add an analysis of the fraction of sp3-hybridized atoms for the new BIOFACQUIM version and compare that value with those of other databases to assess its 3D uniqueness.
RESPONSE: We thank the reviewer for the suggestion. In the revised “Methods” and “Results and discussions”, we added a new section “Complexity analysis” to compare the distribution of the fraction of carbon atoms that are sp3 hybridized. A new figure (3 in the revised version) was added.
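
For readers unfamiliar with the descriptor discussed in this exchange, the fraction of sp3-hybridized carbons is the number of sp3 carbons divided by the total number of carbons in the molecule. The sketch below only illustrates that ratio and is not the authors' implementation: the function name `fraction_sp3` is hypothetical, and the carbon counts are entered by hand rather than perceived from structures by a cheminformatics toolkit.

```python
def fraction_sp3(n_sp3_carbons: int, n_carbons: int) -> float:
    """Return the fraction of carbon atoms that are sp3 hybridized.

    Both counts are supplied by the caller (assumption: they were obtained
    elsewhere, for example from a cheminformatics toolkit).
    """
    if n_carbons <= 0:
        raise ValueError("molecule must contain at least one carbon")
    if not 0 <= n_sp3_carbons <= n_carbons:
        raise ValueError("sp3 carbon count must lie between 0 and the carbon count")
    return n_sp3_carbons / n_carbons


# Hand-counted examples: (sp3 carbons, total carbons)
examples = {
    "benzene": (0, 6),      # fully aromatic, no sp3 carbons
    "toluene": (1, 7),      # one methyl carbon on an aromatic ring
    "cyclohexane": (6, 6),  # fully saturated ring
}

for name, (n_sp3, n_c) in examples.items():
    print(f"{name}: {fraction_sp3(n_sp3, n_c):.2f}")
```

Values closer to 1 indicate a more saturated carbon framework, which is why the descriptor is often used as a simple proxy for three-dimensional character when comparing compound collections.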

Can the authors detail the “eight indexed journals” referred to in the “Methods - Databases and data curation” section?
RESPONSE: The names of the 8 journals were added: “Journal of Ethnopharmacology, Natural Products Research, Journal of Agricultural and Food Chemistry, Journal of Natural Products, Planta Medica, Phytochemistry, Natural Product Letters, and Molecules.”
Competing Interests: No competing interests were disclosed.

Reviewer Report 24 Jan 2020

W. Patrick Walters, Relay Therapeutics, Cambridge, MA, USA

Approved

https://doi.org/10.5256/f1000research.23733.r57650

This paper describes an analysis of the BIOFACQUIM database of natural products derived from Mexico. The paper is well written and provides a useful overview of a resource that will be of use to medicinal and computational chemists, as well as those involved in research on natural products. The authors should be commended for making their database and associated code available as Open Source.

While the paper is well written, there are a few areas where changes in the text may make the presentation a bit more clear:

The second sentence in the abstract could be changed to read: “An analysis of its structural content, as well as functional group occurrence, provides a useful overview, as well as a means of comparison with related databases."
In the first sentence of the “Methods” section, the authors may want to consider changing “done” to “performed”.
In the Introduction, the sentence “As part of that work, it was reported an initial scaffold content, and chemical space diversity, and coverage” could be changed to “As part of that work, scaffold content and chemical space diversity were examined”.
In the first paragraph of the “Results and discussion” section, “place of recollection” would be better stated as “place of collection”.

In the methods section, the authors detail their procedure for “cleaning” the databases. It would be useful to others if the source code for these procedures was made available as part of the GitHub repository associated with this paper.
The authors used canonical SMILES to identify exact matches between compounds in different databases. While this is a reasonable approach, canonical SMILES do not standardize tautomers and may miss some duplicates. An alternate approach would be to use InChI keys, which canonicalize tautomeric forms.

The description of the “Consensus Diversity (CD) Plot” in the “Global diversity” section is a bit vague. An additional paragraph describing this method would be useful.
The density of points in Figure 3 obscures many of the points in the plot. The authors may want to consider using a plotting technique like hexagon binning which enables the plotting of large datasets without obscuring points. For example, see these links:
https://datavizproject.com/data-type/hexagonal-binning/
https://rdrr.io/cran/som.nn/man/hexbinpie.html

I’m not convinced that plotting molecular weight vs LogP is the best way to represent the diversity and overlap of datasets. An alternative might be to use a technique like t-SNE to project a set of fingerprints into a lower dimensional space: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html.

It might also be interesting to provide another plot which only shows the molecules from the three datasets studied that are in “drug-like” chemical space rather than the very large range shown in Figure 3.

In summary, this paper provides a valuable resource to researchers working in a number of different areas. Hopefully, these suggestions will provide minor enhancements to the work.

Is the work clearly and accurately presented and does it cite the current literature?
Yes

Is the study design appropriate and is the work technically sound?
Yes

Are sufficient details of methods and analysis provided to allow replication by others?
Yes

If applicable, is the statistical analysis and its interpretation appropriate?
Yes

Are all the source data underlying the results available to ensure full reproducibility?
Yes

Are the conclusions drawn adequately supported by the results?
Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Computational Chemistry, Cheminformatics, Drug Design

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Author Response 16 Jun 2020

José L. Medina-Franco, Department of Pharmacy, School of Chemistry, National Autonomous University of Mexico, Mexico City, 04510, Mexico

This paper describes an analysis of the BIOFACQUIM database of natural products derived from Mexico. The paper is well written and provides a useful overview of a resource that will be of use to medicinal and computational chemists, as well as those involved in research on natural products. The authors should be commended for making their database and associated code available as Open Source. While the paper is well written, there are a few areas where changes in the text may make the presentation a bit more clear:
RESPONSE: Thank you for the feedback and constructive comments. In the second version of the manuscript we incorporated all your suggestions as described hereunder.

The second sentence in the abstract could be changed to read: “An analysis of its structural content, as well as functional group occurrence, provides a useful overview, as well as a means of comparison with related databases."

In the first sentence of the “Methods” section, the authors may want to consider changing “done” to “performed”.

In the Introduction, the sentence “As part of that work, it was reported an initial scaffold content, and chemical space diversity, and coverage” could be changed to “As part of that work, scaffold content and chemical space diversity were examined”.

In the first paragraph of the “Results and discussion” section, “place of recollection” would be better stated as “place of collection”.

RESPONSE: All four changes in the text were made.

In the methods section, the authors detail their procedure for “cleaning” the databases. It would be useful to others if the source code for these procedures was made available as part of the GitHub repository associated with this paper.
RESPONSE: In agreement with the suggestion, the code to prepare the databases was made freely available. In the revised version of the manuscript “Methods, Databases and data curation section” we included the sentence: “The code is available at GitHub (https://github.com/DIFACQUIM/IFG_General).”

The authors used canonical SMILES to identify exact matches between compounds in different databases. While this is a reasonable approach, canonical SMILES do not standardize tautomers and may miss some duplicates. An alternate approach would be to use InChI keys, which canonicalize tautomeric forms.
RESPONSE: To address this point, we repeated the analysis after standardizing the structures with MolVS, which does consider tautomers. The standardization procedure is described in detail in the revised “Methods, Databases and curation” section. All quantities that depend on the corrected number of unique molecules were revised, and the main text and figures were updated accordingly.
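
The point raised in this exchange can be illustrated with a small sketch of why the choice of identifier matters when counting unique molecules. It is not the authors' curation code: the `deduplicate` helper is hypothetical, no cheminformatics library is called, and the `tautomer_key` values are placeholders standing in for whatever a tautomer-aware tool (for example MolVS standardization or an InChIKey) would assign. The only chemistry assumed is that 2-hydroxypyridine and 2-pyridone are tautomers of each other.

```python
from typing import Callable, Dict, Hashable, List

Record = Dict[str, str]


def deduplicate(records: List[Record], key: Callable[[Record], Hashable]) -> List[Record]:
    """Keep the first record seen for each distinct key value."""
    seen = set()
    unique = []
    for rec in records:
        k = key(rec)
        if k not in seen:
            seen.add(k)
            unique.append(rec)
    return unique


# "tautomer_key" holds placeholder labels, NOT real InChIKeys.
records = [
    {"name": "2-hydroxypyridine", "smiles": "Oc1ccccn1", "tautomer_key": "KEY-A"},
    {"name": "2-pyridone", "smiles": "O=c1cccc[nH]1", "tautomer_key": "KEY-A"},
    {"name": "ethanol", "smiles": "CCO", "tautomer_key": "KEY-B"},
]

# Matching on the SMILES string treats the two tautomers as different molecules.
by_smiles = deduplicate(records, key=lambda r: r["smiles"])

# Matching on a tautomer-aware key merges them into one entry.
by_tautomer_key = deduplicate(records, key=lambda r: r["tautomer_key"])

print(len(by_smiles), len(by_tautomer_key))  # prints "3 2"
```

With string-based matching the tautomer pair is counted twice, which inflates the number of unique compounds and can make a functional group look exclusive to one database when it is not.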

The description of the “Consensus Diversity (CD) Plot” in the “Global diversity” section is a bit vague. An additional paragraph describing this method would be useful.
RESPONSE: We described in more detail the Consensus Diversity (CD) Plots.

The density of points in Figure 3 obscures many of the points in the plot. The authors may want to consider using a plotting technique like hexagon binning which enables the plotting of large datasets without obscuring points. For example, see these links:
https://datavizproject.com/data-type/hexagonal-binning/
https://rdrr.io/cran/som.nn/man/hexbinpie.html
RESPONSE: To address the reviewer’s point, we used the recently published visualization method TMAP (Tree Manifold Approximation and Projection) (J. Cheminform. 2020, 12, 12.). This approach is suited for visualization of very large data sets (up to millions of data points with high dimensionality). Figure 3 was replaced with a TMAP where the 3 data sets are shown in separate panels and a heatmap in the form of a hexbin plot for the distribution of compounds is included (Figure 4 in the revised manuscript). A description of TMAP was added to the Methods section citing the original work, and the resulting map is discussed accordingly in the Results and Discussion section.
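
As background to the hexagon binning mentioned in this exchange, the idea is to replace overlapping points with counts per hexagonal cell, so that dense regions remain readable. The following is a minimal standard-library sketch under stated assumptions, not the plotting code used for the revised figure: it assumes a pointy-top hexagon grid indexed by axial coordinates, the names `hex_bin` and `hexbin_counts` are hypothetical, and the sample coordinates are made up.

```python
import math
from collections import Counter
from typing import Iterable, Tuple


def hex_bin(x: float, y: float, size: float) -> Tuple[int, int]:
    """Return the axial (q, r) index of the pointy-top hexagon containing (x, y).

    `size` is the distance from a hexagon's centre to any of its corners.
    """
    # Convert Cartesian coordinates to fractional axial coordinates.
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 * y / 3) / size
    s = -q - r  # third cube coordinate; q + r + s == 0

    # Round each coordinate, then repair the one with the largest rounding
    # error so the constraint q + r + s == 0 still holds.
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return rq, rr


def hexbin_counts(points: Iterable[Tuple[float, float]], size: float) -> Counter:
    """Count how many points fall into each hexagonal cell."""
    return Counter(hex_bin(x, y, size) for x, y in points)


# Made-up 2D coordinates: two points near the origin, two near the next cell.
points = [(0.0, 0.0), (0.1, 0.1), (math.sqrt(3), 0.0), (math.sqrt(3) + 0.05, -0.05)]
counts = hexbin_counts(points, size=1.0)
print(counts)
```

Each count can then be mapped to a colour, so a cell holding thousands of compounds is drawn once instead of as thousands of overlapping markers.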

I’m not convinced that plotting molecular weight vs LogP is the best way to represent the diversity and overlap of datasets. An alternative might be to use a technique like t-SNE to project a set of fingerprints into a lower dimensional space:
https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html.
RESPONSE: Following the recommendation, we now compare the three data sets using TMAPs. For the description of compounds, we selected Morgan fingerprints with radius 2.

It might also be interesting to provide another plot which only shows the molecules from the three datasets studied that are in “drug-like” chemical space rather than the very large range shown in Figure 3.
RESPONSE: In agreement with the suggestion, in the newly added TMAPs (Figure 4 in the revised manuscript) we included a visualization of drug-like subsets of the three databases.
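
Neither the comment nor the response states how the drug-like subsets were defined. The sketch below therefore assumes one common convention, Lipinski's rule of five (molecular weight at most 500, logP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors), and allows a configurable number of violations. The function names, the `max_violations` default, and the descriptor values are all hypothetical and serve only to show the filtering step.

```python
from typing import Dict, List, Union

Compound = Dict[str, Union[str, float, int]]


def rule_of_five_violations(c: Compound) -> int:
    """Count how many of the four rule-of-five thresholds a compound exceeds."""
    return sum([
        c["mw"] > 500,   # molecular weight
        c["logp"] > 5,   # octanol/water partition coefficient
        c["hbd"] > 5,    # hydrogen-bond donors
        c["hba"] > 10,   # hydrogen-bond acceptors
    ])


def drug_like_subset(compounds: List[Compound], max_violations: int = 1) -> List[Compound]:
    """Keep compounds with at most `max_violations` rule-of-five violations."""
    return [c for c in compounds if rule_of_five_violations(c) <= max_violations]


# Hypothetical descriptor values, not taken from any of the three databases.
compounds = [
    {"id": "cpd-1", "mw": 320.4, "logp": 2.1, "hbd": 2, "hba": 5},
    {"id": "cpd-2", "mw": 812.9, "logp": 6.3, "hbd": 7, "hba": 14},
    {"id": "cpd-3", "mw": 540.0, "logp": 3.0, "hbd": 1, "hba": 6},
]

lenient = drug_like_subset(compounds)                   # one violation tolerated
strict = drug_like_subset(compounds, max_violations=0)  # no violations tolerated
print([c["id"] for c in lenient], [c["id"] for c in strict])
```

Applying the same filter to every data set before projection keeps the comparison fair, since each map then covers the same region of property space.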
Competing Interests: No competing interests were disclosed.

Reviewer Report 13 Dec 2019

Johannes Kirchmair, Department of Pharmaceutical Chemistry, University of Bergen, Vienna, Austria

Approved with Reservations

https://doi.org/10.5256/f1000research.23733.r57653

Sánchez-Cruz et al. report on the diversity of an extended version of BIOFACQUIM, a database of natural products extracted from organisms endemic to Mexico.

The overall scientific approach is sound. The only technical issue that I believe requires attention is tautomerism, which appears to have been neglected. Consideration of tautomerism is unlikely to change the outcomes and conclusions of the statistical analysis, but for the functional group analysis it will be important to understand whether the five new functional groups from BIOFACQUIM presented in Figure 2 are indeed unique to that source. Could the authors please use InChI, MolVS or a similar notation/approach for standardizing the molecular (sub-) structures and conduct a tautomer-invariant identification of functional groups?

Language issues are apparent throughout the manuscript and careful proofreading will be required. Some (parts of) statements are unclear, e.g.
"it was reported an initial scaffold content"
"the articles should have described the procedure"
"decided to augment the number of Mexican institutions"
"A similar procedure can be done using RDKit"

Minor comment: Could the authors please clarify in the manuscript the sources of the natural products included in BIOFACQUIM (e.g. plants, bacteria, fungi)?

Is the work clearly and accurately presented and does it cite the current literature?
Yes

Is the study design appropriate and is the work technically sound?
Partly

Are sufficient details of methods and analysis provided to allow replication by others?
Yes

If applicable, is the statistical analysis and its interpretation appropriate?
Yes

Are all the source data underlying the results available to ensure full reproducibility?
Yes

Are the conclusions drawn adequately supported by the results?
Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: cheminformatics, natural products research

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Author Response 16 Jun 2020

José L. Medina-Franco, Department of Pharmacy, School of Chemistry, National Autonomous University of Mexico, Mexico City, 04510, Mexico

The overall scientific approach is sound. The only technical issue that I believe requires attention is tautomerism, which appears to have been neglected. Considering tautomerism is unlikely to change the outcomes and conclusions of the statistical analysis, but for the functional group analysis it will be important to understand whether the five new functional groups from BIOFACQUIM presented in Figure 2 are indeed unique to that source. Could the authors please use InChI, MolVS or a similar notation/approach to standardize the molecular (sub-)structures and conduct a tautomer-invariant identification of functional groups?
RESPONSE: Following the recommendation, we repeated the analysis reported in the manuscript after standardizing the structures of all databases with MolVS. The standardization procedure is described in detail in the revised “Methods, Databases and curation” section, and the code is available on GitHub (https://github.com/DIFACQUIM/IFG_General). All quantities that depend on the corrected number of unique molecules were revised, and the main text and figures (including Figure 2) were updated accordingly.

Language issues are apparent throughout the manuscript and careful proofreading will be required. Some (parts of) statements are unclear, e.g.
"it was reported an initial scaffold content"
"the articles should have described the procedure"
"decided to augment the number of Mexican institutions"
"A similar procedure can be done using RDKit"
RESPONSE: The statements pointed out were fixed. We also carefully proofread the second version of the manuscript.

Minor comment: Could the authors please clarify in the manuscript the sources of the natural products included in BIOFACQUIM (e.g. plants, bacteria, fungi)?
RESPONSE: In the revised “Results and discussion, Update of BIOFACQUIM section” we included the number of compounds from each source (“…406 from plants, 97 from fungus, 15 from propolis and 13 from marine animals….”).
Competing Interests: No competing interests were disclosed.

Version 2

VERSION 2 PUBLISHED 10 Dec 2019

Open Peer Review

Reviewer Status

Reviewer Reports

	Invited Reviewers
	1	2	3
Version 2 (revision) 08 Jun 20	read
Version 1 10 Dec 19	read	read	read

Johannes Kirchmair, University of Bergen, Vienna, Austria
W. Patrick Walters, Relay Therapeutics, Cambridge, USA
Trong D. Tran, University of the Sunshine Coast, Maroochydore, Australia

Reviewer Report

11 Jun 2020 | for Version 2

Johannes Kirchmair, Department of Pharmaceutical Chemistry, University of Bergen, Vienna, Austria

Approved

I have no further comments or requests.

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

cheminformatics, natural products research

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer Report

30 Jan 2020 | for Version 1

Trong D. Tran, GeneCology Research Centre, University of the Sunshine Coast, Maroochydore, Qld, Australia

Approved

Is the work clearly and accurately presented and does it cite the current literature?
Yes

Is the study design appropriate and is the work technically sound?
Yes

Are sufficient details of methods and analysis provided to allow replication by others?
Yes

If applicable, is the statistical analysis and its interpretation appropriate?
Yes

Are all the source data underlying the results available to ensure full reproducibility?
Yes

Are the conclusions drawn adequately supported by the results?
Yes

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Natural product chemistry, Drug Discovery.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Author Response

16 Jun 2020

José L. Medina-Franco, Department of Pharmacy, School of Chemistry, National Autonomous University of Mexico, Mexico City, 04510, Mexico

The manuscript by Sánchez-Cruz et al. describes the cheminformatic analysis of the updated BIOFACQUIM database consisting of 531 Mexican natural products. Despite its relatively small size, the results supported the potential of using BIOFACQUIM as a good source for virtual screening in drug discovery.
Generally, the study was well designed and the paper was well written. However, since stereochemistry plays a significant role in the interactions between small molecules and protein targets, I suggest that the authors add an analysis of the fraction of sp3-hybridized atoms for the new BIOFACQUIM version and compare that value with those of other databases to assess its 3D uniqueness.
RESPONSE: We thank the reviewer for the suggestion. In the revised “Methods” and “Results and discussions”, we added a new section “Complexity analysis” to compare the distribution of the fraction of carbon atoms that are sp3 hybridized. A new figure (3 in the revised version) was added.
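The quantity compared in the new “Complexity analysis” section, the fraction of carbon atoms that are sp3 hybridized, is simply the number of sp3 carbons divided by the total number of carbons. The sketch below makes that definition concrete under the assumption that atom elements and hybridization states have already been perceived by a cheminformatics toolkit; the `(element, hybridization)` tuple layout is hypothetical. Toolkits such as RDKit expose a ready-made descriptor for this (`rdMolDescriptors.CalcFractionCSP3`), which would normally be used instead.

```python
def fraction_csp3(atoms):
    """Return the fraction of carbon atoms that are sp3 hybridized.

    `atoms` is an iterable of (element, hybridization) pairs for the heavy
    atoms of one molecule, e.g. ("C", "sp3"). This layout is an assumption
    made for illustration. Molecules without carbon yield 0.0.
    """
    carbon_states = [hyb for element, hyb in atoms if element == "C"]
    if not carbon_states:
        return 0.0
    n_sp3 = sum(1 for hyb in carbon_states if hyb == "sp3")
    return n_sp3 / len(carbon_states)


# Hand-annotated toy molecules (heavy atoms only).
ethanol = [("C", "sp3"), ("C", "sp3"), ("O", "sp3")]
benzene = [("C", "sp2")] * 6
toluene = [("C", "sp3")] + [("C", "sp2")] * 6

for name, mol in [("ethanol", ethanol), ("benzene", benzene), ("toluene", toluene)]:
    print(name, round(fraction_csp3(mol), 3))
```

Comparing the distribution of this value across databases, as the response describes, then amounts to computing it per molecule and plotting one distribution per data set.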

Can the authors list the “eight indexed journals” referred to in the “Methods - Databases and data curation” section?
RESPONSE: The names of the 8 journals were added: “Journal of Ethnopharmacology, Natural Products Research, Journal of Agricultural and Food Chemistry, Journal of Natural Products, Planta Medica, Phytochemistry, Natural Product Letters, and Molecules.”

Competing Interests

No competing interests were disclosed.

Reviewer Report

24 Jan 2020 | for Version 1

W. Patrick Walters, Relay Therapeutics, Cambridge, MA, USA

Approved

The second sentence in the abstract could be changed to read: “An analysis of its structural content, as well as functional group occurrence, provides a useful overview, as well as a means of comparison with related databases."
In the first sentence of the “Methods” section, the authors may want to consider changing “done” to “performed”.
In the Introduction, the sentence “As part of that work, it was reported an initial scaffold content, and chemical space diversity, and coverage” could be changed to “As part of that work, scaffold content and chemical space diversity were examined”.
In the first paragraph of the “Results and discussion” section, “place of recollection” would be better stated as “place of collection”.

Is the work clearly and accurately presented and does it cite the current literature?
Yes

Is the study design appropriate and is the work technically sound?
Yes

Are sufficient details of methods and analysis provided to allow replication by others?
Yes

If applicable, is the statistical analysis and its interpretation appropriate?
Yes

Are all the source data underlying the results available to ensure full reproducibility?
Yes

Are the conclusions drawn adequately supported by the results?
Yes

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Computational Chemistry, Cheminformatics, Drug Design

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Author Response

16 Jun 2020

José L. Medina-Franco, Department of Pharmacy, School of Chemistry, National Autonomous University of Mexico, Mexico City, 04510, Mexico

This paper describes an analysis of the BIOFACQUIM database of natural products derived from Mexico. The paper is well written and provides a useful overview of a resource that will be of use to medicinal and computational chemists, as well as those involved in research on natural products. The authors should be commended for making their database and associated code available as Open Source. While the paper is well written, there are a few areas where changes in the text may make the presentation a bit more clear:
RESPONSE: Thank you for the feedback and constructive comments. In the second version of the manuscript we incorporated all your suggestions as described hereunder.

The second sentence in the abstract could be changed to read: “An analysis of its structural content, as well as functional group occurrence, provides a useful overview, as well as a means of comparison with related databases."
In the first sentence of the “Methods” section, the authors may want to consider changing “done” to “performed”.
In the Introduction, the sentence “As part of that work, it was reported an initial scaffold content, and chemical space diversity, and coverage” could be changed to “As part of that work, scaffold content and chemical space diversity were examined”.
In the first paragraph of the “Results and discussion” section, “place of recollection” would be better stated as “place of collection”.

RESPONSE: All four changes were made in the text.

In the methods section, the authors detail their procedure for “cleaning” the databases. It would be useful to others if the source code for these procedures was made available as part of the GitHub repository associated with this paper.
RESPONSE: In agreement with the suggestion, the code to prepare the databases was made freely available. In the revised version of the manuscript “Methods, Databases and data curation section” we included the sentence: “The code is available at GitHub (https://github.com/DIFACQUIM/IFG_General).”

The authors used canonical SMILES to identify exact matches between compounds in different databases. While this is a reasonable approach, canonical SMILES do not standardize tautomers and may miss some duplicates. An alternate approach would be to use InChI keys, which canonicalize tautomeric forms.
RESPONSE: To address this point, we repeated the analysis after standardizing the structures with MolVS, which does consider tautomers. The standardization procedure is described in detail in the revised “Methods, Databases and curation” section. All quantities that depend on the corrected number of unique molecules were revised, and the main text and figures were updated accordingly.
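The effect of the identifier on duplicate detection can be sketched as follows. The example assumes that each record already carries both a SMILES string and a tautomer-invariant key computed elsewhere (for instance an InChIKey, or the SMILES of a MolVS-standardized structure). The record layout and the key strings are placeholders invented for illustration, not real InChIKeys. The first two SMILES describe 2-hydroxypyridine and its tautomer 2-pyridone.

```python
from collections import defaultdict

# Hypothetical records; "key" stands in for a tautomer-invariant identifier
# produced by a toolkit. The strings below are placeholders, not real keys.
records = [
    {"db": "A", "smiles": "Oc1ccccn1", "key": "KEY-PYRIDONE"},
    {"db": "B", "smiles": "O=c1cccc[nH]1", "key": "KEY-PYRIDONE"},
    {"db": "B", "smiles": "CCO", "key": "KEY-ETHANOL"},
]


def group_by(records, field):
    """Group records that share the same value of `field`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[field]].append(record)
    return dict(groups)


def overlap(records, field, db1, db2):
    """Return the identifiers (values of `field`) present in both databases."""
    ids1 = {r[field] for r in records if r["db"] == db1}
    ids2 = {r[field] for r in records if r["db"] == db2}
    return ids1 & ids2


unique_by_smiles = group_by(records, "smiles")
unique_by_key = group_by(records, "key")
print(len(unique_by_smiles), len(unique_by_key))
print(overlap(records, "smiles", "A", "B"), overlap(records, "key", "A", "B"))
```

Matching on the raw strings treats the two tautomers as different compounds and reports no overlap between the databases, whereas matching on the shared key merges them. This is the kind of missed duplicate the reviewer describes.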

The description of the “Consensus Diversity (CD) Plot” in the “Global diversity” section is a bit vague. An additional paragraph describing this method would be useful.
RESPONSE: We now describe the Consensus Diversity (CD) plots in more detail.

The density of points in Figure 3 obscures many of the points in the plot. The authors may want to consider using a plotting technique like hexagon binning which enables the plotting of large datasets without obscuring points. For example, see these links:
https://datavizproject.com/data-type/hexagonal-binning/
https://rdrr.io/cran/som.nn/man/hexbinpie.html
RESPONSE: To address the reviewer’s point, we used the recently published visualization method TMAP (Tree Manifold Approximation and Projection) (J. Cheminform. 2020, 12, 12). This approach is suited to visualizing very large data sets (up to millions of high-dimensional data points). Figure 3 was replaced with a TMAP in which the three data sets are shown in separate panels, together with a hexbin heatmap of the distribution of compounds (Figure 4 in the revised manuscript). A description of TMAP, citing the original work, was added to the Methods section, and the resulting map is discussed in the Results and Discussion section.
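The hexagonal binning mentioned by the reviewer can be sketched without a plotting library. Each 2D point is assigned to the hexagon that contains it, and the plot then colors hexagons by their counts instead of drawing overlapping points. The sketch below is not the authors' code. It assumes pointy-top hexagons addressed by axial coordinates, a hypothetical `size` parameter (center-to-corner distance), and coordinates that have already been rescaled so both axes are comparable.

```python
import math
from collections import Counter


def hex_cell(x, y, size):
    """Return the axial (q, r) index of the pointy-top hexagon containing (x, y)."""
    # Fractional axial coordinates of the point.
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    s = -q - r  # third cube coordinate; q + r + s == 0
    # Round each coordinate, then recompute the one with the largest rounding
    # error so that the constraint q + r + s == 0 still holds.
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return rq, rr


def hexbin_counts(points, size):
    """Count how many points fall into each hexagonal cell."""
    return Counter(hex_cell(x, y, size) for x, y in points)


# Three points clustered near the origin and one far away (toy data).
points = [(0.0, 0.0), (0.1, 0.1), (-0.1, 0.05), (10.0, 0.0)]
counts = hexbin_counts(points, size=1.0)
print(counts)
```

In practice a plotting library would draw one hexagon per key in `counts` and map the count to a color, which keeps dense regions readable.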

I’m not convinced that plotting molecular weight vs LogP is the best way to represent the diversity and overlap of datasets. An alternative might be to use a technique like t-SNE to project a set of fingerprints into a lower dimensional space:
https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html.
RESPONSE: Following the recommendation, we now compare the three data sets using TMAPs. For the description of compounds, we selected Morgan fingerprints with radius 2.

It might also be interesting to provide another plot which only shows the molecules from the three datasets studied that are in “drug-like” chemical space rather than the very large range shown in Figure 3.
RESPONSE: In agreement with the suggestion, in the newly added TMAPs (Figure 4 in the revised manuscript) we included a visualization of drug-like subsets of the three databases.

Competing Interests

No competing interests were disclosed.

Alongside their report, reviewers assign a status to the article:

Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested

Approved with reservations - a number of small changes, and sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.

Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions

[1] Kinghorn AD, Falk H, Gibbons S, et al.: eds. Progress in the Chemistry of Organic Natural Products 110: Cheminformatics in Natural Product Research. Cham: Springer International Publishing. 2019; 110.

[2] Medina-Franco JL: New Approaches for the Discovery of Pharmacologically-Active Natural Compounds. Biomolecules. 2019; 9(3): pii: E115.

[3] Chen Y, Garcia de Lomana M, Friedrich NO, et al.: Characterization of the Chemical Space of Known and Readily Obtainable Natural Products. J Chem Inf Model. 2018; 58(8): 1518–1532.

[4] Pilón-Jiménez BA, Saldívar-González FI, Díaz-Eufracio BI, et al.: BIOFACQUIM: A mexican compound database of natural products. Biomolecules. 2019; 9(1): pii: E31.

[5] Ertl P: An algorithm to identify functional groups in organic molecules. J Cheminform. 2017; 9(1): 36.

[6] Ertl P, Schuhmann T: A Systematic Cheminformatics Analysis of Functional Groups Occurring in Natural Products. J Nat Prod. 2019; 82(5): 1258–1263.

[7] Gaulton A, Hersey A, Nowotka M, et al.: The ChEMBL database in 2017. Nucleic Acids Res. 2017; 45(D1): D945–D954.

[8] Gu J, Gui Y, Chen L, et al.: Use of natural products as chemical library for drug discovery and network pharmacology. PLoS One. 2013; 8(4): e62839.

[9] van Santen JA, Jacob G, Singh AL, et al.: The natural products atlas: an open access knowledge base for microbial natural products discovery. ACS Cent Sci. 2019; 5(11): 1824–1833.

[10] Ming H, Tiejun C, Yanli W, et al.: Web search and data mining of natural products and their bioactivities in PubChem. Sci China Chem. 2013; 56(10): 1424–1435.

[11] Bemis GW, Murcko MA: The properties of known drugs. 1. Molecular frameworks. J Med Chem. 1996; 39(15): 2887–2893.

[12] González-Medina M, Prieto-Martínez FD, Owen JR, et al.: Consensus Diversity Plots: a global diversity analysis of chemical libraries. J Cheminform. 2016; 8: 63.

[13] Esquivel B, Bustos-Brito C, Sánchez-Castellanos M, et al.: Structure, Absolute Configuration, and Antiproliferative Activity of Abietane and Icetexane Diterpenoids from Salvia ballotiflora. Molecules. 2017; 22(10): pii: E1690.

[14] Delgado G, del Socorro Olivares M, Chávez MI, et al.: Antiinflammatory constituents from Heterotheca inuloides. J Nat Prod. 2001; 64(7): 861–864.

[15] Martínez-Luis S, González MC, Ulloa M, et al.: Phytotoxins from the fungus Malbranchea aurantiaca. Phytochemistry. 2005; 66(9): 1012–1016.

[16] Leyte-Lugo M, Figueroa M, González Mdel C, et al.: Metabolites from the endophytic [corrected] fungus Sporormiella minimoides isolated from Hintonia latiflora. Phytochemistry. 2013; 96: 273–278.

[17] Lipinski CA, Lombardo F, Dominy BW, et al.: Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev. 2001; 46(1–3): 3–26.

Functional group and diversity analysis of BIOFACQUIM: A Mexican natural product database

Abstract

Keywords

Introduction

Methods

Databases and data curation

Table 1. Compound databases analyzed in this work and summary of the scaffold diversity.

Databases overlap

Functional group analysis

Chemical space visualization

Global diversity

Results and discussion

Update of BIOFACQUIM

Database overlap

Figure 1. Overlap between BIOFACQUIM, reference natural products (NPs) and ChEMBL.

Functional group analysis

Figure 2. Compounds with unique functional groups from BIOFACQUIM.

Chemical space visualization

Figure 3. Visual representation of the chemical space covered by BIOFACQUIM.

Global diversity

Figure 4. Consensus diversity plot of BIOFACQUIM.

Conclusions

Data availability

Underlying data

Extended data

Acknowledgments

References
