Functional group and diversity analysis of BIOFACQUIM: A Mexican natural product database

Background: Natural product databases are important in drug discovery and other research areas. An analysis of their structural content, as well as functional group occurrence, provides a useful overview, as well as a means of comparison with related databases. BIOFACQUIM is an emerging database of natural products isolated and characterized in Mexico. Herein, we discuss the results of a first systematic analysis of the functional groups and global diversity of an updated version of BIOFACQUIM.
Methods: BIOFACQUIM was augmented through a literature search and data curation. A structural content analysis of the dataset was performed. This involved a functional group analysis with a novel algorithm that automatically identifies all functional groups in a molecule and an assessment of the global diversity using consensus diversity plots. To this end, BIOFACQUIM was compared to two large reference databases: ChEMBL 25, and a herein assembled collection of natural products with 169,839 unique compounds.
Results: The structural content analysis showed that 15.7% of compounds and 11.6% of scaffolds present in the current version of BIOFACQUIM have not been reported in the other large reference datasets. It also showed an increase in diversity in terms of scaffolds and molecular fingerprints relative to the previous version of the dataset, as well as a higher similarity to the assembled collection of natural products than to ChEMBL 25, in terms of diversity and frequent functional groups.
Conclusions: A total of 148 natural products were added to BIOFACQUIM, which led to a diversity increase in terms of scaffolds and fingerprints. Despite its relatively small size, the database contains a significant number of compounds and scaffolds that are not present in the reference datasets, showing that curated databases of natural products, such as BIOFACQUIM, can serve as a starting point to increase the biologically relevant chemical space.


This article is included in the Chemical Information Science gateway.
This article is included in the Mathematical, Physical, and Computational Sciences collection.

Introduction
Natural products continue to play an important part in drug discovery. Recently, the synergy of natural product research with molecular modeling and chemoinformatics has been gaining importance, speeding up the drug discovery process 1,2. As part of these synergistic efforts, curated databases of natural products have an important role as they are major tools for data mining, hypothesis generation, and starting points of virtual screening. There are several databases of natural products in the public domain, as reviewed recently 3. Our research group has reported initial efforts to assemble a database of natural products from Mexico called BIOFACQUIM 4. As part of that work, an initial analysis of scaffold content and of chemical space diversity and coverage was reported. However, a detailed functional group (FG) content analysis, which has proven valuable to characterize compound databases 5, in particular those from natural sources 6, has not been reported for BIOFACQUIM. One of the main reasons is that most of the currently available software for the identification of functional groups relies on a predefined set of substructures, even though it has been established that one of the major features that discriminate natural products from synthetic compounds is their unique functional groups.
Herein, we report a functional group content analysis of an updated version of BIOFACQUIM. We employed a validated and novel algorithm that identifies all functional groups in a molecule. As part of the analysis and to compare the results of BIOFACQUIM we also discuss the functional group contents of other large and related databases in the public domain, namely ChEMBL 25 7 and a herein assembled collection of natural products (NPs) with 169,839 compounds.

Methods
Databases and data curation
As described elsewhere, the first version of BIOFACQUIM was developed as a proof-of-concept database applying several filters to include compounds 4. Briefly, the database was focused on natural products published between 2000 and 2018 by research groups in a major Mexican institution in eight indexed journals. As an additional criterion for inclusion, and to increase the quality and reliability of the contents of the database, the articles had to describe the procedure for the isolation, purification, and characterization of the natural product. In this work, we decided to expand the contents of the BIOFACQUIM database to further explore the diversity of natural products from Mexico.
The second version of BIOFACQUIM was assembled using the same methodology used to develop the first version 4. To make the database more representative of Mexico, we increased the number of Mexican institutions (universities, research laboratories, and research centers) whose publications on novel natural products were considered. For the new version of the database, the same curation procedure was performed 4, using Molecular Operating Environment (MOE) software. A similar procedure can be carried out using RDKit. In addition, only natural products collected in Mexico were stored in the database. The updated and curated version of BIOFACQUIM contains 531 compounds. Table 1 summarizes the information of BIOFACQUIM and the other major compound databases used in this work as reference: ChEMBL 25, as a representative example of the biologically tested chemical space, with 1,667,893 unique compounds; and a collection of known natural products with a total of 169,839 molecules.
The reference natural product collection was assembled from three general and publicly available natural products databases: the Universal Natural Products Database (UNPD) 8, the Natural Products Atlas 9 and Natural Products in PubChem Substance Database 10. The data sets were curated using the same procedure. Briefly, compounds consisting of multiple components (e.g. salts) were split and the largest component (defined by the number of heavy atoms) was retained, as long as the number of heavy atoms of the second largest component was less than 70% of that of the largest component. If this was not the case, the compounds were removed from the dataset, unless the two largest components were identical. Single component compounds, as well as the retained component of multiple component compounds, were neutralized, and canonical simplified molecular-input line-entry system (SMILES) strings (ignoring stereochemistry information) were generated as the molecular representation. Compounds containing any element other than H, B, C, N, O, F, Si, P, S, Cl, Se, Br and I, as well as compounds with valence errors, were removed from the dataset. Duplicate structures within each database were also removed. The entire process was performed using the open source cheminformatics toolkit RDKit.
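The component-retention and element-filter rules described above can be illustrated with a minimal sketch. It is not the authors' curation code: it uses only the Python standard library and assumes that each component has already been reduced to a canonical SMILES string and a heavy-atom count (in the paper these would come from RDKit). The names `retained_component` and `has_only_allowed_elements` are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

# (canonical SMILES, number of heavy atoms); both values are assumed to be
# precomputed with a cheminformatics toolkit such as RDKit.
Component = Tuple[str, int]

# Elements accepted by the curation procedure described in the text.
ALLOWED_ELEMENTS = {"H", "B", "C", "N", "O", "F", "Si", "P", "S", "Cl", "Se", "Br", "I"}


def retained_component(components: List[Component], threshold: float = 0.70) -> Optional[str]:
    """Return the SMILES of the component to keep, or None if the compound is dropped."""
    if not components:
        return None
    ranked = sorted(components, key=lambda c: c[1], reverse=True)
    largest = ranked[0]
    if len(ranked) == 1:
        return largest[0]
    second = ranked[1]
    # Keep the largest component when the second largest is clearly smaller.
    if second[1] < threshold * largest[1]:
        return largest[0]
    # Otherwise keep the compound only if the two largest components are identical.
    if second[0] == largest[0]:
        return largest[0]
    return None


def has_only_allowed_elements(symbols: Iterable[str]) -> bool:
    """True if every element symbol of the compound is in the allowed list."""
    return all(symbol in ALLOWED_ELEMENTS for symbol in symbols)


# Toy examples (not taken from BIOFACQUIM).
print(retained_component([("CC(=O)[O-]", 4), ("[Na+]", 1)]))     # salt: acetate is kept
print(retained_component([("c1ccccc1", 6), ("C1CCCCC1", 6)]))    # mixture: dropped
print(has_only_allowed_elements(["C", "O", "Na"]))               # Na is not allowed
```

Comparing components by canonical SMILES to decide whether they are "identical" is an assumption of this sketch; the text does not state how identity was tested.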

Databases overlap
The overlap of BIOFACQUIM with the databases selected as reference was assessed at three different structural levels: compounds, scaffolds, and functional groups. Compound overlap was determined in terms of canonical SMILES. For scaffold comparison we used the definition proposed by Bemis and Murcko 11 as implemented in RDKit, while for functional group overlap we selected the recently published definition and implementation suggested by Ertl 5. For each structural level, we identified the unique structures belonging to each dataset as well as those belonging to two or three of them.
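As an illustration of the overlap calculation, the following sketch computes the Venn regions that involve one dataset, given three sets of canonical strings (SMILES of compounds, scaffolds, or functional groups). The function name and the toy data are hypothetical, and the sketch assumes that identical structures have identical strings.

```python
from typing import Dict, Set


def venn_regions(target: Set[str], ref_a: Set[str], ref_b: Set[str]) -> Dict[str, Set[str]]:
    """Split the target set into the four Venn regions defined by two reference sets."""
    return {
        "unique": target - ref_a - ref_b,
        "shared_with_a_only": (target & ref_a) - ref_b,
        "shared_with_b_only": (target & ref_b) - ref_a,
        "shared_with_both": target & ref_a & ref_b,
    }


# Toy data (not taken from the actual databases).
biofacquim = {"CCO", "c1ccccc1", "CC(=O)O", "CCN"}
nps = {"CCO", "c1ccccc1", "CCCC"}
chembl = {"CCO", "CC(=O)O", "CCCl"}

regions = venn_regions(biofacquim, nps, chembl)
for name, members in regions.items():
    share = 100 * len(members) / len(biofacquim)
    print(f"{name}: {len(members)} structures ({share:.1f}% of the target set)")
```

The same function can be applied unchanged at each of the three structural levels, since only the strings placed in the sets differ.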

Functional group analysis
For the functional group content analysis we selected the algorithm recently described by Ertl 5, which is able to identify all functional groups in a molecule based on an iterative marching through its atoms. In short, the algorithm identifies all heteroatoms in a molecule, all atoms connected by multiple bonds, as well as the atoms in oxirane, aziridine, and thiirane rings. Afterwards, all connected marked atoms are joined together to form a functional group. Single aromatic heteroatoms are retained only if they are connected to an additional aliphatic functionality. Finally, a generalization scheme is applied in which, for a defined list of common FGs, information about the parent carbon is retained (e.g. to differentiate between alcohols and phenols) as well as hydrogen atoms (e.g. to differentiate between aldehydes and ketones). The method is fully described in reference 5. An open source version of this algorithm is available for Python (https://github.com/rdkit/rdkit/tree/master/Contrib/IFG); however, it does not cover the generalization scheme proposed originally. To this end, and based on the available code, we implemented with RDKit a fragmentation approach that retains the parent carbons and hydrogens as proposed originally, where the remaining carbon atoms are replaced by dummy atoms. This implementation takes a SMILES string as input and returns a list with the canonical SMILES of the FGs identified in the molecule. The code is freely available at GitHub (https://github.com/DIFACQUIM/IFG_General). After determining the FG content of the different datasets, we compared the proportion of the most frequent FGs in each library.
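The mark-and-merge idea behind the algorithm can be sketched on a hand-built molecular graph. This is a deliberately simplified illustration and not the RDKit implementation referred to above: it marks only heteroatoms and atoms joined by double or triple bonds, and it omits the three-membered ring rule, the aromatic heteroatom rule, and the generalization scheme. The graph encoding (heavy atoms only, bonds as index pairs with an integer order, aromatic bonds assumed to carry an order other than 2 or 3) and the function name are assumptions of this sketch.

```python
from typing import Dict, List, Set, Tuple

Bond = Tuple[int, int, int]  # (atom index, atom index, bond order)


def functional_group_atoms(elements: List[str], bonds: List[Bond]) -> List[Set[int]]:
    """Return the sets of atom indices that form each (simplified) functional group."""
    # Step 1: mark every heteroatom (hydrogens are implicit in this encoding).
    marked = {i for i, element in enumerate(elements) if element not in ("C", "H")}
    # Step 2: mark both atoms of every double or triple bond.
    for i, j, order in bonds:
        if order in (2, 3):
            marked.update((i, j))
    # Step 3: merge marked atoms that are directly bonded to each other.
    neighbours: Dict[int, Set[int]] = {i: set() for i in marked}
    for i, j, _ in bonds:
        if i in marked and j in marked:
            neighbours[i].add(j)
            neighbours[j].add(i)
    groups: List[Set[int]] = []
    seen: Set[int] = set()
    for start in sorted(marked):
        if start in seen:
            continue
        group: Set[int] = set()
        stack = [start]
        while stack:
            atom = stack.pop()
            if atom in group:
                continue
            group.add(atom)
            stack.extend(neighbours[atom] - group)
        seen |= group
        groups.append(group)
    return groups


# Glycolic acid, OCC(=O)O, written as a heavy-atom graph.
elements = ["O", "C", "C", "O", "O"]
bonds = [(0, 1, 1), (1, 2, 1), (2, 3, 2), (2, 4, 1)]
print(functional_group_atoms(elements, bonds))  # the hydroxyl O and the carboxyl C, O, O
```

In this toy case the hydroxyl oxygen and the carboxylic acid atoms end up in separate groups because the carbon between them is not marked, which is the behaviour the merging step is meant to produce.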

Chemical space visualization
In order to generate a visual representation of the chemical space covered by the analyzed databases, six molecular properties of pharmaceutical interest were computed for each unique compound: average molecular weight (AMW), octanol/water partition coefficient (SlogP), number of hydrogen bond donors (HBD), number of hydrogen bond acceptors (HBA), number of rotatable bonds (RB), and topological polar surface area (TPSA). The correlation matrix was computed and a visual representation of the chemical space was generated using the two properties with the lowest correlation: AMW and SlogP. All calculations were done with RDKit.
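The axis selection can be sketched as follows: compute the Pearson correlation for every pair of properties and keep the pair with the lowest absolute correlation. The sketch uses only the standard library; the property values are invented for illustration, and the use of the Pearson coefficient and of the absolute value are assumptions, since the text does not specify them.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def least_correlated_pair(props: Dict[str, List[float]]) -> Tuple[str, str, float]:
    """Return the two property names with the lowest absolute correlation, and its value."""
    scored = [(a, b, pearson(props[a], props[b])) for a, b in combinations(props, 2)]
    return min(scored, key=lambda item: abs(item[2]))


# Invented values for four hypothetical compounds.
properties = {
    "AMW": [100.0, 200.0, 300.0, 400.0],
    "SlogP": [1.0, -1.0, -1.0, 1.0],
    "TPSA": [10.0, 50.0, 60.0, 80.0],
}
print(least_correlated_pair(properties))
```

With real data the same function would be called with all six property columns.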

Global diversity
The "global" or total diversity of the datasets was analyzed through the Consensus Diversity (CD) Plot 12, which was designed to capture, in a two-dimensional graph, the diversity of chemical libraries considering different and complementary criteria, including molecular fingerprints, molecular scaffolds, physicochemical properties, and size. For this work, scaffold diversity was assessed as the area under the cyclic system retrieval curve and the fraction of chemotypes that covers 50% of the dataset (F50). The median of the lower triangle of the pairwise similarity matrix, computed with the Tanimoto coefficient for both the MACCS keys (166-bit) fingerprint and the Morgan fingerprint with radius 2 (Morgan2), was used as the molecular fingerprint-based diversity. The mean of the lower triangle of the pairwise distance matrix, computed as the Euclidean distance over the six scaled molecular properties (mean 0 and unit variance), and the number of compounds in each dataset were selected as measures of the physicochemical property-based and size-based diversity, respectively. The CD Plot was constructed using R.
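The scaffold and fingerprint metrics named above can be sketched with the standard library. The sketch assumes that each compound has already been assigned a scaffold string and a fingerprint expressed as a set of on-bit indices. F50 is taken here, without interpolation, as the smallest fraction of the most populated scaffolds that covers at least half of the compounds, which is one possible convention and not necessarily the one used by the authors. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from statistics import median
from typing import FrozenSet, List, Tuple

Point = Tuple[float, float]  # (fraction of scaffolds, fraction of compounds retrieved)


def csr_curve(scaffolds: List[str]) -> List[Point]:
    """Cyclic system retrieval curve, with scaffolds ranked from most to least frequent."""
    counts = sorted(Counter(scaffolds).values(), reverse=True)
    n_scaffolds, n_compounds = len(counts), len(scaffolds)
    points, cumulative = [(0.0, 0.0)], 0
    for rank, count in enumerate(counts, start=1):
        cumulative += count
        points.append((rank / n_scaffolds, cumulative / n_compounds))
    return points


def area_under_curve(points: List[Point]) -> float:
    """Trapezoidal area under the curve (0.5 is the most diverse case, 1.0 the least)."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


def f50(points: List[Point]) -> float:
    """Smallest fraction of scaffolds that retrieves at least 50% of the compounds."""
    return next(x for x, y in points if y >= 0.5)


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def median_pairwise_similarity(fingerprints: List[FrozenSet[int]]) -> float:
    """Median over all unordered pairs, i.e. the lower triangle of the similarity matrix."""
    return median(tanimoto(a, b) for a, b in combinations(fingerprints, 2))


# Toy data: eight compounds over four scaffolds, and three small fingerprints.
curve = csr_curve(["A", "A", "A", "A", "B", "B", "C", "D"])
print(area_under_curve(curve), f50(curve))
print(median_pairwise_similarity([frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({1, 2, 3, 4})]))
```

A lower area under the curve and a higher F50 both indicate higher scaffold diversity, while a lower median similarity indicates higher fingerprint diversity.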

Results and discussion
Update of BIOFACQUIM
As described in the Methods section, the updated BIOFACQUIM database contains the chemical structures of 531 compounds, all collected in Mexico. As with the first version 4, each molecule is annotated with information on the chemical structure, the original source of the information (Digital Object Identifier, DOI, of the reference paper), the kingdom, genus, and species of the organism from which the natural product was isolated, the place of collection (city and state), and the activity value of the reported biological activity. Of the 423 compounds in the original dataset, 40 were discarded because they were not collected in Mexico, leaving 383; the 531 compounds in the current version therefore represent an increase of 148 unique compounds compared to the previous release of the database.

Database overlap
To assess the chemical space covered by BIOFACQUIM but not by ChEMBL and NPs, we characterized the structural content of the three datasets in terms of unique compounds, scaffolds and functional groups and determined the overlap among them. Figure 1 depicts Venn diagrams showing the overlap among those datasets. It should be noted that, despite its small size in comparison with ChEMBL and NPs, 16.1% of the compounds present in BIOFACQUIM have not been reported in those other major datasets, as well as 11.3% of its scaffolds and 6.3% of its FGs. Another remarkable observation is that most of the overlap of BIOFACQUIM with the other datasets involves NPs, either alone or in combination with ChEMBL, accounting for 79.7%, 85.5%, and 93.8% of compounds, scaffolds and FGs, respectively.

Functional group analysis
A systematic analysis of functional groups was carried out on BIOFACQUIM and the two datasets selected as reference. A total of 80, 4,300, and 12,677 functional groups were identified in the BIOFACQUIM, NPs and ChEMBL datasets, respectively (the overlap between them is shown in Figure 1).  16 . Figure 2 shows the chemical structure of those compounds.

Chemical space visualization
For visualization of the chemical space of the datasets compared in this work, we computed six physicochemical properties of pharmaceutical relevance. As described in the Methods section, we computed the correlation matrix between them (Extended data, Table S1 in Supplementary File 2) and, since all descriptors were highly correlated (correlation coefficient > 0.60) with the exception of SlogP, we selected SlogP and the property least correlated with it, AMW, as the axes on which to project the libraries. We also carried out a principal component analysis over the scaled properties. As expected from the correlation matrix, the first principal component was associated with all properties except SlogP, whereas the second was associated mainly with SlogP, with explained variances of 69% and 19%, respectively. Therefore, the projection of the data onto those two components looks quite similar to the projection onto the selected properties (SlogP and AMW) (Extended data, Figure S1 in Supplementary File 2), with the difference that the principal components do not have the physical interpretation that the properties do, e.g. the orally available drug space as described by Lipinski 17. Figure 3 shows a visual representation of the chemical space of the three datasets analyzed in this work. For this plot, in order to better illustrate the unique compounds present in BIOFACQUIM, compounds belonging to two libraries were assigned to a single one: ChEMBL if they belong to this dataset, and NPs if they belong to this dataset but not to ChEMBL. Compounds shared among the three libraries were assigned to a separate category. This figure shows that ChEMBL is the most widespread dataset in terms of physicochemical properties and that NPs, BIOFACQUIM, and the compounds shared among the three datasets are mostly contained within this space.
It can be noted that a few BIOFACQUIM compounds cover a small part of the space currently unexplored by ChEMBL. It is also remarkable that, contrary to ChEMBL and NPs, most BIOFACQUIM compounds fall within the orally available drug space. Panels showing all the different overlaps between datasets can be found in Extended data (Figure S2 in Supplementary File 2).
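As a rough illustration of what membership in the orally available drug space can mean, the following sketch counts violations of the commonly cited rule-of-five thresholds (molecular weight above 500, logP above 5, more than 5 hydrogen bond donors, more than 10 hydrogen bond acceptors). The exact boundaries used for Figure 3 are not stated in the text, so these thresholds, the tolerance of one violation, the dictionary keys, and the function names are all assumptions of this sketch.

```python
from typing import Dict


def rule_of_five_violations(props: Dict[str, float]) -> int:
    """Count how many of the four commonly cited rule-of-five thresholds are exceeded."""
    checks = (
        props["AMW"] > 500,    # molecular weight
        props["SlogP"] > 5,    # octanol/water partition coefficient
        props["HBD"] > 5,      # hydrogen bond donors
        props["HBA"] > 10,     # hydrogen bond acceptors
    )
    return sum(checks)


def in_oral_drug_space(props: Dict[str, float], max_violations: int = 1) -> bool:
    """True if the compound exceeds at most `max_violations` thresholds (assumed tolerance)."""
    return rule_of_five_violations(props) <= max_violations


# Invented property values for two hypothetical compounds.
small = {"AMW": 350.0, "SlogP": 2.1, "HBD": 2, "HBA": 5}
large = {"AMW": 900.0, "SlogP": 7.5, "HBD": 8, "HBA": 15}
print(in_oral_drug_space(small), in_oral_drug_space(large))
```

Applied to a whole library, the fraction of compounds for which `in_oral_drug_space` returns true gives a single number that can be compared across datasets.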

Global diversity
In order to compare the chemical diversity of the current version of BIOFACQUIM with the previous one and with the two datasets selected as reference, we employed a CD plot. Figure 4 shows the plot comparing the diversity of all datasets considering four different criteria: scaffolds on the y-axis, molecular fingerprints on the x-axis, physicochemical properties as the fill of the data points on a continuous color scale, and number of compounds as the size of the data points. This comparison shows the relatively small size of BIOFACQUIM in comparison with the reference datasets. Compared to the previous release of BIOFACQUIM, the current version has increased its diversity in terms of scaffolds and fingerprints but decreased it in terms of physicochemical properties. Also, its diversity in terms of molecular fingerprints and physicochemical properties, although not the greatest of the three datasets, is on the order of that of NPs, in contrast to scaffolds, for which it is the most diverse.

Conclusions
The current version of BIOFACQUIM involved the addition of 148 natural products. This was reflected in a diversity increase based on both scaffolds and molecular fingerprints. It was shown that, in terms of diversity and structural content overlap, BIOFACQUIM is more similar to the assembled set of natural products than to the set of biologically tested compounds. The chemoinformatic study reported herein revealed that most of the compounds contained in BIOFACQUIM fall within the orally available drug space in terms of physicochemical properties. Interestingly, despite its relatively small size, a significant number of compounds, scaffolds and functional groups (81, 28 and 5, respectively) were identified that were not present in the two large sets used as reference, showing that curated databases of natural products, such as BIOFACQUIM, can serve as a starting point for the study and expansion of the biologically relevant chemical space.

Data availability
This file contains the chemical structures of 531 compounds in SDF format, alongside ID number, compound name, simplified molecular input line entry system, literature reference,
o Figure S1. Visual representation of the chemical space covered by the three compound libraries generated by principal component analysis.
o Figure S2. Visual representation of the chemical space covered by the possible overlaps between libraries.
Extended data are available under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).
No competing interests were disclosed.
While the paper is well written, there are a few areas where changes in the text may make the presentation a bit more clear: The second sentence in the abstract could be changed to read: "An analysis of its structural content, as well as functional group occurrence, provides a useful overview, as well as a means of comparison with related databases." In the first sentence of the "Methods" section, the authors may want to consider changing "done" to "performed".
In the Introduction, the sentence "As part of that work, it was reported an initial scaffold content, and chemical space diversity, and coverage" could be changed to "As part of that work, scaffold content and chemical space diversity were examined".
In the first paragraph of the "Results and discussion" section, "place of recollection" would be better stated as "place of collection".
In the methods section, the authors detail their procedure for "cleaning" the databases. It would be useful to others if the source code for these procedures was made available as part of the GitHub repository associated with this paper. The authors used canonical SMILES to identify exact matches between compounds in different databases. While this is a reasonable approach, canonical SMILES do not standardize tautomers and may miss some duplicates. An alternate approach would be to use InChI keys, which canonicalize tautomeric forms.
The description of the "Consensus Diversity (CD) Plot" in the "Global diversity" section is a bit vague. An additional paragraph describing this method would be useful.
The density of points in Figure 3 obscures many of the points in the plot. The authors may want to consider using a plotting technique like hexagon binning, which enables the plotting of large datasets without obscuring points. For example, see these links: https://datavizproject.com/data-type/hexagonal-binning/ https://rdrr.io/cran/som.nn/man/hexbinpie.html
I'm not convinced that plotting molecular weight vs LogP is the best way to represent the diversity and overlap of datasets. An alternative might be to use a technique like t-SNE to project a set of fingerprints into a lower dimensional space: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
It might also be interesting to provide another plot which only shows the molecules from the three datasets studied that are in "drug-like" chemical space rather than the very large range shown in Figure 3.
In summary, this paper provides a valuable resource to researchers working in a number of different areas. Hopefully, these suggestions will provide minor enhancements to the work.

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Computational Chemistry, Cheminformatics, Drug Design
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Reviewer Expertise: cheminformatics, natural products research
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.