Keywords
Biodiversity; Chemical diversity; Chemoinformatics; Natural products; Chemical multiverse; Chemical space
This article is included in the Cheminformatics gateway.
Natural products databases are well-structured data sources that offer new molecular development opportunities in drug discovery, agrochemistry, food, cosmetics, and several other research disciplines and chemical industries. The growing worldwide interest in developing these databases is related to the exploration of chemical diversity in geographical regions with rich biodiversity.
In this work, we introduce and discuss Nat-UV DB, the first natural products database from a coastal zone of Mexico. We discuss its construction, curation, and the chemoinformatic characterization of its content, and compare its chemical space coverage with that of other compound databases: approved drugs, two other Mexican natural products databases (BIOFACQUIM and UNIIQUIM), and the Latin American natural products database (LANaPDB).
Nat-UV DB comprises 227 compounds that contain 112 scaffolds, of which 52 are not present in previous natural product databases. The compounds present in Nat-UV DB have a similar size, flexibility, and polarity to previously reported natural products and approved drug datasets.
Nat-UV DB compounds have a higher structural and scaffold diversity than the approved drugs, but a lower structural and scaffold diversity than other natural products in the reference datasets. This database serves as a valuable addition to the global natural products landscape, bridging gaps in exploring biodiversity-rich regions.
Mexico is one of the most biodiverse countries in the world and is home to a large number of endemic organisms.1,2 The state of Veracruz is a coastal region on the Gulf of Mexico with highly diverse geographic landscapes and weather conditions, which have contributed to its rich biodiversity; it is considered one of the most biodiverse states in the country.3 It has been reported that Veracruz houses 34% of the total species in Mexico, which highlights the importance of the systematic study of their chemical diversity.4
Natural products have demonstrated their key role in developing new drugs, materials, nutraceuticals, pesticides, and insecticides, which justifies their study.5,6 Nowadays it is possible to establish structure-property relationships using bioinformatics and chemoinformatics methodologies.7 To achieve this goal, it is necessary to compile, organize, and curate the databases. Recent efforts in Latin America have focused on the construction of natural products databases, which have contributed to the understanding of Latin American traditional medicine and to accelerating the rational use of natural products in this geographical region.8 In Mexico in particular, there are two compound databases9,10 that focus mainly on natural products identified in the central zone of the country. However, there are no reports of databases from the most biodiverse regions of the country.
Figure 1 illustrates representative chemical structures obtained from natural resources collected in the state of Veracruz, which are distinguished by their structural diversity,11–21 and by their broad range of applications in medicine (e.g. to develop antimicrobial and anticancer drugs), cosmetology (e.g. to develop skincare molecules), nutrition (e.g. to develop nutraceuticals), agriculture (e.g. to develop biopesticides and insecticides), and many other fields.22–25
Because chemoinformatics methods now make it possible to establish highly efficient virtual screening protocols, natural product databases have attracted worldwide interest over the past 20 years.26,27 Multiple commercial and open-access natural product databases are available as valuable resources for molecular design. Databases are expected to continue to grow in number and type, for example by organizing data and information on natural products according to their reported biological activity, chemical characterization method, geolocation, natural source, commercial availability, etc.27 The web application COCONUT 2.0 (the COlleCtion of Open NatUral producTs), freely available at https://coconut.naturalproducts.net/, is an excellent resource that unifies and standardizes multiple natural product databases,28 which facilitates the systematic filtering of multipurpose data useful for chemoinformatic and natural products research.
The main objective of this work is to introduce Nat-UV DB, a database of natural products isolated and characterized in the state of Veracruz, Mexico. We also discuss a systematic analysis using chemoinformatics methods, identifying endemic natural products, and studying their chemical diversity.
The database of natural products from the state of Veracruz was assembled from a literature search. For the construction of the first version of Nat-UV DB, PubMed, Google Scholar, SciFinder, Redalyc, and the institutional repository of the Universidad Veracruzana (Mexico) were searched using the keywords “natural product”, “NMR”, and “Veracruz.” We collected information from research articles and from bachelor's, master's, and doctoral theses from universities and research centers. To complement the data mining, two additional criteria were applied in the final selection of the literature used to construct the database. The first was that the elucidation of the reported chemical structures was supported by nuclear magnetic resonance (NMR). The second was that the identified compounds were obtained from a natural source collected in any region of the state of Veracruz (Mexico). The search covered publications from 1970 to June 2024. We want to emphasize that this is the first version of Nat-UV DB; future versions will include natural products from more years and more research repositories, in order to assemble a database representative of the entire biodiversity of the state of Veracruz. For each collected molecule, its isomeric SMILES string29 was generated with ChemBioDraw Ultra V.13, maintaining the stereochemistry reported in the primary literature.30 The database was curated with the ‘Wash’ module of the Molecular Operating Environment (MOE) program, version 2024,31 without changing the reported stereochemistry of any molecule. This was done to normalize the structures and to retain the most relevant information about the molecules. The data curation involved the elimination of salts, the adjustment of protonation states, and the elimination of duplicated molecules. The default settings of the ‘Wash’ module were used.
The information collected for each identified compound is organized according to its natural origin (kingdom, genus, and species) and its geographical place of collection. Finally, the list of curated compounds was manually cross-referenced to the PubChem32 and ChEMBL v.3433 databases, which enabled the annotation of the database with the bioactivities that have been associated with each chemical structure (Figure 2). For compounds reported in theses and evaluated in a biological test, the biological activity was also included in the database.
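The salt-removal and de-duplication steps described above can be illustrated with a minimal Python sketch. It is not the MOE ‘Wash’ procedure used in this work: it is a simplified stand-in that keeps the largest dot-separated SMILES fragment (judged by a crude letter count) and drops exact string duplicates. The function names and the example SMILES are hypothetical, and protonation-state adjustment is not covered.

```python
def strip_salts(smiles: str) -> str:
    """Keep the largest fragment of a dot-separated SMILES string.

    Assumption: fragment size is approximated by counting letters,
    a crude proxy for atom count (this is not how MOE 'Wash' works).
    """
    fragments = smiles.split(".")
    return max(fragments, key=lambda frag: sum(ch.isalpha() for ch in frag))


def curate(smiles_list):
    """Strip counter-ions and drop exact duplicates, preserving input order."""
    seen = set()
    curated = []
    for smi in smiles_list:
        parent = strip_salts(smi.strip())
        if parent and parent not in seen:
            seen.add(parent)
            curated.append(parent)
    return curated


# Hypothetical input: sodium acetate, the bare acetate anion
# (same parent after salt removal), and ethanol
raw = ["CC(=O)[O-].[Na+]", "CC(=O)[O-]", "CCO"]
print(curate(raw))
```

Exact string matching only detects duplicates written with identical SMILES; a real curation pipeline would first convert every structure to a canonical representation with a cheminformatics toolkit.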
To characterize the chemical diversity of Nat-UV DB and to explore its chemical space coverage, approved drugs34 and the Latin American natural products database (LANaPDB)35 were used to compare their chemical structures and properties. The structure files used in this work were taken from open repositories of previously published analyses of natural products databases.36 The structures of the reference compounds were curated using the same procedure described for Nat-UV DB. Table 1 summarizes Nat-UV DB and the reference databases, together with the number of compounds in each. Of note, the reference collections include data sets of natural products, two of them from Mexico.
Database | Description | Size* | Reference |
---|---|---|---|
Approved drugs (DrugBank v. 2024.0) | Drugs approved for clinical use | 2,144⧫ | 34 |
LANaPDB 2.0 | Latin American natural products database with chemicals from Brazil, Colombia, Costa Rica, El Salvador, Mexico, Panama, and Peru | 13,579 | 36 |
BIOFACQUIM | Natural products from Mexico | 531 | 37 |
UNIIQUIM | Natural products from Mexico | 855 | 10 |
Nat-UV DB | Natural products from the state of Veracruz (Mexico) | 227 | - |
The curated Nat-UV DB database was characterized by calculating six physicochemical properties of pharmaceutical interest, namely: molecular weight (MW), octanol/water partition coefficient (ClogP), topological polar surface area (TPSA), number of rotatable bonds (RB), number of H-bond donor atoms (HBD), and number of H-bond acceptor atoms (HBA). The statistical analysis was done with the program DataWarrior v.06,38 by calculating the mean, median, and standard deviation of the calculated properties. Based on these statistics, Nat-UV DB was further compared with the other databases (LANaPDB, BIOFACQUIM, UNIIQUIM, and approved drugs from DrugBank) (Table 1). The systematic comparison was generated using the Python programming language. The code is freely available at https://github.com/EdgL2/Nat-UV-DB.
The most frequent and unique molecular scaffolds of Nat-UV DB and the reference databases (Table 1) were computed using the scaffold definition of Bemis and Murcko.39 This analysis was done in Python; the code is freely available at https://github.com/EdgL2/Nat-UV-DB.
To generate a visual representation of the chemical space of Nat-UV DB, the ECFP4 fingerprint (1024 bits) was calculated for each compound40 and the visualization was done using t-distributed stochastic neighbor embedding (t-SNE).41 This visualization method was selected based on recent studies that support its utility for the systematic study of small and large datasets in terms of neighborhood preservation and visualization capabilities.42 The ECFP4 fingerprint and the t-SNE coordinates were calculated in KNIME software. The t-SNE parameters used were dimensions (3), iterations (10,000), theta (0.3), perplexity (30.0), and number of threads (8), with 28 as the seed number. The interactive visualization was implemented using DataWarrior software, version 06.43 The KNIME workflow and the data generated are freely available in the Software availability section.
To compare the chemical diversity of Nat-UV DB with the reference data sets, we employed the consensus diversity (CD) plot, a simple two-dimensional graph that helps to visualize and compare the diversity of several compound data sets considering multiple representations, such as chemical scaffolds and fingerprint-based diversity.44 In this study, the CD plot was generated using the median paired similarity (ECFP4, 1024 bits/Tanimoto; x-axis) and the median paired scaffold similarity (Bemis-Murcko representations using ECFP4, 1024 bits/Tanimoto; y-axis).44 Both are established, representative metrics of scaffold and fingerprint-based diversity.45 Subsets of the compounds were retrieved from the reference data sets (Table 1). The workflow implemented in KNIME software is available in the Software availability section.
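The per-property summary statistics can be sketched in a few lines of Python. This is an illustrative sketch only, not the DataWarrior calculation or the authors' published code: the dataset names and molecular weights are hypothetical, and the sample standard deviation is an assumption, since the source does not state which estimator was used.

```python
import statistics


def summarize(values):
    """Return mean, median, and sample standard deviation of one property."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),  # assumption: sample (n - 1) estimator
    }


# Hypothetical molecular weights (g/mol) for two small datasets
datasets = {
    "dataset_A": [300.0, 350.0, 400.0, 450.0],
    "dataset_B": [250.0, 300.0, 350.0],
}

for name, mw_values in datasets.items():
    stats = summarize(mw_values)
    print(f"{name}: mean={stats['mean']:.1f}, "
          f"median={stats['median']:.1f}, sd={stats['sd']:.1f}")
```

Running the same function over each of the six properties and each database yields the kind of table that allows a side-by-side comparison.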
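Once a scaffold has been computed for every compound, identifying shared and unique scaffolds reduces to set operations. The sketch below assumes the scaffolds are already available as canonical SMILES strings (for example, exported from a cheminformatics toolkit); the function name, the dataset names, and the scaffold strings are hypothetical and are not taken from the authors' repository.

```python
def scaffold_overlap(query, references):
    """Count query scaffolds shared with each reference set and
    return the scaffolds found in none of the references."""
    query = set(query)
    shared = {name: len(query & set(ref)) for name, ref in references.items()}
    all_reference_scaffolds = set()
    for ref in references.values():
        all_reference_scaffolds |= set(ref)
    unique = query - all_reference_scaffolds
    return shared, unique


# Hypothetical scaffold SMILES
query_scaffolds = {"c1ccccc1", "C1CCCCC1", "c1ccc2ccccc2c1"}
reference_scaffolds = {
    "ref_natural_products": {"c1ccccc1", "C1CCOC1"},
    "ref_drugs": {"c1ccccc1", "C1CCCCC1"},
}

shared, unique = scaffold_overlap(query_scaffolds, reference_scaffolds)
print(shared)          # number of shared scaffolds per reference set
print(sorted(unique))  # scaffolds present only in the query set
```

The comparison is only meaningful if all datasets use the same canonicalization, because two different SMILES strings for the same scaffold would be counted as distinct.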
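The median paired similarity used on each axis of the CD plot can be written compactly in Python. This is a minimal sketch rather than the KNIME workflow used in this work: fingerprints are represented as sets of on-bit indices, the example bit sets are hypothetical, and the function names are illustrative.

```python
from itertools import combinations
from statistics import median


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def median_pairwise_similarity(fingerprints):
    """Median Tanimoto similarity over all unique pairs in a dataset.

    Requires at least two fingerprints.
    """
    sims = [tanimoto(x, y) for x, y in combinations(fingerprints, 2)]
    return median(sims)


# Hypothetical on-bit sets standing in for 1024-bit ECFP4 fingerprints
fps = [{1, 2, 3, 4}, {3, 4, 5, 6}, {1, 2, 3, 7}]
print(median_pairwise_similarity(fps))
```

Applying the same function to the fingerprints of the whole molecules gives the x-axis value, and applying it to the fingerprints of the Bemis-Murcko scaffolds gives the y-axis value.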
In this section, we present the results of the construction of the Nat-UV database followed by a descriptive analysis of the contained data, and the chemoinformatic characterization in terms of physicochemical properties, scaffold content, chemical space coverage, and consensus chemical diversity.
As described in the Methods section, the scientific papers and theses that complied with the inclusion criteria were selected. Each of the 45 selected documents (1 doctoral thesis, 8 master's theses, and 36 research articles) was analyzed individually to manually extract the chemical structure of each identified natural product. Nat-UV DB contains information that identifies the bibliographic provenance of the data, for example: compound name, reference, digital object identifier (DOI), and publication year. It also contains data on the natural source, for example: kingdom, genus, species, and geographical location where the natural source was collected. Additionally, we added cross-referenced IDs from other databases (e.g. PubChem and ChEMBL). Finally, we manually cross-referenced each compound with its reported bioactivity contained in ChEMBL v. 34.
The current version of Nat-UV DB has 227 compounds collected from different geographical zones of Veracruz (Figure 3A), mainly isolated from plants of different genera (Figure 3B). For example, the genera Hyptis, Capsicum, Nidema, Dryopteris, Ipomoea, Azadirachta, Hamelia, Croton, and Guarea are among the most frequently studied. Organisms from other kingdoms also stand out, such as Aspergillus, Ganoderma, Colletotrichum, and Aegiale (Figure 3C). Figure 3D illustrates the distribution of compounds per year reported from 1970 to date. Finally, 79% of the compounds contained in this database have been associated with at least one bioactivity report (Figure 3E, F), which highlights compounds with anticancer and antimicrobial (antibacterial or antifungal) activities.
(A) Geographical collection of natural resources studied in this work, and the number of compounds obtained from each region; (B) Number of compounds contained in this database by genus; (C) Number of compounds contained in this database by species of origin; (D) Number of isolated compounds by decade; (E) Associated bioactivity for the compounds contained in this database; and (F) Multi-activity landscape of compounds contained in this database.
From the total number of compounds contained in Nat-UV DB (227 compounds), 112 scaffolds were identified, of which 52 (52/112; 46%) are unique. That is, Nat-UV DB contains 52 scaffolds that have not been reported previously in other Latin American datasets and that are not present in the scaffold collection of approved drugs (Figure 4A). The most representative unique scaffolds of Nat-UV DB are shown in Figure 4B, highlighting the presence of derivatives of limonoids, butyrolactones, flavones, pentacyclic triterpenes, etc. The full list of unique scaffolds is available in the Data availability section.
(A) Shared scaffolds of Nat-UV DB and reference datasets for natural products (LANaPDB, BIOFACQUIM, and UNIIQUIM) and approved drugs (DrugBank); (B) Representative unique scaffolds contained in Nat-UV DB.
There are previous reports of two Mexican natural products databases (Table 1), but BIOFACQUIM is the only one that includes geographical collection data. Interestingly, 74 compounds contained in that dataset were collected in the state of Veracruz (Mexico). This explains why 32 (32/112; 29%) scaffolds are shared between the two databases (Figure 4A). Also, 17 (17/112; 15%) scaffolds are shared between Nat-UV DB and UNIIQUIM, while 53 (53/112; 47%) scaffolds are shared between Nat-UV DB and LANaPDB. Finally, 24 (24/112; 21%) scaffolds are shared between Nat-UV DB and the approved drugs collection. In other words, Nat-UV DB contains some natural product scaffolds (60/112; 54%) that have been identified previously in Mexico and other Latin American countries or have been used as drugs.
Figure 5 shows violin plots of the distribution of the six drug-likeness properties calculated for Nat-UV DB. The distributions of the same properties for the reference datasets used in this work (Table 1) are included for comparison. Intrinsic molecular properties such as size, flexibility, and polarity are described by explicit molecular properties: molecular weight (MW), octanol/water partition coefficient (ClogP), number of H-bond acceptors and H-bond donors, polar surface area (PSA), and number of rotatable bonds (RB) (Figure 5A). Summary statistics are presented at the bottom of the violin plots (Figure 5B).
(A) The boxes inside the violins enclose data with values between the first and third quartiles; (B) Summary statistics are included below each plot. MW: molecular weight; ClogP: octanol/water partition coefficient; H-bond acceptors: number of H-bond acceptor atoms; H-Donors: number of H-bond donor atoms; PSA: polar surface area; RB: number of rotatable bonds.
According to Figure 5, the size (MW, HBA, and HBD), flexibility (RB), and permeability (PSA) profiles of Nat-UV DB are comparable with those of the reference datasets. However, the ClogP values of the compounds contained in Nat-UV DB, LANaPDB, BIOFACQUIM, and UNIIQUIM are higher than those of the approved drugs, and Nat-UV DB shows a narrower ClogP distribution than the other natural products databases. This finding agrees with previous reports indicating that natural products are slightly more hydrophobic than drugs approved for clinical use.36
Figure 6 shows a visual representation of the chemical space of Nat-UV DB based on the ECFP4 fingerprint using t-SNE. Figure 6B-E compares Nat-UV DB with approved drugs and with other natural products databases (i.e. LANaPDB, BIOFACQUIM, and UNIIQUIM). Interestingly, Nat-UV DB shares part of its chemical space with the approved drugs dataset, but Nat-UV DB compounds are more widely distributed across the three dimensions of the plot, which suggests that they have a higher structural diversity than the approved drugs dataset. However, LANaPDB (the largest dataset analyzed in this study) has an apparently higher structural diversity than the other studied datasets. To quantify the diversity of each dataset, we calculated the structural diversity and the scaffold diversity (Figure 6F) as the mean of the paired similarity of the structures (x-axis) and scaffolds (y-axis), based on the similarity of each pair of compounds in the dataset using the ECFP4 fingerprint and the Tanimoto coefficient, where the higher values confirm a higher structural or scaffold diversity of the dataset. These results showed that Nat-UV DB has a higher structural and scaffold diversity than the approved drugs. However, it has a lower structural and scaffold diversity than UNIIQUIM and LANaPDB. Finally, Nat-UV DB shows a higher structural diversity than BIOFACQUIM, but a lower scaffold diversity.
(A) Nat-UV DB; (B) Nat-UV DB and DrugBank (approved drugs); (C) Nat-UV DB and LANaPDB; (D) Nat-UV DB and BIOFACQUIM; (E) Nat-UV DB and UNIIQUIM; and (F) Consensus diversity plot of Nat-UV DB and the four reference datasets.
Nat-UV DB is a compound database of natural products from the state of Veracruz in Mexico, a coastal zone reported to have a large biodiversity. The open-access database contains 227 compounds reported from 1970 to June 2024 and is available at https://github.com/EdgL2/Nat-UV-DB. For each compound, the database contains the bibliographic source, information about the species from which it was obtained, and cross-referenced bioactivity data. The chemoinformatic characterization and the analysis of the coverage and diversity of Nat-UV DB in chemical space suggest broad coverage, overlapping with regions of the chemical space of approved drugs. The analysis also indicated that Nat-UV DB contains compounds that are unique with respect to other Mexican and Latin American natural products databases. The main perspectives of this work are to use Nat-UV DB to identify active compounds using virtual screening methods and to continue expanding Nat-UV DB with new natural products identified in the state of Veracruz, Mexico.
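The scaffold fractions quoted in this section follow from simple arithmetic on the reported counts, which the short Python sketch below reproduces. The counts are taken from the text; the labels and the helper function name are illustrative.

```python
TOTAL_SCAFFOLDS = 112  # scaffolds identified in Nat-UV DB, as reported in the text

# Scaffold counts reported in the text
counts = {
    "unique to Nat-UV DB": 52,
    "shared with BIOFACQUIM": 32,
    "shared with UNIIQUIM": 17,
    "shared with LANaPDB": 53,
    "shared with approved drugs": 24,
    "shared with at least one reference": 60,
}


def fraction_percent(count, total=TOTAL_SCAFFOLDS):
    """Express a scaffold count as a percentage of the total."""
    return 100.0 * count / total


for label, count in counts.items():
    print(f"{label}: {count}/{TOTAL_SCAFFOLDS} = {fraction_percent(count):.1f}%")

# The unique and previously reported scaffolds together account for all scaffolds
assert (counts["unique to Nat-UV DB"]
        + counts["shared with at least one reference"]) == TOTAL_SCAFFOLDS
```

The per-database shared counts add up to more than 60 because a single scaffold can be shared with several reference datasets at once.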
Zenodo: Nat-UV DB Data Availability. The datasets used in this work. https://doi.org/10.5281/zenodo.14715820.46
This project contains the following underlying data:
• FinalDB_ForPaper_DB_cured.csv: The Nat-UV compounds and approved drugs datasets.
• Most_Frecuent_Scaffolds_NatUVDB.xlsx: The most frequent scaffolds contained in Nat-UV DB.
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).
Source code available from: https://github.com/EdgL2/Nat-UV-DB.47
Archived software available from: https://doi.org/10.5281/zenodo.14715820.47
The codes and workflow are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).
ELL and AMHS thank the Consejo Nacional de Humanidades, Ciencias y Tecnología (CONAHCYT) for the scholarships 762342 (No. CVU: 894234) and 4011825 (CVU: 1322038), respectively.
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Computational medicinal chemistry
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Yes
Competing Interests: No competing interests were disclosed.
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: My research lines focus on generating knowledge in the medicinal chemical study of synthetic molecules and natural products and Theoretical chemistry.
Alongside their report, reviewers assign a status to the article:
Version | Invited Reviewer 1 | Invited Reviewer 2 | Invited Reviewer 3 |
---|---|---|---|
Version 2 (revision), 24 Apr 25 | read | | |
Version 1, 04 Feb 25 | read | read | read |