Data Article

Compound data sets and software tools for chemoinformatics and medicinal chemistry applications: update and data transfer

[version 1; peer review: 3 approved]
PUBLISHED 11 Mar 2014

This article is included in the Data: Use and Reuse collection.

Abstract

In 2012, we reported 30 compound data sets and/or programs developed in our laboratory in a data article and made them freely available to the scientific community to support chemoinformatics and computational medicinal chemistry applications. These data sets and computational tools were provided for download from our website. Since publication of this data article, we have generated 13 new data sets with which we further extend our collection of publicly available data and tools. Due to changes in web servers and website architectures, data accessibility has recently been limited at times. Therefore, we have also transferred our data sets and tools to a public repository to ensure full and stable accessibility. To aid in data selection, we have classified the data sets according to scientific subject areas. Herein, we describe new data sets, introduce the data organization scheme, summarize the database content and provide detailed access information in ZENODO (doi: 10.5281/zenodo.8451 and doi: 10.5281/zenodo.8455).

Introduction

The compound data sets reported in our original article [1] and the new data sets presented herein have resulted from research in chemoinformatics and medicinal chemistry and have mostly been generated from public domain repositories of compound structures and activity data. The software tools made publicly available were also developed in our laboratory [1]. Data sets reported in the scientific literature in the context of computational method development and evaluation are often not publicly available, which limits the reproducibility of computational investigations and comparisons of different computational methods. We believe that it is important to provide such data to the scientific community to further improve the transparency and credibility of computational studies and support method development. In addition to the data sets designed for the development and evaluation of computational methods, we also make available data sets that were generated as a resource and knowledge base for medicinal chemistry applications. Our data sets and tools are provided via the ZENODO platform (https://zenodo.org/) to ensure easy and stable access.

Materials and methods

The data sets reported herein were predominantly generated from ChEMBL [2,3], BindingDB [4] and PubChem [5] (a few exceptions are specified in the original data article [1]). Compound structures are represented as SMILES [6] strings or SD files [7]. Activity information and other (data set-dependent) annotations are provided in the individual data files. For software tools (written in different languages), the source code is also made available.
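As a rough illustration of how such files might be consumed, the sketch below reads a whitespace-delimited SMILES file using only the Python standard library. The column layout (SMILES, compound ID, potency), the `Record` type, and the sample records are assumptions made for illustration; the actual layout differs between data sets and should be checked in each file.

```python
import io
from typing import Iterator, NamedTuple


class Record(NamedTuple):
    smiles: str
    compound_id: str
    potency: float


def read_smiles_records(handle) -> Iterator[Record]:
    """Yield one Record per non-empty, non-comment line.

    Assumed (hypothetical) layout: SMILES <whitespace> ID <whitespace> potency.
    """
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        smiles, compound_id, potency = line.split()[:3]
        yield Record(smiles, compound_id, float(potency))


# Made-up sample content standing in for a downloaded file.
sample = io.StringIO(
    "# smiles id pKi\n"
    "CCO cpd_1 5.2\n"
    "c1ccccc1O cpd_2 6.8\n"
)
records = list(read_smiles_records(sample))
print(len(records), records[1].compound_id)
```

For SD files, a dedicated chemistry toolkit is usually the more practical choice, since the format carries connection tables in addition to annotations.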

Data description

Table 1 provides the updated list and classification of all freely available data sets and programs. Entries were organized according to the following scientific subject areas: data sets for structure-activity relationship (SAR) and structure-selectivity relationship (SSR) analysis, SAR visualization (SAR_VZ), and virtual screening via similarity searching or machine learning (VS_ML). In addition, the programs are provided separately (PROG). Data sets and programs are contained in separate ZENODO deposition sets with a unique reference. Three matched molecular pair (MMP)-based data sets also included in our update have recently been reported and described in detail [8]. Entries 1–30 in Table 1 represent the data sets and programs that we initially provided via our website [1] and entries 31–43 represent new data sets. In the following, the new data sets are described:

Table 1. Data sets and programs.

Entry | Year | Subject area index label | Description
1 [9] | 2007 | VS_ML_1 | 9 activity classes (AC) with increasing structural diversity
2 [9] | 2007 | VS_ML_2 | ~1.44 million ZINC compounds used for various virtual screening trials
3 [10] | 2007 | PROG_1 | Molecular similarity histogram filtering
4 [11] | 2007 | SSR_1 | 4 SD files with 26 selectivity sets; compounds are annotated with selectivity values for different targets
5 [12] | 2008 | SSR_2 | 7 compound selectivity sets containing 267 biogenic amine GPCR antagonists
6 [13] | 2008 | SSR_3 | 18 selectivity sets for targets from 4 families
7 [14] | 2008 | VS_ML_3 | 25 sets of compounds of increasing complexity and size
8 [15] | 2009 | VS_ML_4 | 242 hERG inhibitors
9 [16] | 2009 | SSR_4 | 243 ionotropic glutamate ion channel antagonists
10 [17] | 2009 | PROG_2 | Combinatorial analog graph (CAG) program with a sample set consisting of 51 thrombin inhibitors
11 [18] | 2009 | VS_ML_5 | 20 AC from the literature and 15 AC from the Molecular Drug Data Report
12 [19] | 2010 | VS_ML_6 | 8 AC
13 [20] | 2010 | PROG_3 | Program to generate target selectivity patterns of scaffolds
14 [21] | 2010 | PROG_4 | Multi-target CAGs (see also entry 10) with a sample set containing 33 kinase inhibitors
15 [22] | 2010 | PROG_5 | SARANEA
16 [23] | 2010 | PROG_6 | 3D activity landscape program with a sample set containing 248 cathepsin S inhibitors
17 [24] | 2010 | SAR_1 | 2 sets of MMPs from BindingDB and ChEMBL
18 [25] | 2010 | PROG_7 | Similarity-potency tree (SPT) program with a sample set containing 874 factor Xa inhibitors
19 [26] | 2010 | VS_ML_7 | 17 target-directed compound sets; each set contains a minimum of 10 distinct scaffolds and each scaffold represents 5 compounds
20 [27] | 2011 | SAR_VZ | 10,489 malaria screening hits
21 [28] | 2011 | SAR_2 | 458 target-based sets with scaffolds and scaffold hierarchies
22 [29] | 2011 | SAR_VZ | 4 sets of compounds active against 3 or 4 targets
23 [30] | 2011 | SAR_VZ | 881 factor Xa inhibitors
24 [31] | 2011 | VS_ML_8 | 50 AC prioritized for similarity searching
25 [32] | 2011 | VS_ML_9 | 25 data sets from successful ligand-based virtual screening applications
26 [33] | 2011 | SAR_3 | 26 conserved scaffolds in activity profile sequences of length 4
27 [34] | 2011 | PROG_8 | Scaffold distance function
28 [35] | 2011 | SAR_4 | 2 sets of compounds with multiple Ki or IC50 measurements against the same targets that differed within 1 order of magnitude
29 [36] | 2012 | SAR_VZ | 4 AC
30 [37] | 2012 | SAR_5 | 5 sets of different types of activity cliffs
31 [38] | 2012 | VS_ML_10 | 50 AC for scaffold hopping analysis
32 [39] | 2012 | SAR_6 | 61 AC consisting of SAR transfer series with regular potency progression
33 [40] | 2013 | SAR_7 | 4 activity measurement type-dependent sets of scaffolds
34 [41] | 2013 | VS_ML_11 | 2 multi-target compound sets
35 [42] | 2013 | VS_ML_12 | 4 multi-target compound sets and 3 multi-mechanism sets
36 [43] | 2013 | SAR_8 | 2337 compound series matrices
37 [44] | 2013 | SAR_9 | 128 AC containing ≥100 compounds with Ki values
38 [45] | 2014 | SAR_10 | 30,452 and 45,607 target-based MMS with Ki and IC50 values, respectively
39 [46] | 2014 | SAR_11 | 221 drug-unique scaffolds
40 [47] | 2014 | SAR_12 | 92,734 MMPs based upon retrosynthetic rules for 435 AC
41 [8] | 2014 | SAR_13 | 20,073 and 25,297 MMP-based activity cliffs with Ki and IC50 values, respectively
42 [8] | 2014 | SAR_14 | 4 activity measurement type-dependent sets of SAR transfer series with approximate or regular potency progression
43 [8] | 2014 | SAR_15 | 169,889 and 240,322 transformation size-restricted MMPs based upon retrosynthetic rules with Ki and IC50 values, respectively

Data entries are organized according to scientific subject areas: structure-activity relationship (SAR) and structure-selectivity relationship (SSR) analysis, SAR visualization (SAR_VZ), virtual screening via similarity searching or machine learning (VS_ML), and programs (PROG). References in the Entry column provide the original publication introducing the program and/or data set. Program entries are described in more detail in Table 2 of our original data article [1]. The new compound data sets 31–43 are discussed in the text. Programs and data sets reported herein have been separately deposited in ZENODO for access and download.
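When selecting entries programmatically, the subject area labels can be split into an area prefix and a running index. The following is a minimal sketch; the function name `parse_label` and the handling of labels without an index are assumptions for illustration, not part of the deposited tools.

```python
import re
from collections import Counter

# Area prefixes named in the text; SAR_VZ is listed before SAR for readability.
LABEL_PATTERN = re.compile(r"^(SAR_VZ|VS_ML|SAR|SSR|PROG)(?:_(\d+))?$")


def parse_label(label):
    """Split a label such as 'VS_ML_10' into ('VS_ML', 10).

    Labels without a running index give None as the second element.
    """
    match = LABEL_PATTERN.match(label)
    if match is None:
        raise ValueError(f"unrecognized label: {label!r}")
    area, index = match.groups()
    return area, int(index) if index is not None else None


# A few labels of the new entries, grouped by subject area.
labels = ["VS_ML_10", "SAR_6", "SAR_7", "VS_ML_11", "PROG_8", "SSR_4"]
counts = Counter(parse_label(label)[0] for label in labels)
print(counts)
```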

Entry 31

50 compound activity classes (AC) are prioritized for the evaluation of scaffold hopping potential in ligand-based virtual screening [38]. These AC contain the largest proportion of scaffold pairs with the largest chemical inter-scaffold distances [38] that can be derived from current bioactive compounds and hence present challenging test cases for scaffold hopping analysis.

Entry 32

596 SAR transfer series with regular potency progression (SAR-TS-RP) are extracted from 61 AC [39]. Each SAR-TS-RP represents two compound series with different core structures and pairwise corresponding substitutions that yield comparable potency progression against a given target. These series provide a knowledge base for the analysis and prediction of SAR transfer events.

Entry 33

Four sets of molecular scaffolds (with each scaffold representing more than ten compounds) are provided that are active against a single target (ST), multiple targets from the same family (SF), or multiple targets from different families (MF) [40]. Data sets are separately assembled for different types of potency measurements (i.e., Ki and IC50 values) and provide a resource of scaffolds representing compounds with varying degrees of target promiscuity.
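The ST/SF/MF distinction can be expressed as a small rule over the targets annotated for a scaffold. The sketch below is only an illustration of that rule; the function `classify_scaffold` and the target-to-family mapping are hypothetical and do not reproduce the procedure used to build the deposited sets.

```python
def classify_scaffold(targets, family_of):
    """Return 'ST', 'SF' or 'MF' for the targets a scaffold is active against.

    targets   : iterable of target names
    family_of : dict mapping target name -> family name (hypothetical annotation)
    """
    targets = set(targets)
    if not targets:
        raise ValueError("scaffold has no annotated targets")
    if len(targets) == 1:
        return "ST"  # single target
    families = {family_of[t] for t in targets}
    # One shared family -> SF, more than one family -> MF.
    return "SF" if len(families) == 1 else "MF"


# Illustrative annotation only.
family_of = {
    "thrombin": "protease",
    "factor Xa": "protease",
    "hERG": "ion channel",
}
print(classify_scaffold(["thrombin"], family_of))
print(classify_scaffold(["thrombin", "factor Xa"], family_of))
print(classify_scaffold(["thrombin", "hERG"], family_of))
```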

Entry 34

Two multi-target compound data sets consist of confirmed screening hits [41]. Each set contains compounds with single-, dual-, and triple-target activity, or no activity. These data provide test cases for machine learning or other approaches to differentiate between compounds with overlapping yet distinct activity profiles.

Entry 35

Four multi-target compound data sets are provided [42]. Each set contains compounds tested in three different assays. Compounds are organized into eight different subsets according to their activity profiles, i.e., single-, dual-, and triple-target activity, or no activity. In addition, three multi-mechanism compound sets are designed [42]. In the latter case, compounds are organized into four subsets according to their mechanism of action. These data sets also represent test cases for machine learning to distinguish compounds with different activity profiles or mechanisms.
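The count of eight subsets follows from three assays with two outcomes each (2^3 = 8): three single-target, three dual-target, one triple-target, and one inactive profile. A short sketch that enumerates them is given below; the assay names are placeholders.

```python
from itertools import product

ASSAYS = ("A", "B", "C")  # placeholder names for the three assays


def profile_name(profile):
    """Name a tuple of active/inactive flags by the number of active assays."""
    n_active = sum(profile)
    return {0: "inactive", 1: "single", 2: "dual", 3: "triple"}[n_active]


# All combinations of inactive (False) / active (True) over the three assays.
profiles = list(product((False, True), repeat=len(ASSAYS)))
by_type = {}
for profile in profiles:
    by_type.setdefault(profile_name(profile), []).append(profile)

for name, members in by_type.items():
    print(name, len(members))
```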

Entry 36

2337 non-redundant compound series matrices (CSMs) are generated covering compounds active against a wide spectrum of targets [43]. Each matrix contains at least two analogous matching molecular series (MMS) with structurally related yet distinct cores. A matrix consists of known active compounds and structurally related virtual compounds and hence provides suggestions for compound design.

Entry 37

128 target-based data sets are assembled that consist of at least 100 compounds with precisely specified equilibrium constants (Ki values) below 1 µM for human targets [44]. These high-confidence activity data sets provide a sound basis for SAR exploration.
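Selection criteria of this kind can be sketched as a filter over activity records. In the example below, the record layout, the interpretation of "precisely specified" as an exact "=" relation, and the function name `select_target_sets` are assumptions for illustration; the published protocol [44] should be consulted for the actual procedure.

```python
from collections import defaultdict

KI_CUTOFF_NM = 1000.0  # 1 µM expressed in nM
MIN_COMPOUNDS = 100


def select_target_sets(measurements, min_compounds=MIN_COMPOUNDS):
    """Group qualifying compounds by target and keep sufficiently large sets.

    measurements: iterable of (target, compound_id, relation, ki_nm) tuples.
    """
    per_target = defaultdict(set)
    for target, compound_id, relation, ki_nm in measurements:
        # Assumption: "precisely specified" means an exact '=' relation.
        if relation == "=" and ki_nm < KI_CUTOFF_NM:
            per_target[target].add(compound_id)
    return {t: c for t, c in per_target.items() if len(c) >= min_compounds}


# Toy records; the size threshold is lowered to 2 so the example stays small.
toy = [
    ("T1", "c1", "=", 50.0),
    ("T1", "c2", "=", 900.0),
    ("T1", "c3", ">", 10.0),
    ("T2", "c4", "=", 5000.0),
    ("T2", "c5", "=", 20.0),
]
selected = select_target_sets(toy, min_compounds=2)
print(sorted(selected))
```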

Entry 38

30,452 and 45,607 target-based MMS with Ki and IC50 values, respectively, are extracted from bioactive compounds [45].

Entry 39

221 scaffolds are identified that occur only in approved drugs and are not found in currently available bioactive compounds [46]. Accordingly, these scaffolds have been termed drug-unique scaffolds.

Entry 40

92,734 MMPs are generated from 435 AC on the basis of retrosynthetic rules [47]. These MMPs consider chemical reaction information and should be useful for practical medicinal chemistry applications.

Entry 41

20,073 and 25,297 MMP-based activity cliffs (i.e., pairs of structurally analogous compounds with at least a 100-fold difference in potency) are extracted from specifically active compounds based upon Ki and IC50 values, respectively [8]. The MMP-based activity cliffs provide a large knowledge base for SAR analysis.
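The potency criterion for an activity cliff can be written down directly: a 100-fold difference in Ki corresponds to a difference of 2 units on the logarithmic (pKi) scale. The sketch below covers only this potency criterion; the structural criterion (formation of an MMP) is not implemented, and the function names are hypothetical.

```python
import math


def fold_difference(potency_a, potency_b):
    """Ratio of the larger to the smaller value (same units, both > 0)."""
    if potency_a <= 0 or potency_b <= 0:
        raise ValueError("potency values must be positive")
    high, low = max(potency_a, potency_b), min(potency_a, potency_b)
    return high / low


def is_activity_cliff(ki_a_nm, ki_b_nm, min_fold=100.0):
    """Check the potency criterion only; structural analogy is assumed."""
    return fold_difference(ki_a_nm, ki_b_nm) >= min_fold


def delta_log_potency(ki_a_nm, ki_b_nm):
    """Absolute potency difference in log10 units."""
    return abs(math.log10(ki_a_nm) - math.log10(ki_b_nm))


print(is_activity_cliff(5.0, 500.0))   # 100-fold difference
print(is_activity_cliff(5.0, 400.0))   # 80-fold difference
print(delta_log_potency(1.0, 100.0))
```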

Entry 42

157 and 513 MMP-based SAR transfer series with approximate potency progression plus 60 and 322 SAR transfer series with regular potency progression based upon Ki and IC50 values, respectively, are isolated from bioactive compounds. These transfer series are active against individual targets [8]. Similar to MMP-based activity cliffs, SAR transfer series provide a resource for SAR analysis and compound design.

Entry 43

169,889 and 240,322 transformation size-restricted MMPs based upon retrosynthetic rules with Ki and IC50 values, respectively, are systematically extracted from available AC [8]. Unlike the retrosynthetic rule-based MMPs presented above (entry 40), these MMPs are subject to transformation size restrictions, which ensure that the chemical changes distinguishing the compounds in a pair are small.

Summary

Herein we have provided an updated release of data sets and programs for chemoinformatics and medicinal chemistry that we make freely available. In total, 13 new data sets are introduced. Transferring all data entries in an organized form to the ZENODO platform makes them easily accessible. We hope that our current release might be of interest and helpful to many investigators in academia and the pharmaceutical industry.

Data availability

ZENODO: Programs for chemoinformatics and computational medicinal chemistry, doi: 10.5281/zenodo.8451 [48].

ZENODO: Data sets for chemoinformatics and computational medicinal chemistry, doi: 10.5281/zenodo.8455 [49].
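For scripted access, the two DOIs can be turned into resolvable links through the public doi.org resolver. The snippet below only builds the URLs and performs no download; the dictionary keys and the helper name `doi_url` are illustrative.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"

# DOIs as stated in this article.
DEPOSITIONS = {
    "programs": "10.5281/zenodo.8451",
    "data sets": "10.5281/zenodo.8455",
}


def doi_url(doi):
    """Return the resolver URL for a DOI, percent-encoding unsafe characters."""
    return DOI_RESOLVER + quote(doi, safe="/")


for name, doi in DEPOSITIONS.items():
    print(name, doi_url(doi))
```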

Comments on this article (1)

Reviewer Response, 17 Apr 2014
Chris J Swain, Cambridge Med Chem Consulting, Cambridge, UK
Such collections of data sets are absolutely invaluable for testing existing algorithms and for developing new ones.
Competing Interests: None
how to cite this article
Hu Y and Bajorath J. Compound data sets and software tools for chemoinformatics and medicinal chemistry applications: update and data transfer [version 1; peer review: 3 approved]. F1000Research 2014, 3:69 (https://doi.org/10.12688/f1000research.3713.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Key to Reviewer Statuses
Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.
Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.
Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.
Reviewer Report 22 Apr 2014
Patrick Walters, Vertex Pharmaceuticals Incorporated, Cambridge, MA, USA 
Approved
The ability to compare multiple computational methods across a series of consistent, high-quality datasets is critical to the progress of computational chemistry and cheminformatics. In the past, each paper published in the field seemed to present yet another new dataset. ...
HOW TO CITE THIS REPORT
Walters P. Reviewer Report For: Compound data sets and software tools for chemoinformatics and medicinal chemistry applications: update and data transfer [version 1; peer review: 3 approved]. F1000Research 2014, 3:69 (https://doi.org/10.5256/f1000research.3979.r4077)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Reviewer Report 17 Apr 2014
Chris J. Swain, Cambridge Med Chem Consulting, Cambridge, UK 
Approved
Building and testing novel computer models requires access to suitable datasets. The authors have compiled a very useful set of interesting datasets and made them readily available in standard formats (SMILES and SDF). This allows others to both test existing ...
HOW TO CITE THIS REPORT
Swain CJ. Reviewer Report For: Compound data sets and software tools for chemoinformatics and medicinal chemistry applications: update and data transfer [version 1; peer review: 3 approved]. F1000Research 2014, 3:69 (https://doi.org/10.5256/f1000research.3979.r4409)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Reviewer Report 13 Mar 2014
Ajay Jain, HDF Comprehensive Cancer Center, University of California San Francisco, San Francisco, CA, USA 
Approved
Hu and Bajorath offer an update to their resource for computational chemistry. The curated data, and its engineered availability, will be of great interest, especially to methods developers. Even those researchers that are interested in exploring larger data sets that ...
HOW TO CITE THIS REPORT
Jain A. Reviewer Report For: Compound data sets and software tools for chemoinformatics and medicinal chemistry applications: update and data transfer [version 1; peer review: 3 approved]. F1000Research 2014, 3:69 (https://doi.org/10.5256/f1000research.3979.r4079)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
