Data Article

Matched molecular pair-based data sets for computer-aided medicinal chemistry

[version 2; peer review: 4 approved]
PUBLISHED 21 Feb 2014

Abstract

Matched molecular pairs (MMPs) are widely used in medicinal chemistry to study changes in compound properties including biological activity, which are associated with well-defined structural modifications. Herein we describe up-to-date versions of three MMP-based data sets that have originated from in-house research projects. These data sets include activity cliffs, structure-activity relationship (SAR) transfer series, and second generation MMPs based upon retrosynthetic rules. The data sets have in common that they have been derived from compounds included in the ChEMBL database (release 17) for which high-confidence activity data are available. Thus, the activity data associated with MMP-based activity cliffs, SAR transfer series, and retrosynthetic MMPs cover the entire spectrum of current pharmaceutical targets. Our data sets are made freely available to the scientific community.

Amendments from Version 1

The version of the ChEMBL database used in our analysis and more technical information concerning calculations and toolkits have been provided. Standard MMPs and RECAP-MMPs have been compared in more detail and MMP-cliffs with a potency difference of at least one order of magnitude have also been determined. Table 2 has been updated and a new Figure 4 has been added. The data sets have been updated. In the files of MMP-cliffs and RECAP-MMPs, a column “NumOfCuts” has been added indicating the origin of chemical transformations. Target names have been added in all files. Compound activities have been incorporated in files of RECAP-MMPs. For transfer series, substituted fragments have been provided.

 


Introduction

The matched molecular pair (MMP) concept is widely applied in medicinal chemistry1–4. An MMP is defined as a pair of compounds that are only distinguished by a structural modification at a single site1, i.e., the exchange of a substructure, termed a chemical transformation5. MMPs are attractive tools for computational analysis because they can be algorithmically generated and they make it possible to associate defined structural modifications at the level of compound pairs with chemical property changes, including biological activity2–4. MMPs are usually chemically intuitive and easily accessible, which helps to bridge the gap between computational analysis and the practice of medicinal chemistry.

In the context of different studies, we have systematically generated MMPs through the mining of publicly available compound activity data. All possible MMPs have been derived from compounds active against currently available pharmaceutical targets. These MMPs have then been used to explore structure-activity relationships (SARs) on a large scale and from different viewpoints.

In a previous data article, we have reported and made publicly available a number of different data sets and computational tools developed in our laboratory6. Here we describe three recently developed MMP-based data structures, which should be of interest for SAR analysis and compound design, and we also provide up-to-date versions of the corresponding data sets. It is anticipated that these data sets will be helpful as a resource for computer-aided medicinal chemistry applications. The data sets include MMP-based activity cliffs (i.e., MMP-cliffs), SAR transfer series, and MMPs derived on the basis of retrosynthetic fragmentation rules and were derived from all bioactive compounds currently available in the ChEMBL database (release 17)7,8. Only high-confidence activity data (as specified below) were considered. MMP-cliffs, SAR transfer series, and retrosynthetic MMPs provide comprehensive sources of SAR information. In addition, retrosynthetic MMPs are thought to increase the utility of computational MMP analysis for practical chemistry efforts because these second generation MMPs consider reaction information during molecular fragmentation, which sets them apart from standard MMPs originating from systematic fragmentation of all possible exocyclic single bonds in a molecule (as detailed below).

Materials and methods

Concepts

(1) Activity cliffs are generally defined as pairs or groups of compounds that are structurally similar and have large differences in potency9–11. Accordingly, activity cliffs usually have high SAR information content (because small chemical changes in similar or analogous compounds lead to large potency effects). The assessment of activity cliffs requires clearly defined similarity and potency difference criteria9–11. The formation of an MMP can be considered as a similarity criterion, which is similarity metric-free and often chemically more intuitive than the use of calculated molecular similarity11,12. MMP formation as a similarity criterion has led to the introduction of MMP-cliffs12. For MMP-cliffs, a difference in potency of at least two orders of magnitude between cliff-forming compounds was set as a potency difference criterion12. Figure 1 shows exemplary MMP-cliffs.


Figure 1. MMP-cliffs.

Six representative MMP-cliffs for three targets belonging to different target families are shown; (a) muscarinic acetylcholine receptor M3, (b) serine/threonine-protein kinase c-TAK1, (c) matrix metalloproteinase-2. The pKi value of each compound is provided and the structural differences between cliff-forming compounds are highlighted in red.

(2) SAR transfer can be rationalized in different ways. For example, a compound series might display similar potency progression against two different targets13. Alternatively, two different compound series with corresponding analogs, i.e., series having different core structures and containing compounds with pairwise corresponding substitutions, might display similar potency progression against a given target14. Such SAR transfer series displaying similar target-specific SAR behavior are often sought after in medicinal chemistry as alternative compounds for optimization. Here we focus on these target-based SAR transfer series. Figure 2 shows an example.


Figure 2. SAR transfer series.

An exemplary target-based SAR transfer series is shown. Compound pairs are arranged in the order of increasing potency (from the bottom to the top). Potency progression is monitored by corresponding pairs of color-coded dots using a continuous color spectrum from green (lowest potency value (pKi = 5.7) in the compound data set), over yellow to red (highest potency value; pKi = 9.0). The pKi value of each compound is provided. The core structures are drawn in black and the substituents in red. The compounds are active against serine/threonine-protein kinase D2.

(3) Computational generation of MMPs typically involves molecular fragmentation through the systematic deletion of exocyclic single bonds5. The resulting fragments, which represent a molecular core and a substituent, are therefore not derived by considering chemical reactions. Accordingly, a transformation relating MMP-forming compounds to each other might not necessarily be interpretable from a synthetic perspective. The synthetic accessibility of MMPs might be further improved by considering reaction information during molecular fragmentation. This has been accomplished by applying the well-known retrosynthetic combinatorial analysis procedure (RECAP) rules15, leading to the introduction of RECAP-MMPs16. Representative examples are shown in Figure 3. In addition, exemplary differences between standard MMPs and RECAP-MMPs are illustrated in Figure 4.
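To illustrate how MMPs can be identified once compounds have been fragmented, the following minimal sketch groups compounds by a shared core. It assumes that single-cut fragmentations are already available as (core, substituent) strings; the function name find_mmps, the compound identifiers, and the toy fragment strings are hypothetical, and the sketch is not the in-house implementation used for the data sets.

```python
from collections import defaultdict
from itertools import combinations


def find_mmps(fragmentations):
    """Pair compounds that share a core but differ in the attached substituent.

    fragmentations: dict mapping a compound ID to an iterable of
    (core, substituent) string pairs (assumed to be precomputed).
    Returns a sorted list of (id_a, id_b, core, transformation) tuples.
    """
    # Index every fragmentation by its core (the invariant part).
    index = defaultdict(set)
    for compound_id, cuts in fragmentations.items():
        for core, substituent in cuts:
            index[core].add((compound_id, substituent))

    mmps = set()
    for core, members in index.items():
        # Any two different compounds under the same core form an MMP
        # if their substituents differ.
        for (id_a, sub_a), (id_b, sub_b) in combinations(sorted(members), 2):
            if id_a != id_b and sub_a != sub_b:
                mmps.add((id_a, id_b, core, f"{sub_a}>>{sub_b}"))
    return sorted(mmps)


# Toy input with hypothetical fragment strings.
toy = {
    "cpd1": [("c1ccccc1[*]", "[*]Cl")],
    "cpd2": [("c1ccccc1[*]", "[*]Br")],
    "cpd3": [("C1CCCCC1[*]", "[*]Cl")],
}
print(find_mmps(toy))
```

In this toy input, only cpd1 and cpd2 share a core, so a single pair is reported. Whether the cuts come from systematic deletion of exocyclic single bonds or from retrosynthetic rules only changes how the input fragmentations are produced, not the pairing step.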


Figure 3. RECAP-MMPs.

In (a)–(d), four exemplary RECAP-MMPs representing different retrosynthetic rules are shown. For each RECAP-MMP, the chemical transformation is highlighted in red.


Figure 4. Standard MMPs vs. RECAP-MMPs.

Two pairs of compounds that form both standard MMPs and RECAP-MMPs are shown. For each pair, the structural differences between compounds are highlighted. The chemical transformation associated with the standard MMP is colored in red, while the transformation of the RECAP-MMP corresponds to the combination of fragments colored in red and blue.

MMP generation

For the generation of MMP-cliffs, SAR transfer series, and RECAP-MMPs, transformation size restrictions that limit transformations to meaningful chemical substitutions were introduced12. Specifically, the common core structure had to be at least twice the size of each exchanged substructure. Furthermore, the difference in size of the exchanged fragments was limited to at most eight non-hydrogen atoms and the maximal size of an exchanged fragment was set to 13 non-hydrogen atoms12. Therefore, the largest permitted transformations included, for example, the addition of a substituted ring to a compound or the replacement of a five- or six-membered ring with a substituted condensed two-ring system (with a maximum of 13 atoms). All possible transformation size-restricted MMPs and RECAP-MMPs were calculated using an in-house implementation of the algorithm by Hussain and Rea5 that utilizes the OpenEye toolkit17.
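The three size restrictions stated above can be expressed as a simple check on non-hydrogen atom counts. The sketch below is illustrative only; the function name passes_size_restrictions and its parameters are hypothetical, and the atom counts are assumed to have been obtained beforehand from a cheminformatics toolkit.

```python
def passes_size_restrictions(core_atoms: int, frag_a_atoms: int, frag_b_atoms: int) -> bool:
    """Check the transformation size restrictions on non-hydrogen atom counts.

    core_atoms: size of the common core.
    frag_a_atoms, frag_b_atoms: sizes of the two exchanged fragments.
    """
    larger = max(frag_a_atoms, frag_b_atoms)
    # The core must be at least twice the size of each exchanged fragment.
    core_large_enough = core_atoms >= 2 * larger
    # The exchanged fragments may differ by at most eight atoms.
    size_difference_ok = abs(frag_a_atoms - frag_b_atoms) <= 8
    # No exchanged fragment may exceed 13 atoms.
    max_fragment_ok = larger <= 13
    return core_large_enough and size_difference_ok and max_fragment_ok


print(passes_size_restrictions(20, 1, 6))   # small substituent exchange on a large core
print(passes_size_restrictions(10, 6, 1))   # core too small relative to the larger fragment
print(passes_size_restrictions(30, 5, 14))  # exchanged fragment larger than 13 atoms
```

Only the first call satisfies all three criteria.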

Compounds and activity data

Compound data were taken from the latest version of ChEMBL (release 17)7,8. Only compounds with direct interactions (i.e., target relationship type “D”) against human targets at the highest confidence level (target confidence score 9) were selected. Two types of potency measurements were considered separately, i.e., Ki (equilibrium constant) and IC50 (half-maximal inhibitory concentration) values. In order to ensure high data confidence, inactive or inconclusive compounds and compounds with approximate measurements such as “>”, “<”, or “∼” were not considered. For compounds with multiple measurements against the same target, the geometric mean was calculated as the final potency annotation, provided that all values fell within one order of magnitude; otherwise, the compound was discarded. All qualifying compounds were further organized into target sets. A total of 661 and 1203 target sets (consisting of compounds with reported specific activity against a given target) were collected for the Ki- and IC50-based subsets, respectively, as reported in Table 1. The target sets contained a total of 45,353 and 95,685 compounds and 77,421 and 135,291 potency measurements for the Ki and IC50 subsets, respectively. These target sets provided the basis for the generation of all MMPs.
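The handling of multiple measurements described above can be sketched as follows. The function names aggregate_potency and to_p_value are hypothetical, values are assumed to be given in nM, and treating a ratio of exactly ten between the largest and smallest value as still "within one order of magnitude" is an assumption of this sketch.

```python
import math
from typing import Optional, Sequence


def aggregate_potency(values_nm: Sequence[float]) -> Optional[float]:
    """Return the geometric mean of repeated measurements (in nM), or None
    if the measurements span more than one order of magnitude."""
    if not values_nm or any(v <= 0 for v in values_nm):
        raise ValueError("at least one positive potency value is required")
    # Discard the compound if the values do not fall within one order of magnitude.
    if max(values_nm) / min(values_nm) > 10.0:
        return None
    # Geometric mean computed as the antilog of the mean log10 value.
    mean_log = sum(math.log10(v) for v in values_nm) / len(values_nm)
    return 10 ** mean_log


def to_p_value(value_nm: float) -> float:
    """Convert a potency in nM to its negative decadic logarithm (e.g. pKi)."""
    return 9.0 - math.log10(value_nm)


print(aggregate_potency([10.0, 40.0]))    # consistent measurements are averaged
print(aggregate_potency([10.0, 1000.0]))  # inconsistent measurements yield None
print(to_p_value(1.0))
```

Working on the logarithmic scale keeps the aggregation consistent with the pKi and pIC50 values used for potency comparisons elsewhere in the article.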

Table 1. Data sets.

Number of        Ki        IC50
Targets          661       1203
Compounds        45,353    95,685
Measurements     77,421    135,291

For the Ki and IC50 subsets from the latest version of ChEMBL (release 17), the total numbers of targets, compounds, and corresponding potency measurements are reported.

Results

As a follow-up on the original publications in which MMP-cliffs12, SAR transfer series14, and RECAP-MMPs16 were introduced, all corresponding data sets have been re-generated on the basis of ChEMBL release 17, hence providing up-to-date versions for release. Separate data subsets have been generated for different types of well-defined potency measurements (i.e., assay-dependent IC50 vs. assay-independent Ki values) to avoid inconsistencies due to simultaneous consideration of different potency measurements that cannot be directly compared.

MMP-cliffs

Figure 1 illustrates small chemical changes in compound pairs leading to large potency differences that are captured by MMP-cliffs. For ease of structural interpretation, we currently prefer MMP-based activity cliff representations over alternative representations that rely on calculated similarity values11. Table 2 provides the MMP-cliff statistics for the current data set. On the basis of Ki and IC50 measurements, more than 20,000 and 25,000 MMP-cliffs were obtained, respectively, when a potency difference of at least 100-fold between cliff-forming compounds was required. These MMP-cliffs corresponded to ~5% of all MMPs that were generated from ChEMBL compounds with high-confidence activity data. They covered 293 and 500 different targets on the basis of Ki and IC50 measurements, respectively. In addition to this more conservative potency difference cutoff, MMP-cliffs were also identified when a less stringent criterion was applied, i.e., two compounds forming an MMP were required to have a potency difference of at least one order of magnitude. In this case, as reported in Table 2, nearly 99,000 and more than 126,000 MMP-cliffs were detected for 392 and 726 targets in the Ki and IC50 subsets, respectively. The proportion of MMP-cliffs increased to approx. 25%.

Table 2. MMP and MMP-cliff statistics.

Number of                    Ki                IC50
MMPs                         385,777           537,848
Targets with MMPs            467               929
MMP compounds                40,454 (89.2%)    80,744 (84.4%)
∆Potency ≥ 1 OoM
  MMP-cliffs                 98,608            126,464
  % MMP-cliffs               25.6%             23.5%
  Targets with MMP-cliffs    392               726
  MMP-cliff compounds        29,976 (66.1%)    50,413 (52.7%)
∆Potency ≥ 2 OoM
  MMP-cliffs                 20,073            25,297
  % MMP-cliffs               5.2%              4.7%
  Targets with MMP-cliffs    293               500
  MMP-cliff compounds        11,760 (25.9%)    16,816 (17.6%)

For the Ki- and IC50-based compound subsets, the number of MMPs, the number of targets for which MMPs were obtained, and the number (and ratio) of compounds that formed MMPs are reported. In addition, the number and proportion of MMP-cliffs derived from all MMPs with potency difference (∆Potency) of at least one order (1 OoM) or two orders of magnitude (2 OoM) are provided, respectively, as well as the number of targets for which MMP-cliffs were obtained and the number (and ratio) of cliff-forming compounds.
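The potency difference criterion for MMP-cliffs and the proportions reported in Table 2 can be reproduced with a few lines of code. This is a minimal sketch under the assumption that potencies are expressed on a logarithmic scale (pKi or pIC50); the function names is_mmp_cliff and cliff_percentage are hypothetical.

```python
def is_mmp_cliff(p_a: float, p_b: float, min_log_difference: float = 2.0) -> bool:
    """Return True if two MMP-forming compounds differ in potency by at least
    the given number of orders of magnitude (p_a and p_b are pKi or pIC50)."""
    return abs(p_a - p_b) >= min_log_difference


def cliff_percentage(n_cliffs: int, n_mmps: int) -> float:
    """Proportion of MMPs that are MMP-cliffs, in percent with one decimal."""
    return round(100.0 * n_cliffs / n_mmps, 1)


# A potency difference of 3.3 log units qualifies under both cutoffs.
print(is_mmp_cliff(9.0, 5.7))
# A difference of 1.5 log units only qualifies under the one-order cutoff.
print(is_mmp_cliff(7.5, 6.0), is_mmp_cliff(7.5, 6.0, min_log_difference=1.0))

# Proportions recomputed from the MMP and MMP-cliff counts in Table 2.
print(cliff_percentage(20073, 385777), cliff_percentage(25297, 537848))
print(cliff_percentage(98608, 385777), cliff_percentage(126464, 537848))
```

The recomputed proportions agree with the percentages listed in Table 2 for both cutoffs.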

SAR transfer series

SAR transfer series are best rationalised as pairs of compound series that are active against the same target, have distinct core structures, and consist of corresponding pairs of analogs, as illustrated in Figure 2 for a small series with three pairs. In contrast to the original analysis of target-based SAR transfer14, which was based upon MMPs without transformation size restrictions, the current analysis has been carried out on the basis of size-restricted MMPs. This modification further supports SAR exploration (because only small chemical changes are considered) and explains a reduction in series numbers compared to the original publication. In Table 3, the numbers of different series available for the current data set are reported. A total of 1270 and 2109 matching series were obtained from the Ki and IC50 subsets, respectively. Matching series met the structural requirement of consisting of at least three pairs of corresponding analogs. In addition, the potency values of compounds associated with individual series had to span at least two orders of magnitude. From these pre-selected matching series, 157 (Ki) and 513 (IC50) SAR transfer series with at least approximate potency progression and activity against 42 and 54 targets, respectively, were obtained. A subset of 60 (Ki) and 322 (IC50) SAR transfer series displayed strictly corresponding (regular) potency progression (often over different potency ranges)14. These series were active against 23 (Ki) and 27 (IC50) different targets. The size of SAR transfer series with approximate and regular potency progression ranged from three to 12 corresponding pairs of analogs. On average, the SAR transfer series consisted of three to four pairs.

Table 3. Target-based SAR transfer series statistics.

Number of                    Ki      IC50
Matching series              1270    2109
T_SAR-TS                     157     513
Targets with T_SAR-TS        42      54
T_SAR-TS-RP                  60      322
Targets with T_SAR-TS-RP     23      27

For the Ki and IC50 subsets, the number of qualifying matching compound series is reported. In addition, the number of target-based SAR transfer series with at least approximate potency progression (T_SAR-TS), the subset of SAR transfer series with regular potency progression (T_SAR-TS-RP), and the corresponding numbers of targets are given.
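The selection criteria for matching series and regular potency progression can be sketched as shown below. Several points are assumptions of this sketch rather than details taken from the article: each pair of corresponding analogs is represented as a tuple of logarithmic potencies (series A, series B), the two-order span is required for each of the two series, and "regular" progression is interpreted as an identical potency rank order in both series. The criterion for approximate progression is not specified here and is therefore not modelled. The function names are hypothetical.

```python
from typing import List, Tuple

PotencyPairs = List[Tuple[float, float]]  # (pKi in series A, pKi in series B)


def is_matching_series(pairs: PotencyPairs, min_pairs: int = 3, min_span: float = 2.0) -> bool:
    """At least three corresponding analog pairs, and each series spans
    at least two orders of magnitude in potency (assumed interpretation)."""
    if len(pairs) < min_pairs:
        return False
    for idx in (0, 1):
        values = [pair[idx] for pair in pairs]
        if max(values) - min(values) < min_span:
            return False
    return True


def has_regular_progression(pairs: PotencyPairs) -> bool:
    """Both series rank their corresponding analogs in the same potency order
    (assumed interpretation of regular potency progression)."""
    order_a = sorted(range(len(pairs)), key=lambda i: pairs[i][0])
    order_b = sorted(range(len(pairs)), key=lambda i: pairs[i][1])
    return order_a == order_b


series = [(5.7, 6.0), (7.1, 7.5), (9.0, 8.4)]
print(is_matching_series(series), has_regular_progression(series))

shuffled = [(5.0, 8.0), (7.5, 6.0), (9.0, 7.0)]
print(is_matching_series(shuffled), has_regular_progression(shuffled))
```

The first example meets both criteria, although the two series cover different potency ranges; the second is a matching series whose potency order differs between the two cores.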

RECAP-MMPs

The replacement of systematic fragmentation of exocyclic single bonds with a set of 13 retrosynthetic rules for MMP generation reduced the number of MMPs that were obtained by more than half, as can be seen by comparing the RECAP-MMP numbers reported in Table 4 with the MMP numbers in Table 2. Nevertheless, (perhaps surprisingly) large numbers of RECAP-MMPs remained for further consideration and assessment of synthetic feasibility. From the Ki and IC50 subsets, nearly 170,000 and more than 240,000 RECAP-MMPs were obtained with activity against 371 and 778 targets, respectively. Examples are shown in Figure 3.

Table 4. RECAP-MMP statistics.

Number of                  Ki                IC50
RECAP-MMPs                 169,889           240,322
Targets with RECAP-MMPs    371               778
RECAP-MMP compounds        28,529 (62.9%)    53,917 (56.3%)

For the Ki and IC50 subsets, the number of RECAP-MMPs, the number of targets for which RECAP-MMPs were obtained, and the number (and ratio) of compounds that formed RECAP-MMPs are reported.

Data availability

All MMP-cliffs, SAR transfer series, and RECAP-MMPs are provided in canonical SMILES representation18 on a per-target basis separately for the Ki and IC50 subsets. The canonical SMILES representation of compounds was calculated using the Molecular Operating Environment19 on the basis of standardized molecular structures by removing solvents or ions and rebalancing protonation states. Furthermore, the canonical SMILES representation of key fragments (cores) and chemical transformations derived from MMPs and RECAP-MMPs was generated using the OpenEye toolkit17.

ZENODO: Detailed data sets of MMP-cliffs, SAR transfer series, RECAP-MMPs and compound activities, doi: 10.5281/zenodo.8418 (reference 20).

Summary

We have described new and up-to-date MMP-based data sets comprising activity cliffs, SAR transfer series, and second generation retrosynthetic MMPs that have been systematically generated from currently available public domain compounds with high-confidence activity data. Hence, these data sets are comprehensive and have broad target coverage. They are made available without restrictions to the scientific community to aid in SAR analysis, compound design, and other medicinal chemistry applications. It is hoped that these data sets might be of interest and useful to many investigators in this field and catalyse further research efforts.

Comments on this article (2)

Version 2 (revised): published 21 Feb 2014
Version 1: published 04 Feb 2014
  • Author Response (F1000Research Advisory Board Member) 20 Feb 2014
    Jürgen Bajorath, Department of Life Science Informatics,B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Bonn, D-53113, Germany
    Author response to Review by Shana Posy

    After completing our revision, an additional review has been obtained which we address as follows:
    1. In the files of MMP-cliffs and RECAP-MMPs, a column "NumOfCuts"
    ... Continue reading
  • Author Response (F1000Research Advisory Board Member) 18 Feb 2014
    Jürgen Bajorath, Department of Life Science Informatics,B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Bonn, D-53113, Germany
    The article has been revised as follows in response to the reviewer comments:

    The ChEMBL version number and more technical information concerning toolkits and SMILES calculations have been added. In addition, ... Continue reading
how to cite this article
Hu Y, de la Vega de León A, Zhang B and Bajorath J. Matched molecular pair-based data sets for computer-aided medicinal chemistry [version 2; peer review: 4 approved]. F1000Research 2014, 3:36 (https://doi.org/10.12688/f1000research.3-36.v2)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.
Open Peer Review

Version 2 (revised): published 21 Feb 2014
Reviewer Report 24 Feb 2014
Patrick Walters, Vertex Pharmaceuticals Inc., Cambridge, MA, USA 
Approved
This revised ...
Walters P. Reviewer Report For: Matched molecular pair-based data sets for computer-aided medicinal chemistry [version 2; peer review: 4 approved]. F1000Research 2014, 3:36 (https://doi.org/10.5256/f1000research.3897.r3853)
Version 1: published 04 Feb 2014
Reviewer Report 20 Feb 2014
Shana Posy, Research and Development, Bristol-Myers Squibb, Princeton, NJ, USA 
Approved
Hu et al. have compiled a useful set of matched pair datasets based on the ChEMBL database of biological activity. They describe in a straightforward manner the derivation of the datasets and basic concepts relevant for matched pairs. The following ...
Posy S. Reviewer Report For: Matched molecular pair-based data sets for computer-aided medicinal chemistry [version 2; peer review: 4 approved]. F1000Research 2014, 3:36 (https://doi.org/10.5256/f1000research.3753.r3733)
Reviewer Report 18 Feb 2014
Patrick Walters, Vertex Pharmaceuticals Inc., Cambridge, MA, USA 
Approved
This paper provides a review of the Bajorath group's recent work on matched molecular pairs (MMP), a technique for exploring structure activity relationships, and identifying chemical transformations that can readily modulate biological activity. The authors focus on recent applications of ...
Walters P. Reviewer Report For: Matched molecular pair-based data sets for computer-aided medicinal chemistry [version 2; peer review: 4 approved]. F1000Research 2014, 3:36 (https://doi.org/10.5256/f1000research.3753.r3508)
Reviewer Report 17 Feb 2014
Peter Ertl, Cheminformatics, Novartis Institutes for Biomedical Research, Basel, Switzerland 
Approved
The matched molecule pairs approach provides a “chemistry friendly” and intuitive way of expressing relationships among molecules and therefore this manuscript is of importance to all cheminformatics scientists interested in the study of activity cliffs, SAR analysis and in the ...
Ertl P. Reviewer Report For: Matched molecular pair-based data sets for computer-aided medicinal chemistry [version 2; peer review: 4 approved]. F1000Research 2014, 3:36 (https://doi.org/10.5256/f1000research.3753.r3707)
Reviewer Report 10 Feb 2014
Ajay Jain, Departments of Biopharmaceutical Sciences and Laboratory Medicine, University of California San Francisco, San Francisco, CA, USA 
Approved
The data set described by Hu et al. is a large set of carefully curated small molecule matched-molecular pairs (MMPs) with high-quality activity data derived from ChEMBL. The set includes examples of structure-activity cliffs, as well as matched SAR-transfer series, ...
Jain A. Reviewer Report For: Matched molecular pair-based data sets for computer-aided medicinal chemistry [version 2; peer review: 4 approved]. F1000Research 2014, 3:36 (https://doi.org/10.5256/f1000research.3753.r3509)