Introduction
The matched molecular pair (MMP) concept is widely applied in medicinal chemistry1–4. An MMP is defined as a pair of compounds that are only distinguished by a structural modification at a single site1, i.e., the exchange of a substructure, termed a chemical transformation5. MMPs are attractive tools for computational analysis because they can be algorithmically generated and they make it possible to associate defined structural modifications at the level of compound pairs with chemical property changes, including biological activity2–4. MMPs are usually chemically intuitive and easily accessible, which helps to bridge the gap between computational analysis and the practice of medicinal chemistry.
In the context of different studies, we have systematically generated MMPs through the mining of publicly available compound activity data. All possible MMPs have been derived from compounds active against currently available pharmaceutical targets. These MMPs have then been used to explore structure-activity relationships (SARs) on a large scale and from different viewpoints.
In a previous data article, we reported and made publicly available a number of different data sets and computational tools developed in our laboratory6. Here we describe three recently developed MMP-based data structures, which should be of interest for SAR analysis and compound design, and we also provide up-to-date versions of the corresponding data sets. It is anticipated that these data sets will be helpful as a resource for computer-aided medicinal chemistry applications. The data sets comprise MMP-based activity cliffs (i.e., MMP-cliffs), SAR transfer series, and MMPs generated on the basis of retrosynthetic fragmentation rules. All of them were derived from the bioactive compounds currently available in the ChEMBL database (release 17)7,8. Only high-confidence activity data (as specified below) were considered. MMP-cliffs, SAR transfer series, and retrosynthetic MMPs provide comprehensive sources of SAR information. In addition, retrosynthetic MMPs are thought to increase the utility of computational MMP analysis for practical chemistry efforts because these second-generation MMPs consider reaction information during molecular fragmentation. This sets them apart from standard MMPs, which originate from the systematic fragmentation of all possible exocyclic single bonds in a molecule (as detailed below).
Materials and methods
Concepts
(1) Activity cliffs are generally defined as pairs or groups of compounds that are structurally similar and have large differences in potency9–11. Accordingly, activity cliffs usually have high SAR information content (because small chemical changes in similar or analogous compounds lead to large potency effects). The assessment of activity cliffs requires clearly defined similarity and potency difference criteria9–11. The formation of an MMP can be considered as a similarity criterion, which is similarity metric-free and often chemically more intuitive than the use of calculated molecular similarity11,12. MMP formation as a similarity criterion has led to the introduction of MMP-cliffs12. For MMP-cliffs, a difference in potency of at least two orders of magnitude between cliff-forming compounds was set as a potency difference criterion12. Figure 1 shows exemplary MMP-cliffs.

Figure 1. MMP-cliffs.
Six representative MMP-cliffs for three targets belonging to different target families are shown; (a) muscarinic acetylcholine receptor M3, (b) serine/threonine-protein kinase c-TAK1, (c) matrix metalloproteinase-2. The pKi value of each compound is provided and the structural differences between cliff-forming compounds are highlighted in red.
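The MMP-cliff criterion described in (1) can be illustrated with a minimal sketch. It assumes that potency is available on a logarithmic scale (pKi or pIC50), so that a difference of 2.0 units corresponds to two orders of magnitude. The function names and the example values are hypothetical and are not taken from the released data sets.

```python
import math


def to_pki(ki_nm: float) -> float:
    """Convert a Ki value given in nM to pKi (negative log10 of the molar value)."""
    if ki_nm <= 0:
        raise ValueError("Ki must be positive")
    return 9.0 - math.log10(ki_nm)


def is_mmp_cliff(potency_a: float, potency_b: float, min_delta: float = 2.0) -> bool:
    """Return True if two MMP-forming compounds differ in logarithmic potency
    by at least `min_delta` orders of magnitude."""
    return abs(potency_a - potency_b) >= min_delta


# Hypothetical compound pair that already forms an MMP.
pair_is_cliff = is_mmp_cliff(to_pki(1.0), to_pki(500.0))
```

Setting `min_delta=1.0` would correspond to the less stringent criterion of one order of magnitude.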
(2) SAR transfer can be rationalized in different ways. For example, a compound series might display similar potency progression against two different targets13. Alternatively, two different compound series with corresponding analogs, i.e., series having different core structures and containing compounds with pairwise corresponding substitutions, might display similar potency progression against a given target14. Such SAR transfer series displaying similar target-specific SAR behavior are often sought after in medicinal chemistry as alternative compounds for optimization. Here we focus on these target-based SAR transfer series. Figure 2 shows an example.

Figure 2. SAR transfer series.
An exemplary target-based SAR transfer series is shown. Compound pairs are arranged in the order of increasing potency (from the bottom to the top). Potency progression is monitored by corresponding pairs of color-coded dots using a continuous color spectrum from green (lowest potency value (pKi = 5.7) in the compound data set), over yellow to red (highest potency value; pKi = 9.0). The pKi value of each compound is provided. The core structures are drawn in black and the substituents in red. The compounds are active against serine/threonine-protein kinase D2.
(3) Computational generation of MMPs typically involves molecular fragmentation through the systematic deletion of exocyclic single bonds5. Hence, the resulting fragments representing a molecular core and substituent are not derived considering chemical reactions. Accordingly, a transformation relating MMP-forming compounds to each other might not necessarily be interpretable from a synthetic perspective. The synthetic accessibility of MMPs might therefore be further improved by considering reaction information during molecular fragmentation. This has been accomplished by applying the well-known retrosynthetic combinatorial analysis procedure (RECAP) rules15, leading to the introduction of RECAP-MMPs16. Representative examples are shown in Figure 3. In addition, exemplary differences between standard MMPs and RECAP-MMPs are illustrated in Figure 4.

Figure 3. RECAP-MMPs.
In (a)–(d), four exemplary RECAP-MMPs representing different retrosynthetic rules are shown. For each RECAP-MMP, the chemical transformation is highlighted in red.

Figure 4. Standard MMPs vs. RECAP-MMPs.
Two pairs of compounds that form both standard MMPs and RECAP-MMPs are shown. For each pair, the structural differences between compounds are highlighted. The chemical transformation associated with the standard MMP is colored in red, while the transformation of the RECAP-MMP corresponds to the combination of fragments colored in red and blue.
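The pairing step that follows fragmentation, as described in (3), can be sketched as a simple index keyed by the core. This is only an illustration of the general idea and not the in-house implementation: it assumes that fragmentation (by exocyclic single bonds or by RECAP rules) has already been carried out elsewhere and supplies (compound, core, substituent) records. All identifiers and fragment strings below are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def find_mmps(fragmentations):
    """Pair compounds that share a core but differ in the exchanged substituent.

    `fragmentations` is an iterable of (compound_id, core, substituent) tuples
    produced by a separate fragmentation step.
    """
    index = defaultdict(list)
    for compound_id, core, substituent in fragmentations:
        index[core].append((compound_id, substituent))

    pairs = []
    for core, members in index.items():
        for (id_a, sub_a), (id_b, sub_b) in combinations(members, 2):
            # Skip self-pairs and identical substituents (no transformation).
            if id_a != id_b and sub_a != sub_b:
                pairs.append((id_a, id_b, core, f"{sub_a}>>{sub_b}"))
    return pairs


# Hypothetical single-cut fragmentations written as SMILES-like strings.
records = [
    ("cpd1", "[*]c1ccccc1", "[*]Cl"),
    ("cpd2", "[*]c1ccccc1", "[*]Br"),
    ("cpd3", "[*]c1ccncc1", "[*]Cl"),
]
mmps = find_mmps(records)
```

Only cpd1 and cpd2 share a core here, so a single pair with the transformation `[*]Cl>>[*]Br` is returned.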
MMP generation
For the generation of MMP-cliffs, SAR transfer series, and RECAP-MMPs, transformation size restrictions that limit transformations to meaningful chemical substitutions were introduced12. Specifically, the common core structure had to be at least twice the size of each exchanged substructure. Furthermore, the difference in size of the exchanged fragments was limited to at most eight non-hydrogen atoms and the maximal size of an exchanged fragment was set to 13 non-hydrogen atoms12. Therefore, the largest permitted transformations included, for example, the addition of a substituted ring to a compound or the replacement of a five- or six-membered ring with a substituted condensed two-ring system (with a maximum of 13 atoms). All possible transformation size-restricted MMPs and RECAP-MMPs were calculated using an in-house implementation of the algorithm by Hussain and Rea5 that utilizes the OpenEye toolkit17.
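The three transformation size restrictions stated above can be written as a single check. The sketch assumes that the non-hydrogen atom counts of the common core and of both exchanged fragments are already known; the function and parameter names are hypothetical.

```python
def passes_size_restrictions(core_atoms: int, frag_a_atoms: int, frag_b_atoms: int,
                             max_fragment: int = 13, max_difference: int = 8) -> bool:
    """Apply the size restrictions using non-hydrogen atom counts.

    1. The core must be at least twice the size of each exchanged fragment.
    2. The exchanged fragments may differ by at most `max_difference` atoms.
    3. Neither exchanged fragment may exceed `max_fragment` atoms.
    """
    larger = max(frag_a_atoms, frag_b_atoms)
    return (
        core_atoms >= 2 * larger
        and abs(frag_a_atoms - frag_b_atoms) <= max_difference
        and larger <= max_fragment
    )


# Hypothetical example: a 20-atom core with a 1-atom vs. 6-atom exchange.
accepted = passes_size_restrictions(20, 1, 6)
```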
Compounds and activity data
Compound data were taken from the latest version of ChEMBL (release 17)7,8. Only compounds with direct interactions (i.e., target relationship type “D”) against human targets at the highest confidence level (target confidence score 9) were selected. Two types of potency measurements were separately considered, i.e., Ki (equilibrium constant) and IC50 (half-maximal inhibition concentration) values. In order to ensure high data confidence, inactive or inconclusive compounds and compounds with approximate measurements such as “>”, “<”, or “∼” were not considered. For compounds with multiple measurements against the same target, the geometric mean was calculated as the final potency annotation, provided that all values fell within one order of magnitude; otherwise, the compound was discarded. All qualifying compounds were further organized into target sets. A total of 661 and 1203 target sets (consisting of compounds with reported specific activity against a given target) were collected for the Ki- and IC50-based subsets, respectively, as reported in Table 1. The target sets contained a total of 45,353 and 95,685 compounds and 77,421 and 135,291 potency measurements for the Ki and IC50 subsets, respectively. These target sets provided the basis for the generation of all MMPs.
Table 1. Data sets.
| Number of | Ki | IC50 |
|---|---|---|
| Targets | 661 | 1203 |
| Compounds | 45,353 | 95,685 |
| Measurements | 77,421 | 135,291 |
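The rule for compounds with multiple measurements against the same target, described under "Compounds and activity data", can be sketched as follows. The sketch assumes that "within one order of magnitude" means that the ratio of the largest to the smallest value does not exceed 10; this boundary handling is an assumption, as are the function name and the example values.

```python
import math


def aggregate_potency(values_nm):
    """Return the geometric mean of repeated potency measurements (e.g., in nM),
    or None if the compound should be discarded because its values span more
    than one order of magnitude."""
    if not values_nm:
        return None
    if any(v <= 0 for v in values_nm):
        raise ValueError("potency values must be positive")
    if max(values_nm) / min(values_nm) > 10:
        return None  # inconsistent measurements: discard the compound
    mean_log = sum(math.log10(v) for v in values_nm) / len(values_nm)
    return 10 ** mean_log


# Hypothetical measurements for one compound-target combination.
consistent = aggregate_potency([10.0, 40.0])
inconsistent = aggregate_potency([1.0, 200.0])
```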
Results
As a follow-up on the original publications in which MMP-cliffs12, SAR transfer series14, and RECAP-MMPs16 were introduced, all corresponding data sets have been re-generated on the basis of ChEMBL release 17, hence providing up-to-date versions for release. Separate data subsets have been generated for different types of well-defined potency measurements (i.e., assay-dependent IC50 vs. assay-independent Ki values) to avoid inconsistencies due to simultaneous consideration of different potency measurements that cannot be directly compared.
MMP-cliffs
Figure 1 illustrates small chemical changes in compound pairs leading to large potency differences that are captured by MMP-cliffs. For ease of structural interpretation, we currently prefer MMP-based activity cliff representations over alternative representations that rely on calculated similarity values11. Table 2 provides the MMP-cliff statistics for the current data set. When a potency difference of at least 100-fold between cliff-forming compounds was required, more than 20,000 and 25,000 MMP-cliffs were obtained on the basis of Ki and IC50 measurements, respectively. These MMP-cliffs corresponded to ~5% of all MMPs that were generated from ChEMBL compounds with high-confidence activity data. They covered 293 and 500 different targets on the basis of Ki and IC50 measurements, respectively. In addition to this more conservative potency difference cutoff, MMP-cliffs were also identified by applying a less stringent criterion, i.e., two compounds forming an MMP were required to have a potency difference of at least one order of magnitude. In this case, as reported in Table 2, nearly 99,000 and more than 126,000 MMP-cliffs were detected in 392 and 726 targets for the Ki and IC50 subsets, respectively. The proportion of MMP-cliffs increased to approx. 25%.
Table 2. MMP and MMP-cliff statistics.
| Number of | Ki | IC50 |
|---|---|---|
| MMPs | 385,777 | 537,848 |
| Targets with MMPs | 467 | 929 |
| MMP compounds | 40,454 (89.2%) | 80,744 (84.4%) |
| ∆Potency ≥ 1 OoM: MMP-cliffs | 98,608 | 126,464 |
| ∆Potency ≥ 1 OoM: % MMP-cliffs | 25.6% | 23.5% |
| ∆Potency ≥ 1 OoM: Targets with MMP-cliffs | 392 | 726 |
| ∆Potency ≥ 1 OoM: MMP-cliff compounds | 29,976 (66.1%) | 50,413 (52.7%) |
| ∆Potency ≥ 2 OoM: MMP-cliffs | 20,073 | 25,297 |
| ∆Potency ≥ 2 OoM: % MMP-cliffs | 5.2% | 4.7% |
| ∆Potency ≥ 2 OoM: Targets with MMP-cliffs | 293 | 500 |
| ∆Potency ≥ 2 OoM: MMP-cliff compounds | 11,760 (25.9%) | 16,816 (17.6%) |

OoM: orders of magnitude.
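As a quick consistency check, the MMP-cliff percentages in Table 2 can be recomputed from the reported counts. The snippet below only restates numbers from the table; the variable and function names are hypothetical.

```python
# Counts taken from Table 2.
mmp_counts = {"Ki": 385_777, "IC50": 537_848}
cliff_counts = {
    1: {"Ki": 98_608, "IC50": 126_464},  # potency difference >= 1 order of magnitude
    2: {"Ki": 20_073, "IC50": 25_297},   # potency difference >= 2 orders of magnitude
}


def cliff_percentage(n_cliffs: int, n_mmps: int) -> float:
    """Percentage of MMPs that qualify as MMP-cliffs, rounded to one decimal."""
    return round(100.0 * n_cliffs / n_mmps, 1)


percentages = {
    (threshold, subset): cliff_percentage(count, mmp_counts[subset])
    for threshold, by_subset in cliff_counts.items()
    for subset, count in by_subset.items()
}
```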
SAR transfer series
SAR transfer series are best rationalised as pairs of compound series active against the same target that have distinct core structures and consist of corresponding pairs of analogs, as illustrated in Figure 2 for a small series with three pairs. In contrast to the original analysis of target-based SAR transfer14, which was based upon MMPs without transformation size restrictions, the current analysis has been carried out on the basis of size-restricted MMPs. This modification further supports SAR exploration (because only small chemical changes are considered) and explains a reduction in series numbers compared to the original publication. In Table 3, the numbers of different series available for the current data set are reported. A total of 1270 and 2109 matching series were obtained from the Ki and IC50 subsets, respectively. Matching series met the structural requirement of consisting of at least three pairs of corresponding analogs. In addition, the potency values of compounds associated with individual series had to span at least two orders of magnitude. From these pre-selected matching series, 157 (Ki) and 513 (IC50) SAR transfer series with at least approximate potency progression and activity against 42 and 54 targets, respectively, were obtained. A subset of 60 (Ki) and 322 (IC50) SAR transfer series displayed strictly corresponding (regular) potency progression (often over different potency ranges)14. These series were active against 23 (Ki) and 27 (IC50) different targets. The size of SAR transfer series with approximate and regular potency progression ranged from three to 12 corresponding pairs of analogs. On average, the SAR transfer series consisted of three to four pairs.
Table 3. Target-based SAR transfer series statistics.
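| Number of | Ki | IC50 |
|---|---|---|
| Matching series | 1270 | 2109 |
| T_SAR-TS | 157 | 513 |
| Targets with T_SAR-TS | 42 | 54 |
| T_SAR-TS-RP | 60 | 322 |
| Targets with T_SAR-TS-RP | 23 | 27 |

T_SAR-TS: target-based SAR transfer series; T_SAR-TS-RP: target-based SAR transfer series with regular potency progression.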
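The selection of matching series and the test for regular potency progression can be sketched as follows. Two simplifying assumptions are made that go beyond the text: each series is represented as a list of logarithmic potency values ordered so that position i in both lists refers to corresponding analogs, and "strictly corresponding (regular) potency progression" is read as an identical potency rank order in both series. The exact criteria, including those for approximate progression, are defined in the original publication14 and are not reproduced here. All names and values are hypothetical.

```python
def is_matching_series(series_a, series_b, min_pairs: int = 3, min_span: float = 2.0) -> bool:
    """Check the pre-selection criteria: at least `min_pairs` corresponding analog
    pairs and a potency span of at least `min_span` log units in each series."""
    if len(series_a) != len(series_b) or len(series_a) < min_pairs:
        return False
    return all(max(s) - min(s) >= min_span for s in (series_a, series_b))


def has_regular_progression(series_a, series_b) -> bool:
    """Assumed reading of regular progression: both series rank their
    corresponding analogs in the same potency order (ties keep input order)."""
    order_a = sorted(range(len(series_a)), key=lambda i: series_a[i])
    order_b = sorted(range(len(series_b)), key=lambda i: series_b[i])
    return order_a == order_b


# Hypothetical pKi values for three corresponding analog pairs.
series_a = [5.7, 7.1, 9.0]
series_b = [6.0, 6.8, 8.4]
regular = is_matching_series(series_a, series_b) and has_regular_progression(series_a, series_b)
```

The two series may cover different potency ranges and still show regular progression, since only the rank order is compared.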
RECAP-MMPs
The replacement of systematic fragmentation of exocyclic single bonds with a set of 13 retrosynthetic rules for MMP generation reduced the number of MMPs that were obtained by more than half (RECAP-MMP numbers are reported in Table 4). However, (perhaps surprisingly) large numbers of RECAP-MMPs remained for further consideration and assessment of synthetic feasibility. From the Ki and IC50 subsets, nearly 170,000 and more than 240,000 RECAP-MMPs were obtained with activity against 371 and 778 targets, respectively. Examples are shown in Figure 3.
Table 4. RECAP-MMP statistics.
| Number of | Ki | IC50 |
|---|---|---|
| RECAP-MMPs | 169,889 | 240,322 |
| Targets with RECAP-MMPs | 371 | 778 |
| RECAP-MMP compounds | 28,529 (62.9%) | 53,917 (56.3%) |
Data availability
All MMP-cliffs, SAR transfer series, and RECAP-MMPs are provided in canonical SMILES representation18 on a per-target basis separately for the Ki and IC50 subsets. The canonical SMILES representation of compounds was calculated using the Molecular Operating Environment19 on the basis of standardized molecular structures by removing solvents or ions and rebalancing protonation states. Furthermore, the canonical SMILES representation of key fragments (cores) and chemical transformations derived from MMPs and RECAP-MMPs was generated using the OpenEye toolkit17.
ZENODO: Detailed data sets of MMP-cliffs, SAR transfer series, RECAP-MMPs and compound activities, doi: 10.5281/zenodo.8418 (ref. 20).
Summary
We have described new and up-to-date MMP-based data sets comprising activity cliffs, SAR transfer series, and second-generation retrosynthetic MMPs that have been systematically generated from currently available public domain compounds with high-confidence activity data. Hence, these data sets are comprehensive and have broad target coverage. They are made available without restrictions to the scientific community to aid in SAR analysis, compound design, and other medicinal chemistry applications. It is hoped that these data sets will be of interest and use to many investigators in this field and catalyse further research efforts.
Author contributions
JB designed the study; YH, AVL, and BZ collected and organized the data; JB and YH wrote the manuscript. All authors examined the manuscript and agreed to the final content.
Competing interests
No competing interests were disclosed.
Grant information
The author(s) declared that no grants were involved in supporting this work.
Acknowledgements
We thank OpenEye Scientific Software, Inc., for the free academic license of the OpenEye Toolkits.
References
1. Kenny PW, Sadowski J: Structure modification in chemical databases. In Chemoinformatics in Drug Discovery; Oprea TI, Ed.; Wiley-VCH: Weinheim, Germany, 2005; 271–285.
2. Griffen E, Leach AG, Robb GR, et al.: Matched molecular pairs as a medicinal chemistry tool. J Med Chem. 2011; 54(22): 7739–7750.
3. Wassermann AM, Dimova D, Iyer P, et al.: Advances in computational medicinal chemistry: matched molecular pair analysis. Drug Dev Res. 2012; 73(8): 518–527.
4. Dossetter AG, Griffen EJ, Leach AG: Matched molecular pair analysis in drug discovery. Drug Discov Today. 2013; 18(15–16): 724–731.
5. Hussain J, Rea C: Computationally efficient algorithm to identify matched molecular pairs (MMPs) in large data sets. J Chem Inf Model. 2010; 50(3): 339–348.
6. Hu Y, Bajorath J: Freely available compound data sets and software tools for chemoinformatics and computational medicinal chemistry applications [v1; ref status: indexed, http://f1000r.es/Mu9krs]. F1000Res. 2012; 1: 11.
7. Gaulton A, Bellis LJ, Bento AP, et al.: ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 2012; 40(Database issue): D1100–D1107.
8. Bento AP, Gaulton A, Hersey A, et al.: The ChEMBL bioactivity database: an update. Nucleic Acids Res. 2014; 42(1): D1083–D1090.
9. Stumpfe D, Bajorath J: Exploring activity cliffs in medicinal chemistry. J Med Chem. 2012; 55(7): 2932–2942.
10. Stumpfe D, Hu Y, Dimova D, et al.: Recent progress in understanding activity cliffs and their utility in medicinal chemistry. J Med Chem. 2014; 57(1): 18–28.
11. Hu Y, Stumpfe D, Bajorath J: Advancing the activity cliff concept [v1; ref status: indexed, http://f1000r.es/1wf]. F1000Res. 2013; 2: 199.
12. Hu X, Hu Y, Vogt M, et al.: MMP-Cliffs: systematic identification of activity cliffs on the basis of matched molecular pairs. J Chem Inf Model. 2012; 52(5): 1138–1145.
13. Zhang B, Hu Y, Bajorath J: SAR transfer across different targets. J Chem Inf Model. 2013; 53(7): 1589–1594.
14. Zhang B, Wassermann AM, Vogt M, et al.: Systematic assessment of compound series with SAR transfer potential. J Chem Inf Model. 2012; 52(12): 3138–3143.
15. Lewell XQ, Judd DB, Watson SP, et al.: RECAP--retrosynthetic combinatorial analysis procedure: a powerful new technique for identifying privileged molecular fragments with useful applications in combinatorial chemistry. J Chem Inf Comput Sci. 1998; 38(3): 511–522.
16. de la Vega de León A, Bajorath J: Matched molecular pairs derived by retrosynthetic fragmentation. Med Chem Commun. 2014; 5(1): 64–67.
17. OEChem, version 1.7.7, OpenEye Scientific Software, Inc., Santa Fe, NM, USA. 2012.
18. Weininger D: SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci. 1988; 28(1): 31–36.
19. Molecular Operating Environment (MOE), 2011.10; Chemical Computing Group Inc., 1010 Sherbrooke St. West, Suite#910, Montreal, QC, Canada, H3A 2R7, 2011.
20. Hu Y, de la Vega de León A, Zhang B, et al.: Detailed data sets of MMP-cliffs, SAR transfer series, RECAP-MMPs and compound activities. 2014.
Author response to review by Shana Posy
After completing our revision, an additional review has been obtained which we address as follows:
- In the files of MMP-cliffs and RECAP-MMPs, a column "NumOfCuts" ...
Single cut indicates that the chemical modification maps to the termini of a molecule, whereas double and triple cuts indicate that the structural changes are at internal parts. It should be noted that changes at termini do not necessarily mean R-group variation (e.g., Figure 3b) and that changes of internal parts do not necessarily mean core scaffold replacement (e.g., Figure 3c).
In response to the reviewer comments, the data sets have been updated.
We thank the reviewer for the comments.