<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article" dtd-version="1.2" xml:lang="en">
    <front>
        <journal-meta>
            <journal-id journal-id-type="pmc">F1000Research</journal-id>
            <journal-title-group>
                <journal-title>F1000Research</journal-title>
            </journal-title-group>
            <issn pub-type="epub">2046-1402</issn>
            <publisher>
                <publisher-name>F1000 Research Limited</publisher-name>
                <publisher-loc>London, UK</publisher-loc>
            </publisher>
        </journal-meta>
        <article-meta>
            <article-id pub-id-type="doi">10.12688/f1000research.8357.2</article-id>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Research Article</subject>
                </subj-group>
                <subj-group>
                    <subject>Articles</subject>
                    <subj-group>
                        <subject>Drug Discovery &amp; Design</subject>
                    </subj-group>
                    <subj-group>
                        <subject>Macromolecular Chemistry</subject>
                    </subj-group>
                    <subj-group>
                        <subject>Small Molecule Chemistry</subject>
                    </subj-group>
                </subj-group>
            </article-categories>
            <title-group>
                <article-title>Activity-relevant similarity values for fingerprints and implications for similarity searching</article-title>
                <fn-group content-type="pub-status">
                    <fn>
                        <p>[version 2; peer review: 3 approved]</p>
                    </fn>
                </fn-group>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Jasial</surname>
                        <given-names>Swarit</given-names>
                    </name>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Hu</surname>
                        <given-names>Ye</given-names>
                    </name>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Vogt</surname>
                        <given-names>Martin</given-names>
                    </name>
                    <uri content-type="orcid">https://orcid.org/0000-0002-3931-9516</uri>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="yes">
                    <name>
                        <surname>Bajorath</surname>
                        <given-names>J&#x00fc;rgen</given-names>
                    </name>
                    <uri content-type="orcid">https://orcid.org/0000-0002-0557-5714</uri>
                    <xref ref-type="corresp" rid="c1">a</xref>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <aff id="a1">
                    <label>1</label>Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universit&#x00e4;t, Bonn, Germany</aff>
            </contrib-group>
            <author-notes>
                <corresp id="c1">
                    <label>a</label>
                    <email xlink:href="mailto:bajorath@bit.uni-bonn.de">bajorath@bit.uni-bonn.de</email>
                </corresp>
                <fn fn-type="con">
                    <p>JB conceived the study; SJ and YH carried out the computational analysis; SJ, YH, MV, and JB analyzed the data and planned follow-up experiments; JB wrote the manuscript.</p>
                </fn>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>28</day>
                <month>4</month>
                <year>2016</year>
            </pub-date>
            <pub-date pub-type="collection">
                <year>2016</year>
            </pub-date>
            <volume>5</volume>
            <elocation-id>Chem Inf Sci-591</elocation-id>
            <history>
                <date date-type="accepted">
                    <day>22</day>
                    <month>4</month>
                    <year>2016</year>
                </date>
            </history>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2016 Jasial S et al.</copyright-statement>
                <copyright-year>2016</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <self-uri content-type="pdf" xlink:href="https://f1000research.com/articles/5-591/pdf"/>
            <abstract>
                <p>A largely unsolved problem in chemoinformatics is the issue of how calculated compound similarity relates to activity similarity, which is central to many applications. In general, activity relationships are predicted from calculated similarity values. However, there is no solid scientific foundation to bridge between calculated molecular and observed activity similarity. Accordingly, the success rate of identifying new active compounds by similarity searching is limited. Although various attempts have been made to establish relationships between calculated fingerprint similarity values and biological activities, none of these has yielded generally applicable rules for similarity searching. In this study, we have addressed the question of molecular versus activity similarity in a more fundamental way. First, we have evaluated if activity-relevant similarity value ranges could in principle be identified for standard fingerprints and distinguished from similarity resulting from random compound comparisons. Then, we have analyzed if activity-relevant similarity values could be used to guide typical similarity search calculations aiming to identify active compounds in databases. It was found that activity-relevant similarity values can be identified as a characteristic feature of fingerprints. However, it was also shown that such values cannot be reliably used as thresholds for practical similarity search calculations. In addition, the analysis presented herein helped to rationalize differences in fingerprint search performance.</p>
            </abstract>
            <kwd-group kwd-group-type="author">
                <kwd>Bioactive compounds</kwd>
                <kwd>molecular similarity</kwd>
                <kwd>similarity-property principle</kwd>
                <kwd>similarity searching</kwd>
                <kwd>fingerprints</kwd>
                <kwd>Tanimoto coefficient</kwd>
                <kwd>activity similarity</kwd>
            </kwd-group>
            <funding-group>
                <funding-statement>The author(s) declared that no grants were involved in supporting this work.</funding-statement>
            </funding-group>
        </article-meta>
        <notes>
            <sec sec-type="version-changes">
                <label>Revised</label>
                <title>Amendments from Version 1</title>
                <p>In response to the referees' comments, we have revised the manuscript as follows. Georgia McGaughey: A new Figure 3 has been added and the potency threshold discussed. Wendy Warr: Several references have been updated and a number of stylistic adjustments made. Peter Ertl: Fingerprint calculation details and a reference have been added.</p>
            </sec>
        </notes>
    </front>
    <body>
        <sec sec-type="intro">
            <title>Introduction</title>
            <p>Calculation of molecular similarity is a central task in chemoinformatics
                <sup>
                    <xref ref-type="bibr" rid="ref-1">1</xref>&#x2013;
                    <xref ref-type="bibr" rid="ref-4">4</xref>
                </sup> for which a variety of methods, chemical descriptors, and similarity measures have been introduced
                <sup>
                    <xref ref-type="bibr" rid="ref-1">1</xref>&#x2013;
                    <xref ref-type="bibr" rid="ref-7">7</xref>
                </sup>. A key aspect of the molecular similarity concept is that one often attempts to extrapolate from calculated similarity to activity similarity. In other words, it is assumed that increasing chemical similarity correlates with an increasing likelihood that two compounds share the same activity, in accord with the similarity-property principle (&#x201c;similar compounds should have similar properties&#x201d;)
                <sup>
                    <xref ref-type="bibr" rid="ref-1">1</xref>
                </sup>, a major foundation of chemoinformatics. A methodological consequence of the molecular similarity concept and similarity-property principle was the introduction of similarity searching for active compounds
                <sup>
                    <xref ref-type="bibr" rid="ref-2">2</xref>,
                    <xref ref-type="bibr" rid="ref-6">6</xref>
                </sup>. Here, similarity values are calculated for known active reference and database compounds, which are then ranked in the order of decreasing similarity to the reference(s). Classical molecular descriptors for these search calculations include 2D-fingerprints, i.e. bit string representations of chemical structures and/or properties derived from molecular graphs
                <sup>
                    <xref ref-type="bibr" rid="ref-2">2</xref>,
                    <xref ref-type="bibr" rid="ref-8">8</xref>,
                    <xref ref-type="bibr" rid="ref-9">9</xref>
                </sup>. The overlap between fingerprints is quantified as a measure of molecular similarity using metrics such as the Tanimoto coefficient (Tc)
                <sup>
                    <xref ref-type="bibr" rid="ref-2">2</xref>,
                    <xref ref-type="bibr" rid="ref-7">7</xref>
                </sup>, the gold standard in the chemoinformatics field. For a pair of compounds represented by fingerprints, the Tc value is calculated as the ratio between the number of features shared by both fingerprints and the number of features present in at least one of them. Accordingly, the Tc is a numerical measure of similarity ranging from zero (no fingerprint overlap) to one (fingerprint identity).</p>
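            <p>As a compact restatement of this definition, and assuming that each fingerprint is treated as the set of features it contains, the Tc can be written as shown below. The symbols A, B, a, b, and c are notation introduced here for illustration only.</p>
            <disp-formula>
                <tex-math><![CDATA[\mathrm{Tc}(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{c}{a + b - c}]]></tex-math>
            </disp-formula>
            <p>Here, a and b are the numbers of features present in fingerprints A and B, respectively, and c is the number of features they share. For example, two fingerprints with 10 and 8 features, 6 of which are shared, yield Tc = 6/(10 + 8 - 6) = 0.5.</p>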
            <p>Similarity search calculations exemplify the similarity conundrum in chemoinformatics: the ultimate goal is the identification of new active compounds on the basis of similarity, but activity information is not used as a search parameter. It has been shown that generally applicable Tc threshold values as an indicator of activity similarity do not exist
                <sup>
                    <xref ref-type="bibr" rid="ref-9">9</xref>
                </sup>. This is the case because similarity value distributions are compound class- and fingerprint-dependent. To further complicate matters, it has also been shown that 2D-fingerprints successfully detect structurally diverse active compounds at varying similarity levels
                <sup>
                    <xref ref-type="bibr" rid="ref-9">9</xref>,
                    <xref ref-type="bibr" rid="ref-10">10</xref>
                </sup>. In general, a continuum of similarity values is produced that may or may not indicate activity similarity, depending on the characteristics of active compounds.</p>
            <p>A limited number of attempts have been made to associate calculated similarity with observed activity similarity. For example, early investigations of compound clustering, molecular diversity, and chemical neighborhood behavior using fingerprints have indicated that, on average, 85% of compounds that yielded a Tc value of 0.85 compared to a known active molecule were also active
                <sup>
                    <xref ref-type="bibr" rid="ref-11">11</xref>&#x2013;
                    <xref ref-type="bibr" rid="ref-13">13</xref>
                </sup>. These findings were based on MACCS keys
                <sup>
                    <xref ref-type="bibr" rid="ref-14">14</xref>
                </sup>, a classical fingerprint in chemoinformatics consisting of a dictionary of 166 structural fragments, as well as UNITY fingerprints
                <sup>
                    <xref ref-type="bibr" rid="ref-13">13</xref>
                </sup> that assemble atom pathways of pre-defined lengths. However, using connectivity pathway fingerprints of different design and biological screening data to analyze the relationship between calculated similarity and observed activity similarity, it was concluded that there was only a likelihood of 30% that compounds yielding a Tc value of at least 0.85 shared the same activity
                <sup>
                    <xref ref-type="bibr" rid="ref-15">15</xref>
                </sup>. In addition, Kullback-Leibler divergence analysis from information theory and Bayesian modeling were combined
                <sup>
                    <xref ref-type="bibr" rid="ref-16">16</xref>
                </sup> to predict the recall of active compounds from fingerprint similarity searching and a conditional correlated Bernoulli model of similarity value distributions was developed
                <sup>
                    <xref ref-type="bibr" rid="ref-17">17</xref>
                </sup> to predict database rankings. Furthermore, belief theory was applied to empirically derive probabilistic relationships between calculated similarity and activity on the basis of similarity search benchmark calculations
                <sup>
                    <xref ref-type="bibr" rid="ref-18">18</xref>
                </sup>. In that study, MACCS keys and extended connectivity fingerprints (ECFPs)
                <sup>
                    <xref ref-type="bibr" rid="ref-19">19</xref>
                </sup> were used, among others. ECFPs capture layered atom environments in compounds up to a pre-determined bond diameter. When different fingerprints were compared in benchmark calculations, ECFPs often yielded the highest similarity search performance
                <sup>
                    <xref ref-type="bibr" rid="ref-8">8</xref>,
                    <xref ref-type="bibr" rid="ref-9">9</xref>
                </sup>. On the basis of probability assignment curves that related activity and similarity values for pairs of compounds to each other, it was shown, for example, that at a Tc value of 0.85 calculated with atom pathway fingerprints, ~30% of detected compound pairs shared the same activity
                <sup>
                    <xref ref-type="bibr" rid="ref-18">18</xref>
                </sup>, consistent with earlier observations
                <sup>
                    <xref ref-type="bibr" rid="ref-15">15</xref>
                </sup>. For an ECFP with bond diameter 6 (ECFP6), a Tc threshold of 0.42 yielded comparable results
                <sup>
                    <xref ref-type="bibr" rid="ref-18">18</xref>
                </sup>.</p>
            <p>Other types of fingerprints were generated exclusively on the basis of experimental activity observations, e.g. activities measured in panels of screening assays
                <sup>
                    <xref ref-type="bibr" rid="ref-20">20</xref>
                </sup>, or by combining chemical and biological criteria
                <sup>
                    <xref ref-type="bibr" rid="ref-21">21</xref>
                </sup>. These studies departed from the conceptual framework of the similarity-property principle by using activity data as descriptors and, completely or partly, circumventing similarity calculations on the basis of molecular structures.</p>
            <p>Herein, we report an analysis designed to rationalize similarity searching on the basis of different molecular comparisons, carried out on a large scale, and determine similarity values across different compound activity classes. It is shown that similarity value ranges indicative of activity can be identified for different fingerprints. However, it is also shown that such similarity values cannot be reliably used as thresholds for similarity searching, given the ratio of different molecular comparisons that are involved.</p>
        </sec>
        <sec sec-type="materials | methods">
            <title>Materials and methods</title>
            <sec>
                <title>Compound classes</title>
                <p>In a previous large-scale similarity search analysis of the ChEMBL database
                    <sup>
                        <xref ref-type="bibr" rid="ref-22">22</xref>
                    </sup>, a variety of activity classes were identified for benchmarking that were &#x201c;easy&#x201d; (i.e. yielded generally high compound recall using different fingerprints), &#x201c;preferred/intermediate&#x201d; (moderate compound recall), or &#x201c;difficult&#x201d; (low compound recall)
                    <sup>
                        <xref ref-type="bibr" rid="ref-23">23</xref>
                    </sup>. For our analysis, we have made use of this classification scheme and extracted these activity classes from ChEMBL version 20 if they contained at least 50 compounds with high-confidence assay data for human targets
                    <sup>
                        <xref ref-type="bibr" rid="ref-24">24</xref>
                    </sup> and a potency of at least 10 &#x00b5;M. It should be noted that fingerprint searching does not take activity into account as a parameter. Therefore, the potency threshold value was only applied to exclude compounds from the calculations whose activity would be considered borderline, despite the presence of high-confidence activity data. In addition, a random sample of 10,000 compounds was drawn from ZINC
                    <sup>
                        <xref ref-type="bibr" rid="ref-25">25</xref>
                    </sup> representing assumed inactive database compounds. All randomly selected (&#x201c;random&#x201d;) compounds had a molecular weight of less than 550 Da. Accordingly, all compounds with a molecular weight exceeding 550 Da were also removed from activity classes, thus balancing the potential of molecular size effects in similarity searching
                    <sup>
                        <xref ref-type="bibr" rid="ref-26">26</xref>
                    </sup>.</p>
                <p>On the basis of these criteria, 22 easy, 50 intermediate, and 30 difficult activity classes were obtained covering a wide range of targets. Easy activity classes contained a total of 2967 compounds with, on average, 135 compounds per target; intermediate activity classes contained 25,175 compounds with a mean of 504 per target, and difficult activity classes 47,109 compounds with a mean of 1570 per target. The molecular weight distributions of compounds from all categories are reported in 
                    <xref ref-type="fig" rid="f1">Figure 1</xref>. Compounds from ZINC had, overall, slightly lower molecular weights than active compounds, but the molecular weight distributions of compounds from different activity class categories were very similar.</p>
                <fig fig-type="figure" id="f1" orientation="portrait" position="float">
                    <label>Figure 1. </label>
                    <caption>
                        <title>Molecular weight ranges.</title>
                        <p>Density plots report the molecular weight (Da) distributions of compounds in all categories.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/9296/d65f1c63-2da3-4d43-a7ac-d40be6f7c355_figure1.gif"/>
                </fig>
            </sec>
            <sec>
                <title>Similarity calculations</title>
                <p>Two standard fingerprints of different design were used, MACCS and ECFP with bond diameter 4 (ECFP4). These fingerprint representations were calculated using an in-house script. For MACCS, settings from RDKit
                    <sup>
                        <xref ref-type="bibr" rid="ref-27">27</xref>
                    </sup> were used. For ECFP4, the original design was re-implemented
                    <sup>
                        <xref ref-type="bibr" rid="ref-19">19</xref>
                    </sup>. As a similarity metric, the Tc was calculated.</p>
                <p>Systematic pairwise similarity calculations were carried out for all individual activity classes (active vs. active), the random category (random vs. random), and active vs. random compounds.</p>
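                <p>The three types of pairwise comparisons described above can be sketched as follows. This is a minimal illustration and not the in-house script used in the study: fingerprints are assumed to be plain Python sets of feature identifiers, the function names (tanimoto, within_set, between_sets) are hypothetical, and the toy feature sets are not real MACCS or ECFP4 output.</p>
                <preformat preformat-type="code">
```python
from itertools import combinations, product


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of feature IDs."""
    union = len(fp_a | fp_b)
    if union == 0:
        # Two empty fingerprints share no features; 0.0 is an assumed convention.
        return 0.0
    return len(fp_a.intersection(fp_b)) / union


def within_set(fps):
    """Tc values for all unordered pairs within one compound set."""
    return [tanimoto(a, b) for a, b in combinations(fps, 2)]


def between_sets(fps_x, fps_y):
    """Tc values for all pairs formed across two compound sets."""
    return [tanimoto(a, b) for a, b in product(fps_x, fps_y)]


# Toy feature sets (hypothetical; not real MACCS or ECFP4 output).
actives = [{1, 2, 3, 4}, {1, 2, 3, 5}, {2, 3, 4, 6}]
randoms = [{7, 8, 9}, {1, 8, 10}]

act_vs_act = within_set(actives)              # active vs. active
rand_vs_rand = within_set(randoms)            # random vs. random
act_vs_rand = between_sets(actives, randoms)  # active vs. random
```
</preformat>
                <p>In this sketch, comparisons within a set count each unordered pair once, whereas comparisons between sets pair every active compound with every random compound.</p>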
            </sec>
        </sec>
        <sec sec-type="results | discussion">
            <title>Results and discussion</title>
            <sec>
                <title>Comparison of compounds belonging to different categories</title>
                <p>During similarity searching, active reference compounds are compared to inactive database compounds or desired compounds having the same activity (&#x201c;hits&#x201d;). Thus, similarity searching can be mimicked by systematically comparing active compounds having the same activity with each other and active compounds to random database compounds. Comparison of random database compounds with each other is not carried out during traditional similarity searching, but the similarity value distribution resulting from this comparison can be monitored as an additional reference.</p>
            </sec>
            <sec>
                <title>Distribution of combined similarity values</title>
                <p>
                    <xref ref-type="fig" rid="f2">Figure 2</xref> shows the distribution of Tc values from active vs. active, random vs. active, and random vs. random compound comparisons using MACCS and ECFP4. For this comparison, Tc values obtained for all activity classes were combined. Thus, the resulting distribution represented ~55 million Tc values for compounds active against 102 targets. Comparison of ZINC compounds yielded 50 million Tc values. Their distribution was regarded as representing global chemical similarity (although many ZINC compounds are considered &#x201c;drug-like&#x201d;) and thus termed &#x201c;chemical similarity distribution&#x201d;.</p>
                <fig fig-type="figure" id="f2" orientation="portrait" position="float">
                    <label>Figure 2. </label>
                    <caption>
                        <title>Distribution of combined similarity values.</title>
                        <p>Density plots of Tc values are shown for similarity comparison using (
                            <bold>a</bold>) MACCS and (
                            <bold>b</bold>) ECFP4. Compared were active compounds in each activity class (Act vs Act, purple), 10,000 random ZINC compounds with each activity class (Rand vs Act, maroon), and 10,000 random compounds (Rand vs Rand, green). Similarity values of all 102 activity classes were combined. Dashed vertical lines indicate the means of the distributions.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/9296/d65f1c63-2da3-4d43-a7ac-d40be6f7c355_figure2.gif"/>
                </fig>
                <p>
                    <xref ref-type="fig" rid="f2">Figure 2a</xref> shows that global similarity values calculated using MACCS yielded a normal distribution, given its very large sample size, which was centered on a MACCS Tc value of 0.4. The comparison of random vs. active compounds using MACCS, resulting in a total of ~753 million Tc values, 50 million of which were randomly sampled for the generation of density plots, produced a nearly identical normal distribution, also reflecting randomness. By contrast, the distribution of Tc values from active compounds, albeit significantly overlapping with the reference distributions, was shifted to the right, centered on a MACCS Tc value of 0.47. This distribution was regarded as representing activity-relevant similarity, given that it originated from more than 100 qualifying activity classes.</p>
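                <p>The reported numbers of Tc values can be reproduced with simple counting, under two assumptions made here for illustration: that random vs. random comparisons cover all unordered pairs of the 10,000 ZINC compounds, and that every active compound is compared to every random compound. For the random vs. random comparisons:</p>
                <disp-formula>
                    <tex-math><![CDATA[\binom{10000}{2} = \frac{10000 \times 9999}{2} = 49{,}995{,}000 \approx 5.0 \times 10^{7}]]></tex-math>
                </disp-formula>
                <p>For the random vs. active comparisons, using the compound totals of the easy, intermediate, and difficult activity classes given in the Materials and methods section:</p>
                <disp-formula>
                    <tex-math><![CDATA[(2967 + 25175 + 47109) \times 10000 = 75251 \times 10000 = 752{,}510{,}000 \approx 7.53 \times 10^{8}]]></tex-math>
                </disp-formula>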
                <p>
                    <xref ref-type="fig" rid="f2">Figure 2b</xref> shows that calculations using ECFP4 produced very different Tc value distributions. Compared to MACCS, ECFP4 Tc value distributions were shifted towards much lower Tc values and confined to small value ranges mostly falling within the interval [0.0, 0.2] (which should be known to similarity search practitioners). The chemical similarity distribution and random vs. active distribution were centered on an ECFP4 Tc value of 0.11. Also in this case, a slight shift of the activity-relevant distribution towards higher values was observed, centered on an ECFP4 Tc of 0.15. Hence, for both ZINC and ChEMBL compounds, ECFP4 calculations mostly covered only small Tc value ranges.</p>
                <p>It should also be noted that the distribution of similarity values of compounds sharing the same activity often covered a wide range, as illustrated in 
                    <xref ref-type="fig" rid="f3">Figure 3</xref>. Thus, many activity classes were structurally diverse.</p>
                <fig fig-type="figure" id="f3" orientation="portrait" position="float">
                    <label>Figure 3. </label>
                    <caption>
                        <title>Pairs of active compounds with varying similarity values.</title>
                        <p>Shown are five exemplary compounds active against the voltage-gated T-type calcium channel alpha-1H subunit (an &#x201c;easy&#x201d; activity class). Five pairwise similarity values calculated using MACCS (red) and ECFP4 (blue), respectively, are reported.</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/9296/d65f1c63-2da3-4d43-a7ac-d40be6f7c355_figure3.gif"/>
                </fig>
            </sec>
            <sec>
                <title>Distributions for different activity class categories</title>
                <p>
                    <xref ref-type="fig" rid="f4">Figure 4</xref> shows corresponding Tc value distributions that were separately generated for easy, intermediate, and difficult activity classes. The differences between these value distributions correlated well with the different similarity search performance observed for these activity classes.</p>
                <fig fig-type="figure" id="f4" orientation="portrait" position="float">
                    <label>Figure 4. </label>
                    <caption>
                        <title>Distribution of similarity values for different activity class categories.</title>
                        <p>Density plots of Tc values are shown for similarity comparison using (
                            <bold>a</bold>) MACCS and (
                            <bold>b</bold>) ECFP4 according to 
                            <xref ref-type="fig" rid="f2">Figure 2</xref>. In this case, Tc value distributions were separately recorded for each activity class category (easy, intermediate, and difficult). In addition, easy activity classes were compared to a reduced random set of 1000 ZINC compounds (reported in the upper right panels).</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/9296/d65f1c63-2da3-4d43-a7ac-d40be6f7c355_figure4.gif"/>
                </fig>
                <p>In 
                    <xref ref-type="fig" rid="f4">Figure 4a</xref>, the chemical similarity and random vs. active distributions were again essentially identical and centered on a MACCS Tc value of 0.4, although the sample sizes of active compounds were smaller in this case. The random vs. active distribution did not change when the random sample was reduced in size from 10,000 to 1000 ZINC compounds, indicating that the distribution was stable. Equivalent observations were made for the distributions of ECFP4 values shown in 
                    <xref ref-type="fig" rid="f4">Figure 4b</xref>.
                </p>
                <p>However, for both MACCS and ECFP4, gradual shifts in the distributions of Tc values for different activity class categories were observed. From difficult via intermediate to easy activity classes, the activity-relevant distributions shifted towards higher Tc values. Thus, corresponding to increasing similarity search performance, the comparison of active compounds produced higher Tc values than random vs. active comparisons, leading to an enrichment of active compounds at higher positions in similarity-based rankings. For easy activity classes, the shape of the distributions departed from normal distributions and became multi-modal, probably reflecting activity class-dependent differences in Tc values. These distributions displayed a significant shift towards higher Tc values, with means of 0.6 and 0.28 for MACCS and ECFP4, respectively.</p>
            </sec>
            <sec>
                <title>Activity-relevant similarity</title>
                <p>Comparison of the distributions in 
                    <xref ref-type="fig" rid="f4">Figure 4</xref> made it possible to delineate activity-relevant similarity value ranges. In 
                    <xref ref-type="fig" rid="f4">Figure 4a</xref>, the chemical similarity and random vs. active distributions for MACCS matched the baseline at a value of ~0.8, whereas a significant proportion of Tc values of 0.8 or greater were observed for comparisons of active compounds, especially for easy activity classes. Equivalent observations were made for an ECFP4 Tc value of ~0.3 shown in 
                    <xref ref-type="fig" rid="f4">Figure 4b</xref>. Thus, for MACCS and ECFP4, there was a much higher probability that comparison of active compounds yielded a Tc value of at least 0.8 and 0.3, respectively, than comparison of active vs. random (or random vs. random) compounds. 
                    <xref ref-type="fig" rid="f5">Figure 5</xref> reports for all activity class categories the percentages of Tc values of at least 0.8 (MACCS, 
                    <xref ref-type="fig" rid="f5">Figure 5a</xref>) and 0.3 (ECFP4, 
                    <xref ref-type="fig" rid="f5">Figure 5b</xref>). These percentages significantly increased from difficult via intermediate to easy activity classes (again mirroring similarity search performance), reaching medians of 17.7% (MACCS) and 37.5% (ECFP4), with a significant spread of percentages among easy activity classes, as revealed by the box plot representations in 
                    <xref ref-type="fig" rid="f5">Figure 5</xref>.</p>
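                <p>The percentages summarized in the box plots can be computed along the following lines. This is a sketch under stated assumptions, not the procedure used in the study: the Tc values and class names are hypothetical, the helper percent_at_or_above is introduced here for illustration, and only the MACCS threshold of 0.8 is taken from the text.</p>
                <preformat preformat-type="code">
```python
from statistics import median


def percent_at_or_above(tc_values, threshold):
    """Percentage of Tc values that are greater than or equal to threshold."""
    if not tc_values:
        raise ValueError("tc_values must not be empty")
    hits = sum(1 for tc in tc_values if tc >= threshold)
    return 100.0 * hits / len(tc_values)


# Hypothetical active vs. active Tc values for three activity classes.
tc_by_class = {
    "class_a": [0.85, 0.90, 0.40, 0.55],
    "class_b": [0.30, 0.82, 0.45, 0.50, 0.20],
    "class_c": [0.35, 0.25, 0.60, 0.10],
}

MACCS_THRESHOLD = 0.8  # activity-relevant MACCS Tc value discussed in the text

# One percentage per activity class, then the median across classes.
percentages = {
    name: percent_at_or_above(values, MACCS_THRESHOLD)
    for name, values in tc_by_class.items()
}
median_percentage = median(percentages.values())
```
</preformat>
                <p>Computing one percentage per activity class before taking the median mirrors a per-category box plot, in which each class contributes a single data point regardless of its size.</p>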
                <fig fig-type="figure" id="f5" orientation="portrait" position="float">
                    <label>Figure 5. </label>
                    <caption>
                        <title>Distribution of similarity values in activity-relevant ranges.</title>
                        <p>Box plots report the distribution of the percentage of Tc values falling into activity-relevant ranges for each category of activity classes using (
                            <bold>a</bold>) MACCS and (
                            <bold>b</bold>) ECFP4. Each box plot gives the minimum percentage of Tc values in the activity-relevant range per category (bottom line), the first quartile (lower boundary of the box), the median (thick line), the third quartile (upper boundary of the box), and the maximum percentage of Tc values (top line).</p>
                    </caption>
                    <graphic orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/9296/d65f1c63-2da3-4d43-a7ac-d40be6f7c355_figure5.gif"/>
                </fig>
                <p>Taken together, these findings show that it was possible to detect activity-relevant Tc value ranges for different fingerprints by systematically comparing Tc value distributions for active and randomly selected compounds.</p>
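                <p>The comparison underlying these findings can be illustrated with a minimal sketch. It assumes that each fingerprint is represented as a set of on-bit positions. The function names <monospace>tanimoto</monospace> and <monospace>percent_at_or_above</monospace> and the toy bit sets are hypothetical and serve only as an illustration; they are not the implementation used in this study.</p>
                <code language="python">
```python
from itertools import combinations


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient (Tc) of two fingerprints given as sets of on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention assumed here for two empty fingerprints
    return len(a.intersection(b)) / union


def percent_at_or_above(fingerprints, threshold):
    """Percentage of all pairwise Tc values that reach or exceed the threshold."""
    values = [tanimoto(x, y) for x, y in combinations(fingerprints, 2)]
    if not values:
        return 0.0
    hits = sum(1 for v in values if v >= threshold)
    return 100.0 * hits / len(values)


# Toy "activity class" with hypothetical bit positions (not real MACCS keys)
toy_class = [
    {1, 2, 3, 4, 5},
    {1, 2, 3, 4, 6},
    {1, 2, 3, 4, 5, 6},
    {7, 8, 9},
]

print(percent_at_or_above(toy_class, 0.8))
```
                </code>
                <p>Applying the same function to active vs. active and to active vs. random fingerprint pairs, and comparing the two percentages at a given threshold, corresponds to the type of comparison reported above.</p>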
            </sec>
            <sec>
                <title>Implications for similarity searching</title>
                <p>A key question was whether activity-relevant similarity values might also serve as threshold values for similarity searching, contrary to the conclusions drawn from earlier studies analyzing compound recall rates and rankings. In this context, the ratio of different compound comparisons involved in similarity search calculations must be considered, as discussed below. The activity-relevant Tc value ranges of &#x2265; 0.8 (MACCS) and &#x2265; 0.3 (ECFP4) derived from comparison of similarity value distributions are plausible likelihood estimates, as further supported by the data in 
                    <xref ref-type="table" rid="T1">Table 1</xref>. For example, for easy activity classes, 16% of MACCS Tc values for comparisons of active compounds reached or exceeded 0.8, whereas this was the case for only 0.005% of random vs. active comparisons. For ECFP4, the corresponding percentages were 38.2% and 0.03%, respectively. However, in a typical similarity search trial, many more active vs. random compound comparisons are carried out than active vs. active comparisons, given that only small numbers of compounds with a specific activity are usually available in databases. For example, consider the most favorable case: easy activity classes, a search with MACCS using a single active reference compound, and a similarity threshold value of 0.8. If a database of 100,000 inactive and 50 active compounds were screened, eight active compounds would be detected together with five false positives on the basis of the percentages given in 
                    <xref ref-type="table" rid="T1">Table 1</xref>. If 500,000 database compounds were screened, the number of false positives would increase to 25 (given that the active vs. random distribution was normal). For ECFP4, applying a similarity threshold value of 0.3 under the same search conditions, screening a database with 50 active and 500,000 inactive compounds would result in 19 true positives and 150 false positives. Hence, even for easy activity classes, activity-relevant similarity values could not be reliably applied as thresholds in a typical similarity search scenario because of the large discrepancy in the numbers of the different types of comparisons. Furthermore, for difficult search tasks, the number of true-positive detections would be reduced significantly and the number of false positives would further increase (
                    <xref ref-type="table" rid="T1">Table 1</xref>).</p>
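                <p>The arithmetic behind these examples can be reproduced with a short sketch. It assumes that the percentages in Table 1 can be applied directly as per-compound detection rates for a single reference compound. The function name <monospace>expected_hits</monospace> is hypothetical, and the precision value it returns is an added illustration that is not reported in the study.</p>
                <code language="python">
```python
def expected_hits(n_active, n_inactive, active_pct, random_pct):
    """Expected true positives, false positives, and precision at a threshold.

    active_pct: percentage of active vs. active Tc values at or above the threshold
    random_pct: percentage of active vs. random Tc values at or above the threshold
    """
    true_pos = n_active * active_pct / 100.0
    false_pos = n_inactive * random_pct / 100.0
    selected = true_pos + false_pos
    # Precision: fraction of selected compounds that are active (added illustration)
    precision = true_pos / selected if selected > 0 else 0.0
    return true_pos, false_pos, precision


# Scenarios from the text: easy activity classes, 50 active database compounds
scenarios = [
    ("MACCS, Tc at least 0.8", 50, 100_000, 16.0, 0.005),
    ("MACCS, Tc at least 0.8", 50, 500_000, 16.0, 0.005),
    ("ECFP4, Tc at least 0.3", 50, 500_000, 38.2, 0.03),
]

for label, n_act, n_inact, act_pct, rand_pct in scenarios:
    tp, fp, prec = expected_hits(n_act, n_inact, act_pct, rand_pct)
    print(f"{label}, {n_inact} inactives: TP={tp:.1f}, FP={fp:.1f}, precision={prec:.2f}")
```
                </code>
                <p>Under these assumptions, the sketch yields 8 true and 5 false positives, 8 true and 25 false positives, and approximately 19 true and 150 false positives for the three scenarios, matching the numbers given above. The fraction of selected compounds that are active falls from about 0.62 to 0.24 for MACCS as the database grows and is about 0.11 for ECFP4.</p>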
                <table-wrap id="T1" orientation="portrait" position="anchor">
                    <label>Table 1. </label>
                    <caption>
                        <title>Similarity values falling into activity-relevant ranges.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="center" colspan="3" rowspan="1">MACCS</th>
                            </tr>
                            <tr>
                                <th align="center" colspan="1" rowspan="2">Activity
                                    <break/>Class
                                    <break/>category</th>
                                <th align="center" colspan="2" rowspan="1">Number of Tc values &#x2265; 0.8</th>
                            </tr>
                            <tr>
                                <th align="center" colspan="1" rowspan="1">Act vs Act</th>
                                <th align="center" colspan="1" rowspan="1">Act vs Rand (10000
                                    <break/>compounds)</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="center" colspan="1" rowspan="1">
                                    <bold>
                                        <italic toggle="yes">Easy</italic>
                                    </bold>
                                </td>
                                <td align="center" colspan="1" rowspan="1">39178 
                                    <bold>(16.0%)</bold>
                                </td>
                                <td align="center" colspan="1" rowspan="1">1613 
                                    <bold>(0.005%)</bold>
                                </td>
                            </tr>
                            <tr>
                                <td align="center" colspan="1" rowspan="1">
                                    <bold>
                                        <italic toggle="yes">Intermediate</italic>
                                    </bold>
                                </td>
                                <td align="center" colspan="1" rowspan="1">538349 
                                    <bold>(5.0%)</bold>
                                </td>
                                <td align="center" colspan="1" rowspan="1">27423 
                                    <bold>(0.01%)</bold>
                                </td>
                            </tr>
                            <tr>
                                <td align="center" colspan="1" rowspan="1">
                                    <bold>
                                        <italic toggle="yes">Difficult</italic>
                                    </bold>
                                </td>
                                <td align="center" colspan="1" rowspan="1">559442 
                                    <bold>(1.2%)</bold>
                                </td>
                                <td align="center" colspan="1" rowspan="1">70057 
                                    <bold>(0.01%)</bold>
                                </td>
                            </tr>
                            <tr>
                                <th align="center" colspan="3" rowspan="1">ECFP4</th>
                            </tr>
                            <tr>
                                <th align="center" colspan="1" rowspan="2">Activity
                                    <break/>Class
                                    <break/>category</th>
                                <th align="center" colspan="2" rowspan="1">Number of Tc values &#x2265; 0.3</th>
                            </tr>
                            <tr>
                                <th align="center" colspan="1" rowspan="1">Act vs Act</th>
                                <th align="center" colspan="1" rowspan="1">Act vs Rand (10000
                                    <break/>compounds)</th>
                            </tr>
                            <tr>
                                <td align="center" colspan="1" rowspan="1">
                                    <bold>
                                        <italic toggle="yes">Easy</italic>
                                    </bold>
                                </td>
                                <td align="center" colspan="1" rowspan="1">93611 
                                    <bold>(38.2%)</bold>
                                </td>
                                <td align="center" colspan="1" rowspan="1">8282 
                                    <bold>(0.03%)</bold>
                                </td>
                            </tr>
                            <tr>
                                <td align="center" colspan="1" rowspan="1">
                                    <bold>
                                        <italic toggle="yes">Intermediate</italic>
                                    </bold>
                                </td>
                                <td align="center" colspan="1" rowspan="1">1041694 
                                    <bold>(10.9%)</bold>
                                </td>
                                <td align="center" colspan="1" rowspan="1">104003 
                                    <bold>(0.04%)</bold>
                                </td>
                            </tr>
                            <tr>
                                <td align="center" colspan="1" rowspan="1">
                                    <bold>
                                        <italic toggle="yes">Difficult</italic>
                                    </bold>
                                </td>
                                <td align="center" colspan="1" rowspan="1">1854822 
                                    <bold>(4.1%)</bold>
                                </td>
                                <td align="center" colspan="1" rowspan="1">233943 
                                    <bold>(0.05%)</bold>
                                </td>
                            </tr>
                        </tbody>
                    </table>
                    <table-wrap-foot>
                        <fn>
                            <p>Reported are the numbers of pairwise compound comparisons yielding Tc values in the activity-relevant ranges identified for the MACCS and ECFP4 fingerprints, together with the corresponding percentages (bold). &#x201c;Act&#x201d; stands for active and &#x201c;Rand&#x201d; for random.</p>
                        </fn>
                    </table-wrap-foot>
                </table-wrap>
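                <p>As an illustrative reading of Table 1, the two percentages in each row can be divided to obtain an approximate likelihood ratio. This is a sketch only: the ratio is not reported in the study, the name <monospace>likelihood_ratio</monospace> is hypothetical, and the values inherit the rounding of the tabulated percentages.</p>
                <code language="python">
```python
# Percentages transcribed from Table 1: (active vs. active, active vs. random)
TABLE_1_PCT = {
    "MACCS": {"Easy": (16.0, 0.005), "Intermediate": (5.0, 0.01), "Difficult": (1.2, 0.01)},
    "ECFP4": {"Easy": (38.2, 0.03), "Intermediate": (10.9, 0.04), "Difficult": (4.1, 0.05)},
}


def likelihood_ratio(active_pct, random_pct):
    """How many times more often active pairs reach the threshold than active vs. random pairs."""
    return active_pct / random_pct


ratios = {
    fingerprint: {cat: likelihood_ratio(*pcts) for cat, pcts in cats.items()}
    for fingerprint, cats in TABLE_1_PCT.items()
}

for fingerprint, cats in ratios.items():
    for cat, ratio in cats.items():
        print(f"{fingerprint} {cat}: about {ratio:.0f}-fold")
```
                </code>
                <p>For both fingerprints, the ratio decreases from easy to difficult activity classes. A large ratio alone does not make the value a usable search threshold, because the number of active vs. random comparisons in a database search far exceeds the number of active vs. active comparisons.</p>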
                <p>The percentages of compounds from different categories falling into activity-relevant similarity ranges reported in 
                    <xref ref-type="table" rid="T1">Table 1</xref> also helped to rationalize the relative performance of fingerprints in benchmark calculations. Such retrospective calculations typically focus on easy or intermediate activity classes (otherwise, mostly &#x201c;negative&#x201d; results would be obtained). In benchmark settings, ECFPs are often superior to MACCS and other standard fingerprints. If compound recall rates are determined, which is usually the case, but only possible in retrospective applications, ECFP4 is clearly favored over MACCS, given the much larger percentage of true-positive detections according to 
                    <xref ref-type="table" rid="T1">Table 1</xref>. However, it should also be noted that even for easy activity classes, ECFP4 detected fewer than 40% of active compounds at activity-relevant similarity values. Thus, the false-negative rate was high, even more so for MACCS, indicating that the sensitivity of these fingerprints to active compounds in a similarity search scenario is low. This again reflects the fact that structure-activity information is not explicitly used in a fingerprint search.</p>
                <p>In a prospective similarity search application, when active compounds are sparse and unknown and source databases are large, it would be more difficult to distinguish between the performance of ECFP4 and MACCS, as discussed above. In that case, the ability to identify novel hits will depend strongly on the specific features of active compounds and the capacity of different fingerprints to capture them.</p>
                <p>An advantage of activity-relevant similarity values, as determined herein, is that they have been derived over many different activity classes and are thus general in nature. As such, they represent a characteristic feature of a given fingerprint, although their utility for practical similarity searching is limited.</p>
            </sec>
        </sec>
        <sec sec-type="conclusions">
            <title>Conclusion</title>
            <p>In this study, we have addressed the fundamental question of how activity similarity might be related to molecular similarity calculated using fingerprints, focusing on the systematic compound comparisons involved in similarity searching. The analysis has led to the introduction of activity-relevant similarity values as a characteristic feature of fingerprints of different design, which we consider useful as likelihood estimates. For example, given our ensemble of activity classes, the likelihood that a compound comparison yielded a Tc value of at least 0.8 for MACCS or 0.3 for ECFP4 was much higher for compounds sharing the same activity than for randomly selected compounds. It was also much higher than for comparisons of active vs. randomly selected compounds.</p>
        </sec>
        <sec>
            <title>Data availability</title>
            <p>ZENODO: Activity classes from different categories, doi: 
                <ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.5281/zenodo.47315">http://dx.doi.org/10.5281/zenodo.47315</ext-link>
                <sup>
                    <xref ref-type="bibr" rid="ref-28">28</xref>
                </sup>
            </p>
        </sec>
    </body>
    <back>
        <ref-list>
            <ref id="ref-1">
                <label>1</label>
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Johnson</surname>
                            <given-names>M</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Maggiora</surname>
                            <given-names>GM</given-names>
                        </name>
					</person-group>:
                    <article-title>Concepts and applications of molecular similarity</article-title>. Wiley, <year>1990</year>.
                    <ext-link ext-link-type="uri" xlink:href="http://www.researchandmarkets.com/reports/2244491/concepts_and_applications_of_molecular_similarity.pdf">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref-2">
                <label>2</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Willett</surname>
                            <given-names>P</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Barnard</surname>
                            <given-names>JM</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Downs</surname>
                            <given-names>GM</given-names>
                        </name>
					</person-group>:
                    <article-title>Chemical similarity searching.</article-title>
                    <source>
						
                        <italic toggle="yes">J Chem Inf Comput Sci.</italic>
					</source>
                    <year>1998</year>;<volume>38</volume>(<issue>6</issue>):<fpage>983</fpage>&#x2013;<lpage>996</lpage>.
                    <pub-id pub-id-type="doi">10.1021/ci9800211</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-3">
                <label>3</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Bender</surname>
                            <given-names>A</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Glen</surname>
                            <given-names>RC</given-names>
                        </name>
					</person-group>:
                    <article-title>Molecular similarity: a key technique in molecular informatics.</article-title>
                    <source>
						
                        <italic toggle="yes">Org Biomol Chem.</italic>
					</source>
                    <year>2004</year>;<volume>2</volume>(<issue>22</issue>):<fpage>3204</fpage>&#x2013;<lpage>3218</lpage>.
                    <pub-id pub-id-type="pmid">15534697</pub-id>
                    <pub-id pub-id-type="doi">10.1039/B409813G</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-4">
                <label>4</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Maggiora</surname>
                            <given-names>G</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Vogt</surname>
                            <given-names>M</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Stumpfe</surname>
                            <given-names>D</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Molecular similarity in medicinal chemistry.</article-title>
                    <source>
						
                        <italic toggle="yes">J Med Chem.</italic>
					</source>
                    <year>2014</year>;<volume>57</volume>(<issue>8</issue>):<fpage>3186</fpage>&#x2013;<lpage>3204</lpage>.
                    <pub-id pub-id-type="pmid">24151987</pub-id>
                    <pub-id pub-id-type="doi">10.1021/jm401411z</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-5">
                <label>5</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Eckert</surname>
                            <given-names>H</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Bajorath</surname>
                            <given-names>J</given-names>
                        </name>
					</person-group>:
                    <article-title>Molecular similarity analysis in virtual screening: foundations, limitations and novel approaches.</article-title>
                    <source>
						
                        <italic toggle="yes">Drug Discov Today.</italic>
					</source>
                    <year>2007</year>;<volume>12</volume>(<issue>5&#x2013;6</issue>):<fpage>225</fpage>&#x2013;<lpage>233</lpage>.
                    <pub-id pub-id-type="pmid">17331887</pub-id>
                    <pub-id pub-id-type="doi">10.1016/j.drudis.2007.01.011</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-6">
                <label>6</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Stumpfe</surname>
                            <given-names>D</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Bajorath</surname>
                            <given-names>J</given-names>
                        </name>
					</person-group>:
                    <article-title>Similarity searching.</article-title>
                    <source>
						
                        <italic toggle="yes">Wiley Interdiscip Rev Comput Mol Sci.</italic>
					</source>
                    <year>2011</year>;<volume>1</volume>(<issue>2</issue>):<fpage>260</fpage>&#x2013;<lpage>282</lpage>.
                    <pub-id pub-id-type="doi">10.1002/wcms.23</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-7">
                <label>7</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Maggiora</surname>
                            <given-names>GM</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Shanmugasundaram</surname>
                            <given-names>V</given-names>
                        </name>
					</person-group>:
                    <article-title>Molecular similarity measures.</article-title>
                    <source>
						
                        <italic toggle="yes">Methods Mol Biol.</italic>
					</source>
                    <year>2004</year>;<volume>275</volume>:<fpage>1</fpage>&#x2013;<lpage>50</lpage>.
                    <pub-id pub-id-type="pmid">15141108</pub-id>
                    <pub-id pub-id-type="doi">10.1385/1-59259-802-1:001</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-8">
                <label>8</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Willett</surname>
                            <given-names>P</given-names>
                        </name>
					</person-group>:
                    <article-title>Similarity-based virtual screening using 2D fingerprints.</article-title>
                    <source>
						
                        <italic toggle="yes">Drug Discov Today.</italic>
					</source>
                    <year>2006</year>;<volume>11</volume>(<issue>23&#x2013;24</issue>):<fpage>1046</fpage>&#x2013;<lpage>1053</lpage>.
                    <pub-id pub-id-type="pmid">17129822</pub-id>
                    <pub-id pub-id-type="doi">10.1016/j.drudis.2006.10.005</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-9">
                <label>9</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Vogt</surname>
                            <given-names>M</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Stumpfe</surname>
                            <given-names>D</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Geppert</surname>
                            <given-names>H</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Scaffold hopping using two-dimensional fingerprints: true potential, black magic, or a hopeless endeavor? Guidelines for virtual screening.</article-title>
                    <source>
						
                        <italic toggle="yes">J Med Chem.</italic>
					</source>
                    <year>2010</year>;<volume>53</volume>(<issue>15</issue>):<fpage>5707</fpage>&#x2013;<lpage>5715</lpage>.
                    <pub-id pub-id-type="pmid">20684607</pub-id>
                    <pub-id pub-id-type="doi">10.1021/jm100492z</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-10">
                <label>10</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Gardiner</surname>
                            <given-names>EJ</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Holliday</surname>
                            <given-names>JD</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>O'Dowd</surname>
                            <given-names>C</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Effectiveness of 2D fingerprints for scaffold hopping.</article-title>
                    <source>
						
                        <italic toggle="yes">Future Med Chem.</italic>
					</source>
                    <year>2011</year>;<volume>3</volume>(<issue>4</issue>):<fpage>405</fpage>&#x2013;<lpage>414</lpage>.
                    <pub-id pub-id-type="pmid">21452977</pub-id>
                    <pub-id pub-id-type="doi">10.4155/fmc.11.4</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-11">
                <label>11</label>
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Willett</surname>
                            <given-names>P</given-names>
                        </name>
					</person-group>:
                    <article-title>Similarity and clustering in chemical information systems.</article-title> Research Studies Press, Letchworth, Hertfordshire, England, <year>1987</year>.</mixed-citation>
            </ref>
            <ref id="ref-12">
                <label>12</label>
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Brown</surname>
                            <given-names>RD</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Martin</surname>
                            <given-names>YC</given-names>
                        </name>
					</person-group>:
                    <article-title>The Information Content of 2D and 3D Structural Descriptors Relevant to Ligand-Receptor Binding.</article-title>
                    <source>
						
                        <italic toggle="yes">J Chem Inf Comput Sci.</italic>
					</source>
                    <year>1997</year>;<volume>37</volume>(<issue>1</issue>):<fpage>1</fpage>&#x2013;<lpage>9</lpage>.
                    <pub-id pub-id-type="doi">10.1021/ci960373c</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-13">
                <label>13</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Patterson</surname>
                            <given-names>DE</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Cramer</surname>
                            <given-names>RD</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Ferguson</surname>
                            <given-names>AM</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Neighborhood behavior: a useful concept for validation of &#x201c;molecular diversity&#x201d; descriptors.</article-title>
                    <source>
						
                        <italic toggle="yes">J Med Chem.</italic>
					</source>
                    <year>1996</year>;<volume>39</volume>(<issue>16</issue>):<fpage>3049</fpage>&#x2013;<lpage>3059</lpage>.
                    <pub-id pub-id-type="pmid">8759626</pub-id>
                    <pub-id pub-id-type="doi">10.1021/jm960290n</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-14">
                <label>14</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Durant</surname>
                            <given-names>JL</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Leland</surname>
                            <given-names>BA</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Henry</surname>
                            <given-names>DR</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Reoptimization of MDL keys for use in drug discovery.</article-title>
                    <source>
						
                        <italic toggle="yes">J Chem Inf Comput Sci.</italic>
					</source>
                    <year>2002</year>;<volume>42</volume>(<issue>6</issue>):<fpage>1273</fpage>&#x2013;<lpage>1280</lpage>.
                    <pub-id pub-id-type="pmid">12444722</pub-id>
                    <pub-id pub-id-type="doi">10.1021/ci010132r</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-15">
                <label>15</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Martin</surname>
                            <given-names>YC</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Kofron</surname>
                            <given-names>JL</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Traphagen</surname>
                            <given-names>LM</given-names>
                        </name>
					</person-group>:
                    <article-title>Do structurally similar molecules have similar biological activity?</article-title>
                    <source>
						
                        <italic toggle="yes">J Med Chem.</italic>
					</source>
                    <year>2002</year>;<volume>45</volume>(<issue>19</issue>):<fpage>4350</fpage>&#x2013;<lpage>4358</lpage>.
                    <pub-id pub-id-type="pmid">12213076</pub-id>
                    <pub-id pub-id-type="doi">10.1021/jm020155c</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-16">
                <label>16</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Vogt</surname>
                            <given-names>M</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Bajorath</surname>
                            <given-names>J</given-names>
                        </name>
					</person-group>:
                    <article-title>Introduction of a generally applicable method to estimate retrieval of active molecules for similarity searching using fingerprints.</article-title>
                    <source>
						
                        <italic toggle="yes">ChemMedChem.</italic>
					</source>
                    <year>2007</year>;<volume>2</volume>(<issue>9</issue>):<fpage>1311</fpage>&#x2013;<lpage>1320</lpage>.
                    <pub-id pub-id-type="pmid">17562536</pub-id>
                    <pub-id pub-id-type="doi">10.1002/cmdc.200700090</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-17">
                <label>17</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Vogt</surname>
                            <given-names>M</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Bajorath</surname>
                            <given-names>J</given-names>
                        </name>
					</person-group>:
                    <article-title>Introduction of the conditional correlated Bernoulli model of similarity value distributions and its application to the prospective prediction of fingerprint search performance.</article-title>
                    <source>
						
                        <italic toggle="yes">J Chem Inf Model.</italic>
					</source>
                    <year>2011</year>;<volume>51</volume>(<issue>10</issue>):<fpage>2496</fpage>&#x2013;<lpage>2506</lpage>.
                    <pub-id pub-id-type="pmid">21892818</pub-id>
                    <pub-id pub-id-type="doi">10.1021/ci2003472</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-18">
                <label>18</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Muchmore</surname>
                            <given-names>SW</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Debe</surname>
                            <given-names>DA</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Metz</surname>
                            <given-names>JT</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Application of belief theory to similarity data fusion for use in analog searching and lead hopping.</article-title>
                    <source>
						
                        <italic toggle="yes">J Chem Inf Model.</italic>
					</source>
                    <year>2008</year>;<volume>48</volume>(<issue>5</issue>):<fpage>941</fpage>&#x2013;<lpage>948</lpage>.
                    <pub-id pub-id-type="pmid">18416545</pub-id>
                    <pub-id pub-id-type="doi">10.1021/ci7004498</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-19">
                <label>19</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Rogers</surname>
                            <given-names>D</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Hahn</surname>
                            <given-names>M</given-names>
                        </name>
					</person-group>:
                    <article-title>Extended-connectivity fingerprints.</article-title>
                    <source>
						
                        <italic toggle="yes">J Chem Inf Model.</italic>
					</source>
                    <year>2010</year>;<volume>50</volume>(<issue>5</issue>):<fpage>742</fpage>&#x2013;<lpage>754</lpage>.
                    <pub-id pub-id-type="pmid">20426451</pub-id>
                    <pub-id pub-id-type="doi">10.1021/ci100050t</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-20">
                <label>20</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Petrone</surname>
                            <given-names>PM</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Simms</surname>
                            <given-names>B</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Nigsch</surname>
                            <given-names>F</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Rethinking molecular similarity: comparing compounds on the basis of biological activity.</article-title>
                    <source>
						
                        <italic toggle="yes">ACS Chem Biol.</italic>
					</source>
                    <year>2012</year>;<volume>7</volume>(<issue>8</issue>):<fpage>1399</fpage>&#x2013;<lpage>1409</lpage>.
                    <pub-id pub-id-type="pmid">22594495</pub-id>
                    <pub-id pub-id-type="doi">10.1021/cb3001028</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-21">
                <label>21</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Wassermann</surname>
                            <given-names>AM</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Lounkine</surname>
                            <given-names>E</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Glick</surname>
                            <given-names>M</given-names>
                        </name>
					</person-group>:
                    <article-title>Bioturbo similarity searching: combining chemical and biological similarity to discover structurally diverse bioactive molecules.</article-title>
                    <source>
						
                        <italic toggle="yes">J Chem Inf Model.</italic>
					</source>
                    <year>2013</year>;<volume>53</volume>(<issue>3</issue>):<fpage>692</fpage>&#x2013;<lpage>703</lpage>.
                    <pub-id pub-id-type="pmid">23461561</pub-id>
                    <pub-id pub-id-type="doi">10.1021/ci300607r</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-22">
                <label>22</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Bento</surname>
                            <given-names>AP</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Gaulton</surname>
                            <given-names>A</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Hersey</surname>
                            <given-names>A</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>The ChEMBL bioactivity database: an update.</article-title>
                    <source>
						
                        <italic toggle="yes">Nucleic Acids Res.</italic>
					</source>
                    <year>2014</year>;<volume>42</volume>(<issue>Database issue</issue>):<fpage>D1083</fpage>&#x2013;<lpage>D1090</lpage>.
                    <pub-id pub-id-type="pmid">24214965</pub-id>
                    <pub-id pub-id-type="doi">10.1093/nar/gkt1031</pub-id>
                    <pub-id pub-id-type="pmcid">3965067</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-23">
                <label>23</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Heikamp</surname>
                            <given-names>K</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Bajorath</surname>
                            <given-names>J</given-names>
                        </name>
					</person-group>:
                    <article-title>Large-scale similarity search profiling of ChEMBL compound data sets.</article-title>
                    <source>
						
                        <italic toggle="yes">J Chem Inf Model.</italic>
					</source>
                    <year>2011</year>;<volume>51</volume>(<issue>8</issue>):<fpage>1831</fpage>&#x2013;<lpage>1839</lpage>.
                    <pub-id pub-id-type="pmid">21728295</pub-id>
                    <pub-id pub-id-type="doi">10.1021/ci200199u</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-24">
                <label>24</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Hu</surname>
                            <given-names>Y</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Bajorath</surname>
                            <given-names>J</given-names>
                        </name>
					</person-group>:
                    <article-title>Influence of search parameters and criteria on compound selection, promiscuity, and pan assay interference characteristics.</article-title>
                    <source>
						
                        <italic toggle="yes">J Chem Inf Model.</italic>
					</source>
                    <year>2014</year>;<volume>54</volume>(<issue>11</issue>):<fpage>3056</fpage>&#x2013;<lpage>3066</lpage>.
                    <pub-id pub-id-type="pmid">25329977</pub-id>
                    <pub-id pub-id-type="doi">10.1021/ci5005509</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-25">
                <label>25</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Sterling</surname>
                            <given-names>T</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Irwin</surname>
                            <given-names>JJ</given-names>
                        </name>
					</person-group>:
                    <article-title>ZINC 15 &#x2013; ligand discovery for everyone.</article-title>
                    <source>
						
                        <italic toggle="yes">J Chem Inf Model.</italic>
					</source>
                    <year>2015</year>;<volume>55</volume>(<issue>11</issue>):<fpage>2324</fpage>&#x2013;<lpage>2337</lpage>.
                    <pub-id pub-id-type="pmid">26479676</pub-id>
                    <pub-id pub-id-type="doi">10.1021/acs.jcim.5b00559</pub-id>
                    <pub-id pub-id-type="pmcid">4658288</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-26">
                <label>26</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Wang</surname>
                            <given-names>Y</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Eckert</surname>
                            <given-names>H</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Bajorath</surname>
                            <given-names>J</given-names>
                        </name>
					</person-group>:
                    <article-title>Apparent asymmetry in fingerprint similarity searching is a direct consequence of differences in bit densities and molecular size.</article-title>
                    <source>
						
                        <italic toggle="yes">ChemMedChem.</italic>
					</source>
                    <year>2007</year>;<volume>2</volume>(<issue>7</issue>):<fpage>1037</fpage>&#x2013;<lpage>1042</lpage>.
                    <pub-id pub-id-type="pmid">17506042</pub-id>
                    <pub-id pub-id-type="doi">10.1002/cmdc.200700050</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref-27">
                <label>27</label>
                <mixed-citation publication-type="book">
                    <article-title>RDKit: Cheminformatics and Machine Learning Software</article-title>.<year>2013</year>.
                    <ext-link ext-link-type="uri" xlink:href="http://www.rdkit.org">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref-28">
                <label>28</label>
                <mixed-citation publication-type="data">
                    <person-group person-group-type="author">
						
                        <name name-style="western">
                            <surname>Jasial</surname>
                            <given-names>S</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Hu</surname>
                            <given-names>Y</given-names>
                        </name>
						
                        <name name-style="western">
                            <surname>Vogt</surname>
                            <given-names>M</given-names>
                        </name>
						
                        <etal/>
					</person-group>:
                    <article-title>Activity classes from different categories.</article-title>
                    <source>
						
                        <italic toggle="yes">ZENODO.</italic>
					</source>
                    <year>2016</year>.
                    <ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.5281/zenodo.47315">Data Source</ext-link>
                </mixed-citation>
            </ref>
        </ref-list>
    </back>
    <sub-article article-type="reviewer-report" id="report13264">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.8986.r13264</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Warr</surname>
                        <given-names>Wendy</given-names>
                    </name>
                    <xref ref-type="aff" rid="r13264a1">1</xref>
                    <role>Referee</role>
                </contrib>
                <aff id="r13264a1">
                    <label>1</label>Wendy Warr &amp; Associates, Holmes Chapel, UK</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>12</day>
                <month>4</month>
                <year>2016</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2016 Warr W</copyright-statement>
                <copyright-year>2016</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport13264" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.8357.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>This is a short but interesting paper and is extremely well written. I use the adjective &#x201c;short&#x201d; because the novel results occupy fewer than eight pages, if the figures and tables are ignored. Nevertheless, the results are interesting and well worth publishing because they address a significant problem. The issue in question is well explained on pages 3-5. The literature background appears to cover all relevant research, but I suggest that two of the references be changed. I would replace the ACS meeting abstract cited as reference 12 with Brown and Martin, 1997
                <sup>
                    <xref ref-type="bibr" rid="rep-ref-13264-1">1</xref>
                </sup>. That paper does not mention &#x201c;85%&#x201d; specifically, but it does discuss the cutoff threshold in detail. Reference 14 is useless: a researcher new to the field of similarity could not locate MACCS keys by seeking a non-existent company which had an office in San Leandro in 2005. I would prefer to see &#x201c;activity classes&#x201d; written out in full: it is not a long-winded term, and ACs looks a bit like a typo for ACS.</p>
            <p>On page 8 the sentence &#x201c;Thus, similarity searching can be mimicked by systematically comparing compounds having the same activity and active compounds to random database compounds&#x201d; is not clear enough. Further down the page it is made clear exactly what is compared with what, but that is too late. I also did not fully understand the statement &#x201c;Comparison of random database compounds is not carried out during similarity searching. However, the similarity value distribution resulting from the latter comparison can be monitored as an additional reference.&#x201d; Maybe: &#x201c;Comparison of random database compounds to random database compounds is not carried out during traditional similarity searching, but the similarity value distribution resulting from this comparison can be monitored as an additional reference in mimicked similarity searching&#x201d;? (Note also that it is better not to start a sentence with &#x201c;However&#x201d;.)</p>
            <p>On page 13 there should be a heading saying &#x201c;Conclusion&#x201d; before the sentence that begins &#x201c;In conclusion&#x201d;.</p>
            <p>The sentence beginning &#x201c;In conclusion, in this study, we have addressed the issue how molecular similarity calculated using fingerprints and activity similarity might be related to each other from a fundamental point of view&#x2026;&#x201d; is ambiguous, e.g., &#x201c;&#x2026;molecular similarity calculated using both fingerprints and activity similarity, might be related to what?&#x201d; Admittedly there is no comma, but it would be clearer to say &#x201c;In conclusion, in this study, we have addressed the issue of how, from a fundamental point of view, activity similarity might be related to molecular similarity calculated using fingerprints&#x2026;&#x201d; At the very end the phrase &#x201c;&#x2026;was hundreds of times higher for compounds sharing the same activity than randomly selected or active vs. random compounds.&#x201d; is not clear enough.</p>
            <p>In short, I like the science, and I think it should be indexed, but I would like to see a few minor improvements to the text as detailed above.</p>
            <p>Reviewer Expertise:</p>
            <p>NA</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
        <back>
            <ref-list>
                <title>References</title>
                <ref id="rep-ref-13264-1">
                    <label>1</label>
                    <mixed-citation publication-type="journal">
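                        <person-group person-group-type="author">
                            <name name-style="western">
                                <surname>Brown</surname>
                                <given-names>RD</given-names>
                            </name>
                            <name name-style="western">
                                <surname>Martin</surname>
                                <given-names>YC</given-names>
                            </name>
                        </person-group>: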
                        <article-title>The Information Content of 2D and 3D Structural Descriptors Relevant to Ligand-Receptor Binding</article-title>.
                        <source>
                            <italic>Journal of Chemical Information and Computer Sciences</italic>
                        </source>.<year>1997</year>;<volume>37</volume>(<issue>1</issue>):<fpage>1</fpage>&#x2013;<lpage>9</lpage>.
                        <pub-id pub-id-type="doi">10.1021/ci960373c</pub-id>
                    </mixed-citation>
                </ref>
            </ref-list>
        </back>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report13265">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.8986.r13265</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Ertl</surname>
                        <given-names>Peter</given-names>
                    </name>
                    <xref ref-type="aff" rid="r13265a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0001-6496-4448</uri>
                </contrib>
                <aff id="r13265a1">
                    <label>1</label>Novartis Institutes for Biomedical Research, CH-4056 Basel, Switzerland</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>12</day>
                <month>4</month>
                <year>2016</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2016 Ertl P</copyright-statement>
                <copyright-year>2016</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport13265" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.8357.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>An interesting manuscript focusing on a relationship between molecule similarity and biological activity, one of the most important (and still not fully solved) problems of applied cheminformatics. The topic is therefore relevant to drug design.</p>
            <p>The question the authors are trying to answer is the significance of similarity thresholds when using MACCS and ECFP4 fingerprints and their implications for virtual screening, where one tries to identify a small number of active molecules among a large number of inactives.</p>
            <p>There have been several studies focusing on the same question (the influence of similarity thresholds on discriminating active from inactive molecules). Although such studies are mentioned in the literature overview, it would be interesting to compare their conclusions directly with those of the present paper in the &#x201c;Implications for similarity searching&#x201d; section.</p>
            <p>The information content of the MACCS keys and that of the ECFP fingerprints are vastly different. The MACCS keys are, to my knowledge, no longer used in a production set-up as molecular descriptors for discriminating between actives and inactives. It would be interesting to focus on additional, more relevant structure descriptors, for example Daylight-like linear fingerprints or topological torsions. I suggest this as a topic for a follow-up study.</p>
            <p>The authors should mention which software they used for the calculation of fingerprints. Did they use PipelinePilot, open source tools or their own software? Results generated by different software tools may in some cases differ considerably, owing to differences in molecule normalization, treatment of aromaticity, tautomers, etc.</p>
            <p>I also recommend mentioning a classic paper in this area by Brown and Martin
                <sup>
                    <xref ref-type="bibr" rid="rep-ref-13265-1">1</xref>
                </sup>.</p>
            <p>Reviewer Expertise:</p>
            <p>NA</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
        <back>
            <ref-list>
                <title>References</title>
                <ref id="rep-ref-13265-1">
                    <label>1</label>
                    <mixed-citation publication-type="journal">
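                        <person-group person-group-type="author">
                            <name name-style="western">
                                <surname>Brown</surname>
                                <given-names>RD</given-names>
                            </name>
                            <name name-style="western">
                                <surname>Martin</surname>
                                <given-names>YC</given-names>
                            </name>
                        </person-group>: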
                        <article-title>The Information Content of 2D and 3D Structural Descriptors Relevant to Ligand-Receptor Binding</article-title>.
                        <source>
                            <italic>Journal of Chemical Information and Computer Sciences</italic>
                        </source>.<year>1997</year>;<volume>37</volume>(<issue>1</issue>):<fpage>1</fpage>&#x2013;<lpage>9</lpage>.
                        <pub-id pub-id-type="doi">10.1021/ci960373c</pub-id>
                    </mixed-citation>
                </ref>
            </ref-list>
        </back>
        <sub-article article-type="response" id="comment1918-13265">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Bajorath</surname>
                            <given-names>J&#x00fc;rgen</given-names>
                        </name>
                        <aff>University of Bonn, Germany</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>13</day>
                    <month>4</month>
                    <year>2016</year>
                </pub-date>
            </front-stub>
            <body>
                <p>Thank you for suggesting the follow-up investigation. We note that the results of the two most relevant investigations were discussed in the introduction.</p>
            </body>
        </sub-article>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report13263">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.8986.r13263</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>McGaughey</surname>
                        <given-names>Georgia B.</given-names>
                    </name>
                    <xref ref-type="aff" rid="r13263a1">1</xref>
                    <role>Referee</role>
                </contrib>
                <aff id="r13263a1">
                    <label>1</label>Vertex Pharmaceuticals, Boston, USA</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>7</day>
                <month>4</month>
                <year>2016</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2016 McGaughey GB</copyright-statement>
                <copyright-year>2016</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport13263" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.8357.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>The authors describe in detail the similarity thresholds separating actives from inactives for the MACCS and ECFP4 2D fingerprints, using data derived from ChEMBL. The paper is well written and should be indexed. A few suggestions are made, however:</p>
            <p>
                <list list-type="order">
                    <list-item>
                        <p>Given that this is a chemistry paper, perhaps include a few examples of chemical compounds showing the threshold for an active in Tc and ECFP4 space. How &#x201c;low&#x201d; can one go and still have an active? This would bolster the case for chemoinformatic approaches over the medicinal chemists&#x2019; practice of &#x201c;eyeing&#x201d; similarity.</p>
                    </list-item>
                    <list-item>
                        <p>You have numerous references and mention belief theory in passing. I couldn&#x2019;t help but think of Muchmore&#x2019;s paper 
                            <sup>
                                <xref ref-type="bibr" rid="rep-ref-13263-1">1</xref>
                            </sup> and think you might want to include this paper as well, especially given that he uses MACCS and ECFP4.</p>
                    </list-item>
                    <list-item>
                        <p>You make no mention of 3D similarity methods, which I recommend you include, even if only in passing (i.e., a reference). I have one in JCIM from 2006 comparing 2D to 3D (but it&#x2019;s not pairwise).</p>
                    </list-item>
                    <list-item>
                        <p>What about the overlap between the methods in terms of actives? The result raises a question for the reader: do I now compute both and take an average (if I can only screen X%)?</p>
                    </list-item>
                    <list-item>
                        <p>The threshold of 10 &#x00b5;M for an active seems 
                            <italic>extremely</italic> generous. In practice, I would typically consider this an inactive compound, especially if the screen was an enzymatic screen. How would the results differ if you used a different activity threshold?</p>
                    </list-item>
                    <list-item>
                        <p>Besides molecular weight, was any consideration given to the number of PAINS or REOS flags? With this question, I&#x2019;m trying to understand whether &#x201c;actives&#x201d; were easier to discriminate when compounds were merely promiscuous, and whether that mattered across the easy-intermediate-hard ACs (I have to agree with Wendy Warr on this: please spell out AC; I kept thinking activity cliffs).</p>
                    </list-item>
                </list>
            </p>
            <p>Reviewer Expertise:</p>
            <p>NA</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
        <back>
            <ref-list>
                <title>References</title>
                <ref id="rep-ref-13263-1">
                    <label>1</label>
                    <mixed-citation publication-type="journal">
                        <person-group person-group-type="author"/>:
                        <article-title>Application of belief theory to similarity data fusion for use in analog searching and lead hopping.</article-title>
                        <source>
                            <italic>J Chem Inf Model</italic>
                        </source>.<year>2008</year>;<volume>48</volume>(<issue>5</issue>):
                        <elocation-id>10.1021/ci7004498</elocation-id>
                        <fpage>941</fpage>-<lpage>8</lpage>
                        <pub-id pub-id-type="pmid">18416545</pub-id>
                        <pub-id pub-id-type="doi">10.1021/ci7004498</pub-id>
                    </mixed-citation>
                </ref>
            </ref-list>
        </back>
        <sub-article article-type="response" id="comment1919-13263">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Bajorath</surname>
                            <given-names>J&#x00fc;rgen</given-names>
                        </name>
                        <aff>University of Bonn, Germany</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>13</day>
                    <month>4</month>
                    <year>2016</year>
                </pub-date>
            </front-stub>
            <body>
                <p>It is noted that the Muchmore 
                    <italic>et al</italic>. reference was already cited. In addition, 3D similarity measures were not considered herein. By design, the study focused neither on compound overlap for alternative fingerprint representations nor on general compound liabilities.</p>
            </body>
        </sub-article>
    </sub-article>
</article>
