Follow up: Compound data sets and software tools for chemoinformatics and medicinal chemistry applications: update and data transfer

In 2012, we reported 30 compound data sets and/or programs developed in our laboratory in a data article and made them freely available to the scientific community to support chemoinformatics and computational medicinal chemistry applications. These data sets and computational tools were provided for download from our website. Since publication of this data article, we have generated 13 new data sets that further extend our collection of publicly available data and tools. Due to changes in web servers and website architectures, data accessibility has recently been limited at times. Therefore, we have also transferred our data sets and tools to a public repository to ensure full and stable accessibility. To aid in data selection, we have classified the data sets according to scientific subject areas. Herein, we describe the new data sets, introduce the data organization scheme, summarize the database content and provide detailed access information for ZENODO (doi: 10.5281/zenodo.8451 and doi: 10.5281/zenodo.8455).


Introduction
The compound data sets reported in our original article 1 and the new data sets presented herein have resulted from research in chemoinformatics and medicinal chemistry and have mostly been generated from public domain repositories of compound structures and activity data. The software tools made publicly available have also been developed in our laboratory 1. Data sets reported in the scientific literature in the context of computational method development and evaluation are often not publicly available, which limits the reproducibility of computational investigations and comparisons of different computational methods. We believe that it is important to provide such data to the scientific community to further improve the transparency and credibility of computational studies and support method development. In addition to the data sets designed for the development and evaluation of computational methods, we also make available data sets that were generated as a resource and knowledge base for medicinal chemistry applications. Our data sets and tools are provided via the ZENODO platform (https://zenodo.org/) to ensure easy and stable access.

Materials and methods
The data sets reported herein were predominantly generated from ChEMBL 2,3, BindingDB 4 and PubChem 5 (a few exceptions are specified in the original data article 1). Compound structures are represented as SMILES 6 strings or SD files 7. Activity information and other (data set-dependent) annotations are provided in the individual data files. For software tools (written in different languages), the source code is also made available. Table 1 provides the updated list and classification of all freely available data sets and programs. Entries are organized according to the following scientific subject areas: data sets for structure-activity relationship (SAR) and structure-selectivity relationship (SSR) analysis, SAR visualization (SAR_VZ), and virtual screening via similarity searching or machine learning (VS_ML). The programs are provided separately (PROG). Data sets and programs are contained in separate ZENODO deposition sets, each with a unique reference. Three matched molecular pair (MMP)-based data sets also included in our update have recently been reported and described in detail 8. Entries 1-30 in Table 1 represent the data sets and programs that we initially provided via our website 1 and entries 31-43 represent new data sets. The new data sets are described in the following.

Data description
Entry 31
50 compound activity classes (AC) are prioritized for the evaluation of scaffold hopping potential in ligand-based virtual screening 38. These AC contain the largest proportion of scaffold pairs with the largest chemical inter-scaffold distances 38 that can be derived from current bioactive compounds and hence present challenging test cases for scaffold hopping analysis.

Entry 32
596 SAR transfer series with regular potency progression (SAR-TS-RP) are extracted from 61 AC 39 . Each SAR-TS-RP represents two compound series with different core structures and pairwise corresponding substitutions that yield comparable potency progression against a given target. These series provide a knowledge base for the analysis and prediction of SAR transfer events.

Entry 33
Four sets of molecular scaffolds (with each scaffold representing more than ten compounds) are provided that are active against a single target (ST), multiple targets from the same family (SF), or multiple targets from different families (MF) 40. Data sets are separately assembled for different types of potency measurements (i.e., Ki and IC50 values) and provide a resource of scaffolds representing compounds with varying degrees of target promiscuity.
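
The ST/SF/MF distinction described for Entry 33 can be illustrated with a minimal Python sketch. It is not the procedure used to build the deposited data sets. The function name, the input format (pairs of target and target family names) and the example target names are assumptions made for illustration only.

```python
def classify_scaffold(target_annotations, n_compounds, min_compounds=11):
    """Assign a scaffold to the ST, SF, or MF category.

    target_annotations: iterable of (target, family) pairs (hypothetical format).
    n_compounds: number of compounds the scaffold represents.
    Returns None if the scaffold represents ten or fewer compounds.
    """
    if n_compounds < min_compounds:
        return None  # "more than ten compounds" is required
    pairs = set(target_annotations)
    targets = {target for target, _ in pairs}
    families = {family for _, family in pairs}
    if not targets:
        raise ValueError("at least one target annotation is required")
    if len(targets) == 1:
        return "ST"  # single target
    if len(families) == 1:
        return "SF"  # multiple targets, same family
    return "MF"      # multiple targets, different families


# Hypothetical annotations, for illustration only
example = [("target_A1", "family_A"), ("target_A2", "family_A")]
print(classify_scaffold(example, n_compounds=15))
```

In this sketch, a scaffold annotated with targets from two or more families is assigned to MF regardless of how many targets fall into each family; whether the original analysis applied further conditions is not specified here.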

Entry 34
Two multi-target compound data sets consist of confirmed screening hits 41 . Each set contains compounds with single-, dual-, and triple-target activity, or no activity. These data provide test cases for machine learning or other approaches to differentiate between compounds with overlapping yet distinct activity profiles.

Entry 35
Four multi-target compound data sets are provided 42 . Each set contains compounds tested in three different assays. Compounds are organized into eight different subsets according to their activity profiles, i.e., single-, dual-, and triple-target activity, or no activity. In addition, three multi-mechanism compound sets are designed 42 . In the latter case, compounds are organized into four subsets according to their mechanism-of-action. These data sets also represent test cases for machine learning to distinguish compounds with different activity profiles or mechanisms.
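
The count of eight subsets follows from three assays with two possible outcomes each (2^3 = 8): one subset without activity, three with single-target, three with dual-target and one with triple-target activity. A minimal Python sketch of this enumeration is given below; the assay labels "A", "B" and "C" and the function name are placeholders and do not correspond to names in the deposited files.

```python
from itertools import product


def activity_profiles(assays):
    """Enumerate all active/inactive combinations over the given assays.

    Returns a dict mapping the number of assays with activity to the list
    of corresponding profiles (tuples of assay labels with activity).
    """
    profiles = {}
    for flags in product((False, True), repeat=len(assays)):
        active = tuple(a for a, is_active in zip(assays, flags) if is_active)
        profiles.setdefault(len(active), []).append(active)
    return profiles


# Placeholder assay labels, for illustration only
profiles = activity_profiles(["A", "B", "C"])
for n_active in sorted(profiles):
    print(n_active, profiles[n_active])
```
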

Entry 36
2337 non-redundant compound series matrices (CSMs) are generated covering compounds active against a wide spectrum of targets 43. Each matrix contains at least two analogous matching molecular series (MMS) with structurally related yet distinct cores. A matrix consists of known active compounds and structurally related virtual compounds and hence provides suggestions for compound design.

Entry 37
128 target-based data sets are assembled that consist of at least 100 compounds with precisely specified equilibrium constants (Ki values) below 1 µM for human targets 44. These high-confidence activity data sets provide a sound basis for SAR exploration.

Entry 39
221 scaffolds are identified that only occur in approved drugs but are not found in currently available bioactive compounds 46. Accordingly, these scaffolds have been termed drug-unique scaffolds.
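
The selection criteria stated for Entry 37 (human targets, precisely specified Ki values below 1 µM, at least 100 compounds per target) could be applied along the following lines. This is a minimal Python sketch under stated assumptions, not the original workflow: the record fields (`target`, `organism`, `relation`, `ki_nm`, `smiles`) are hypothetical, and "precisely specified" is interpreted here as a measurement reported with the relation "=".

```python
from collections import defaultdict


def select_target_sets(records, max_ki_nm=1000.0, min_compounds=100):
    """Group qualifying compounds by target and keep sufficiently large sets.

    records: iterable of dicts with hypothetical keys
             'target', 'organism', 'relation', 'ki_nm', 'smiles'.
    max_ki_nm: Ki threshold in nM (1 µM = 1000 nM); values must be below it.
    min_compounds: minimum number of distinct compounds per target.
    """
    by_target = defaultdict(set)
    for rec in records:
        if rec["organism"] != "Homo sapiens":
            continue  # human targets only
        if rec["relation"] != "=":
            continue  # assumed meaning of "precisely specified"
        if rec["ki_nm"] >= max_ki_nm:
            continue  # Ki must be below the threshold
        by_target[rec["target"]].add(rec["smiles"])
    return {t: cpds for t, cpds in by_target.items() if len(cpds) >= min_compounds}
```

Compounds are collected as sets of SMILES strings so that repeated measurements of the same compound are counted only once per target.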

Summary
Herein we have provided an updated release of data sets and programs for chemoinformatics and medicinal chemistry that we make freely available. In total, 13 new data sets are introduced. Transferring all data entries in an organized form to the ZENODO platform makes them easily accessible. We hope that our current release will be of interest and helpful to many investigators in academia and the pharmaceutical industry.

Author contributions
JB designed the study, YH collected and organized the data, YH and JB wrote the manuscript.

Competing interests
No competing interests were declared.

Grant information
The author(s) declared that no grants were involved in supporting this work.
the progression of compound activity in target space and to extract