Research Article

Machine learning meets pKa

[version 1; peer review: 2 approved]
PUBLISHED 13 Feb 2020
Marcel Baltruschat and Paul Czodrowski

This article is included in the Cheminformatics gateway, the Artificial Intelligence and Machine Learning gateway, the Python collection, and the Mathematical, Physical, and Computational Sciences collection.

Abstract

We present a small-molecule pKa prediction tool written entirely in Python. It predicts the macroscopic pKa value and is trained on a literature compilation of monoprotic compounds. Different machine learning models were tested, and random forest performed best in a five-fold cross-validation (mean absolute error=0.682, root mean squared error=1.032, correlation coefficient r2=0.82). We tested our model on two external validation sets, on which it performs comparably to Marvin and better than a recently published open-source model. Our Python tool and all data are freely available at https://github.com/czodrowskilab/Machine-learning-meets-pKa.

Keywords

machine learning, pKa value, protonation, dissociation

Introduction

The acid-base dissociation constant (pKa) of a drug has a far-reaching influence on pharmacokinetics, as it alters the solubility, membrane permeability and protein binding affinity of the drug. Several publications summarize these findings in a very comprehensive manner1-7. An accurate estimation of pKa values is therefore of utmost importance for successful drug design. Several (commercial and non-commercial) tools and approaches for small-molecule pKa prediction are available: MoKa8 uses molecular interaction fields, whereas ACD/Labs Percepta Classic9, Marvin10 and Epik11 make use of the Hammett-Taft equation. By means of Jaguar12, a quantum-mechanical approach to pKa prediction becomes possible. Recently, the use of neural nets for pKa prediction became prominent13-15. In particular, the publication by Williams et al.15 makes use of a publicly available data set provided by the application DataWarrior16 and provides a freely available pKa prediction tool called OPERA.

As this article is part of a Python collection issue, we provide a pKa prediction method written entirely in Python17 and make it available open source (including all data). Our tool computes the macroscopic pKa value for a monoprotic compound. Our model differentiates between a base and an acid solely on the basis of the predicted pKa value; i.e., we do not offer separate models for acids and bases. In addition to pKa data from DataWarrior16, we also employ pKa data from ChEMBL18. As external validation sets, we use compound data provided by Novartis19 and a manually curated data set compiled from the literature20-24; neither is part of the training data.

Methods

Different experimental methods

One crucial point in the field of pKa measurements (and their usage for pKa predictions) is the variety of experimental methods25,26. Based on the Novartis set, the correlation between capillary electrophoresis and potentiometric measurements (for 15 data points) is convincing enough (mean absolute error (MAE)=0.202, root mean squared error (RMSE)=0.264, correlation coefficient r2=0.981) for us to combine pKa measurements from these different experimental methods (see Figure 1).


Figure 1. Correlation of Novartis compounds measured in potentiometric and high-throughput (capillary electrophoresis) set-up.

MAE, mean absolute error; RMSE, root mean square error.

We also compared the overlap of the filtered (see next section) ChEMBL and DataWarrior data sets: 187 monoprotic molecules could be identified in both sources. Due to missing annotations, it remains unclear whether different experimental methods were used, whether multiple measurements with the same experimental method were performed, or a mixture of both. Either way, this comparison is an additional proof of concept that the ChEMBL and DataWarrior pKa data sources can be combined after careful curation. The aforementioned intersection is shown in Figure 2. For these two data sets, the correlation coefficient r2 between the annotated pKa values is 0.949, the MAE is 0.275, and the RMSE is 0.576.


Figure 2. Intersection between ChEMBL and DataWarrior data sets.

MAE, mean absolute error; RMSE, root mean square error.
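
For illustration, the agreement statistics used in this section (MAE, RMSE, r2) can be computed with NumPy and Scikit-Learn. The following minimal sketch assumes two aligned arrays containing the pKa values of the same compounds from the two sources; the function name is ours and not part of the published tool.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def agreement_stats(pka_ref, pka_other):
    # MAE, RMSE and coefficient of determination for paired pKa values
    pka_ref, pka_other = np.asarray(pka_ref), np.asarray(pka_other)
    mae = mean_absolute_error(pka_ref, pka_other)
    rmse = np.sqrt(mean_squared_error(pka_ref, pka_other))
    r2 = r2_score(pka_ref, pka_other)
    return mae, rmse, r2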

Data set preparation

A ChEMBL18 web search was performed to find all assays containing pKa measurement data. The following restrictions were made: it must be a physicochemical assay, the measurements must be taken from scientific literature, the assay must be in "small-molecule physicochemical format", and the organism taxonomy must be set to "N/A". This resulted in a list of 1140 ChEMBL assays downloaded as a CSV file. Using a Python script, the CSV file was read in and processed further, extracting all additional information required from an internally hosted copy of the ChEMBL database via SQL. Only pKa measurements (ChEMBL activities) were taken into account that were specified as exact ("standard_relation" equals "=") and for which one of the following names was specified as "standard_type": "pka", "pka value", "pka1", "pka2", "pka3" or "pka4" (case-insensitive). Measured values for which the molecular structure was not available were also discarded. The resulting 8111 pKa measured values were saved as an SDF file.
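
As an illustration of this filtering step, the following minimal sketch assumes the downloaded activities are available as a pandas DataFrame with the ChEMBL column names "standard_relation", "standard_type" and "canonical_smiles"; the file name is hypothetical.

import pandas as pd

# Accepted pKa activity types (compared case-insensitively)
VALID_TYPES = {"pka", "pka value", "pka1", "pka2", "pka3", "pka4"}

activities = pd.read_csv("chembl_pka_activities.csv")  # hypothetical export
mask = (
    (activities["standard_relation"] == "=")              # exact measurements only
    & activities["standard_type"].str.lower().isin(VALID_TYPES)
    & activities["canonical_smiles"].notna()              # structure must be present
)
pka_data = activities[mask]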

In addition, a flat file from DataWarrior16 named "pKaInWater.dwar" was used. This file was simply converted to an SDF file; it contains 7911 entries with valid molecular structures.

These data sets were concatenated for the purpose of this study and preprocessed as follows:

  • Remove all salts from molecules

  • Remove molecules containing nitro groups, boron, selenium or silicon

  • Filter by Lipinski's rule of five (one violation allowed)

  • Keep only pKa data points between 2 and 12

  • Tautomer standardization of all molecules

  • Protonate all molecules at pH 7.4

  • Keep only molecules that are monoprotic within the specified pKa range

  • Combine data points from duplicated structures while removing outliers

All steps up to filtering out pKa values outside the range of 2 to 12 were performed with Python and RDKit27. The QUACPAC28 Tautomers tool from OpenEye was used for tautomer standardization and for setting the protonation state at pH 7.4. The Marvin10 tool from ChemAxon was used to filter out multiprotic compounds: Marvin predicted the pKa values of all molecules in the range 2 to 12, and only those molecules for which Marvin predicted no more than one pKa in that range were retained.
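
The following is a minimal sketch of the RDKit-based filtering steps, not the published pipeline itself; the nitro SMARTS pattern and the exact rule-of-five bookkeeping are our assumptions.

from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski
from rdkit.Chem.SaltRemover import SaltRemover

FORBIDDEN_ELEMENTS = {"B", "Se", "Si"}
NITRO = Chem.MolFromSmarts("[N+](=O)[O-]")  # assumed nitro group pattern
remover = SaltRemover()

def passes_filters(mol, pka):
    mol = remover.StripMol(mol)                 # remove salts
    if mol.HasSubstructMatch(NITRO):
        return False
    if any(atom.GetSymbol() in FORBIDDEN_ELEMENTS for atom in mol.GetAtoms()):
        return False
    violations = sum([
        Descriptors.MolWt(mol) > 500,
        Descriptors.MolLogP(mol) > 5,
        Lipinski.NumHDonors(mol) > 5,
        Lipinski.NumHAcceptors(mol) > 10,
    ])
    if violations > 1:                          # one violation allowed
        return False
    return 2.0 <= pka <= 12.0                   # keep pKa between 2 and 12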

The removal of outliers is performed in two steps. First, before combining multiple measurements for the same molecule, all entries for which the pKa predicted by ChemAxon's Marvin differs from the experimental value by more than four log units are removed. All molecules are then combined on the basis of their isomeric SMILES. In the second step, when combining several measured values for a molecule, all values that deviate from the mean by more than two standard deviations are removed. The remaining values are arithmetically averaged.
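
A minimal sketch of this combination step, assuming a pandas DataFrame with hypothetical columns "smiles" (isomeric SMILES) and "pka"; the Marvin-based pre-filter is omitted here.

import pandas as pd

def combine_duplicates(df):
    # Average repeated measurements per isomeric SMILES, dropping outliers
    def robust_mean(values):
        if len(values) > 1:
            # remove values deviating from the mean by more than two std devs
            values = values[(values - values.mean()).abs() <= 2 * values.std()]
        return values.mean()
    return df.groupby("smiles", as_index=False)["pka"].agg(robust_mean)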

In the end, 5994 unique monoprotic molecules with experimental pKa values remained. The distribution of pKa values is given in Figure 3. The same preprocessing steps were also performed on an external test data set provided to us by Novartis19 (280 molecules) and on a manually curated set (123 molecules) compiled from the literature20-24.


Figure 3. Distribution of the individual pKa values.

Learning

First, to simplify cross-validation, a class "CVRegressor" was defined, which can serve as a wrapper for any regressor implementing the Scikit-Learn29 interface. This class simplifies cross-validation itself as well as training and prediction with the cross-validated model. Next, 196 of the 200 available RDKit descriptors ("MaxPartialCharge", "MinPartialCharge", "MaxAbsPartialCharge" and "MinAbsPartialCharge" were not used because they are computed as "NaN" for many molecules) and a 4096-bit Morgan feature fingerprint with radius 3 were calculated for the training data set.

Random forest (RF), support vector regression (SVR, two configurations), multilayer perceptron (MLP, three configurations) and XGradientBoost (XGB) were used as base regressors. Unless otherwise specified, the Scikit-Learn default parameters (version 0.22.1) were used. For the RF model, only the number of trees was increased to 1000. For SVR, the size of the cache was increased to 4096 megabytes in the first configuration; this only increases the training speed and has no influence on model quality. In the second configuration, the parameter "gamma" was additionally set to "auto". For MLP, in the first configuration the number of hidden layers was increased to two and the number of neurons per layer to 500. In the second configuration, early stopping was additionally activated, with 10% of the training data set aside as validation data; if the validation error does not improve by more than 0.001 over ten training epochs, training is stopped early to avoid overtraining. In the third configuration, three hidden layers with 250 neurons each were used, with early stopping still activated. For XGB, the default parameters of the XGBoost library (version 0.90)30 were applied. The training of RF, MLP and XGB was parallelized over 12 CPU cores, and both the generation of the folds for cross-validation and the training itself were seeded with a value of 24 to ensure reproducibility. This results in a total of seven different machine learning configurations.
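
A minimal sketch of the feature generation (not of the CVRegressor class itself), using public RDKit APIs; the helper name is ours.

import numpy as np
from rdkit import DataStructs
from rdkit.Chem import AllChem, Descriptors

# The four partial-charge descriptors are excluded (NaN for many molecules)
EXCLUDED = {"MaxPartialCharge", "MinPartialCharge",
            "MaxAbsPartialCharge", "MinAbsPartialCharge"}
DESCRIPTORS = [(name, fn) for name, fn in Descriptors.descList
               if name not in EXCLUDED]

def featurize(mol):
    # 196 RDKit descriptors plus a 4096-bit Morgan feature fingerprint (radius 3)
    desc = np.array([fn(mol) for _, fn in DESCRIPTORS])
    fp = AllChem.GetMorganFingerprintAsBitVect(
        mol, radius=3, nBits=4096, useFeatures=True)
    arr = np.zeros(4096, dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return np.concatenate([desc, arr])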

Six different descriptor/fingerprint combinations were also tested: first only the RDKit descriptors, then only the fingerprints, and finally both combined. Additionally, all three combinations were tested again in standardized form (z-transformed). As a result, 42 combinations of regressor and training data configuration were compared.

A five-fold cross-validation was performed for all configurations, which were evaluated using the MAE, the RMSE and the empirical coefficient of determination (r2). After training was completed for all configurations, the two external test data sets, which contain no training data, were used to re-validate each trained cross-validated model. Here, MAE, RMSE and r2 were also calculated as statistical quality measures.
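A minimal sketch of this evaluation loop for one configuration (the random forest), assuming a feature matrix X and target vector y as NumPy arrays; seeds and parameters follow the text above.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=24)
maes, rmses, r2s = [], [], []
for train_idx, test_idx in kf.split(X):
    model = RandomForestRegressor(n_estimators=1000, random_state=24, n_jobs=12)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    maes.append(mean_absolute_error(y[test_idx], pred))
    rmses.append(np.sqrt(mean_squared_error(y[test_idx], pred)))
    r2s.append(r2_score(y[test_idx], pred))
print(f"MAE={np.mean(maes):.3f}, RMSE={np.mean(rmses):.3f}, r2={np.mean(r2s):.3f}")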

Implementation

The Python dependencies are as follows: Python >= 3.7, NumPy >= 1.18, Scikit-Learn >= 0.22, RDKit >= 2019.09.3, Pandas >= 0.25, XGBoost >= 0.90, JupyterLab >= 1.2, Matplotlib >= 3.1, Seaborn >= 0.9

For the data preparation pipeline, ChemAxon Marvin10 and OpenEye QUACPAC/Tautomers28 are required. To use the provided prediction model with the included Python script, ChemAxon Marvin10 is not required.

First of all, you need a working Miniconda/Anaconda installation. You can get Miniconda at https://conda.io/en/latest/miniconda.html.

Now you can create an environment named "ml_pka" with all needed dependencies and activate it with:

conda env create -f environment.yml
conda activate ml_pka

You can also create a new environment yourself and install all dependencies without the environment.yml file:

conda create -n ml_pka python=3.7
conda activate ml_pka

In case of Linux or macOS:

conda install -c defaults -c rdkit -c conda-forge scikit-learn rdkit xgboost jupyterlab matplotlib seaborn

In case of Windows:

conda install -c defaults -c rdkit scikit-learn rdkit jupyterlab matplotlib seaborn
pip install xgboost

Operation

Prediction pipeline. To use the data preparation pipeline, you have to be in the repository folder and your conda environment has to be activated. Additionally, the Marvin10 command-line tool cxcalc and the QUACPAC28 command-line tool tautomers have to be available via your PATH variable.

The environment variables OE_LICENSE (containing the path to your OpenEye license file) and JAVA_HOME (pointing to the Java installation folder, which is needed by cxcalc) also have to be set.

After preparation, you can display brief usage information with bash run_pipeline.sh -h. Example call:

bash run_pipeline.sh --train datasets/chembl25.sdf --test datasets/novartis_cleaned_mono_unique_notraindata.sdf

Prediction tool. First of all, you have to be in the repository folder and your conda environment has to be activated. To use the prediction tool, you have to retrain the machine learning model. To do so, just call the training script; it will train the 5-fold cross-validated random forest machine learning model using 12 CPU cores. If you want to adjust the number of cores, you can edit train_model.py by changing the value of the variable EST_JOBS.

python train_model.py

To use the prediction tool with the trained model, QUACPAC/Tautomers has to be available, as mentioned in the section above.

Now you can call the Python script with an SDF file and an output path:

python predict_sdf.py my_test_file.sdf my_output_file.sdf

It should be noted that this model was built for structures that are monoprotic within a pH range of 2 to 12. If the model is used on multiprotic structures, the predicted values will probably not be correct.

Results

The statistics for the five-fold cross-validation are given in Table 1. In terms of the mean absolute error, a random forest with scaled Morgan feature fingerprints (radius=3) and descriptors gives the best-performing model (MAE=0.682, RMSE=1.032, r2=0.82). For the two external test sets (see Table 2), a random forest with the Morgan feature fingerprint (radius=3) gives the best model (Novartis: MAE=1.147, RMSE=1.513, r2=0.569; LiteratureCompilation: MAE=0.532, RMSE=0.785, r2=0.889). The predictive performances of Marvin10 and the OPERA tool15 are as follows: Novartis (Marvin: MAE=0.856, RMSE=1.166, r2=0.744; OPERA: MAE=2.274, RMSE=3.059, r2=-0.754) and LiteratureCompilation20-24 (Marvin: MAE=0.566, RMSE=0.865, r2=0.866; OPERA: MAE=1.737, RMSE=2.182, r2=0.124). This shows that our model slightly outperforms Marvin on the LiteratureCompilation, whereas Marvin performs better on the Novartis data set. For both data sets, our models17 have a better predictive performance than the OPERA tool.

Table 1. Statistics of the five-fold cross-validation of the machine learning models.

The two best and worst performing models are highlighted in green and red. For those neural networks where the values were specified as "not available" (“#NA”), the weights could not be optimized properly due to the large value range of the RDKit descriptors, so training failed here.

Model (seed=24) | Train configuration | MAE (mean) | MAE (std) | RMSE (mean) | RMSE (std) | R2 (mean) | R2 (std)
Random Forest (n_estimators=1000) | Desc (196 RDKit) | 0.718 | 0.022 | 1.077 | 0.021 | 0.804 | 0.010
 | FCFP6 (4096 bit) | 0.708 | 0.021 | 1.094 | 0.029 | 0.797 | 0.008
 | Desc + FCFP6 | 0.683 | 0.017 | 1.032 | 0.013 | 0.820 | 0.005
 | Desc (196 RDKit) (scaled) | 0.717 | 0.022 | 1.076 | 0.022 | 0.804 | 0.011
 | FCFP6 (4096 bit) (scaled) | 0.708 | 0.021 | 1.094 | 0.029 | 0.797 | 0.008
 | Desc + FCFP6 (scaled) | 0.682 | 0.017 | 1.032 | 0.013 | 0.820 | 0.005
Support Vector Machine | Desc (196 RDKit) | 2.100 | 0.037 | 2.436 | 0.035 | -0.004 | 0.004
 | FCFP6 (4096 bit) | 0.851 | 0.025 | 1.240 | 0.035 | 0.740 | 0.012
 | Desc + FCFP6 | 2.100 | 0.037 | 2.436 | 0.035 | -0.004 | 0.004
 | Desc (196 RDKit) (scaled) | 0.876 | 0.033 | 1.282 | 0.047 | 0.722 | 0.015
 | FCFP6 (4096 bit) (scaled) | 1.090 | 0.034 | 1.466 | 0.041 | 0.637 | 0.014
 | Desc + FCFP6 (scaled) | 1.020 | 0.037 | 1.400 | 0.047 | 0.668 | 0.016
Support Vector Machine (gamma='auto') | Desc (196 RDKit) | 2.016 | 0.042 | 2.362 | 0.039 | 0.056 | 0.009
 | FCFP6 (4096 bit) | 1.612 | 0.031 | 1.926 | 0.033 | 0.373 | 0.007
 | Desc + FCFP6 | 1.642 | 0.061 | 2.052 | 0.060 | 0.288 | 0.027
 | Desc (196 RDKit) (scaled) | 0.882 | 0.035 | 1.288 | 0.048 | 0.719 | 0.016
 | FCFP6 (4096 bit) (scaled) | 1.090 | 0.034 | 1.465 | 0.041 | 0.637 | 0.014
 | Desc + FCFP6 (scaled) | 1.019 | 0.037 | 1.400 | 0.047 | 0.669 | 0.016
Multilayer Perceptron (hidden_layer_sizes=(500, 500)) | Desc (196 RDKit) | #NA | #NA | #NA | #NA | #NA | #NA
 | FCFP6 (4096 bit) | 0.866 | 0.025 | 1.270 | 0.047 | 0.727 | 0.019
 | Desc + FCFP6 | #NA | #NA | #NA | #NA | #NA | #NA
 | Desc (196 RDKit) (scaled) | 0.726 | 0.018 | 1.102 | 0.050 | 0.794 | 0.022
 | FCFP6 (4096 bit) (scaled) | 1.037 | 0.045 | 1.457 | 0.057 | 0.640 | 0.024
 | Desc + FCFP6 (scaled) | 0.968 | 0.032 | 1.383 | 0.040 | 0.677 | 0.014
Multilayer Perceptron (hidden_layer_sizes=(500, 500), early_stopping=True) | Desc (196 RDKit) | #NA | #NA | #NA | #NA | #NA | #NA
 | FCFP6 (4096 bit) | 0.894 | 0.024 | 1.297 | 0.040 | 0.715 | 0.016
 | Desc + FCFP6 | #NA | #NA | #NA | #NA | #NA | #NA
 | Desc (196 RDKit) (scaled) | 0.768 | 0.034 | 1.161 | 0.090 | 0.770 | 0.038
 | FCFP6 (4096 bit) (scaled) | 1.031 | 0.037 | 1.447 | 0.057 | 0.645 | 0.026
 | Desc + FCFP6 (scaled) | 0.984 | 0.029 | 1.404 | 0.035 | 0.666 | 0.017
Multilayer Perceptron (hidden_layer_sizes=(250, 250, 250), early_stopping=True) | Desc (196 RDKit) | #NA | #NA | #NA | #NA | #NA | #NA
 | FCFP6 (4096 bit) | 0.869 | 0.023 | 1.265 | 0.039 | 0.729 | 0.016
 | Desc + FCFP6 | #NA | #NA | #NA | #NA | #NA | #NA
 | Desc (196 RDKit) (scaled) | 0.775 | 0.008 | 1.158 | 0.033 | 0.773 | 0.013
 | FCFP6 (4096 bit) (scaled) | 1.026 | 0.038 | 1.455 | 0.053 | 0.642 | 0.022
 | Desc + FCFP6 (scaled) | 0.973 | 0.035 | 1.388 | 0.053 | 0.674 | 0.023
XGBoost | Desc (196 RDKit) | 1.020 | 0.014 | 1.353 | 0.021 | 0.691 | 0.007
 | FCFP6 (4096 bit) | 1.094 | 0.027 | 1.423 | 0.036 | 0.657 | 0.011
 | Desc + FCFP6 | 1.018 | 0.010 | 1.346 | 0.022 | 0.694 | 0.005
 | Desc (196 RDKit) (scaled) | 1.020 | 0.014 | 1.353 | 0.021 | 0.691 | 0.007
 | FCFP6 (4096 bit) (scaled) | 1.094 | 0.027 | 1.423 | 0.036 | 0.657 | 0.011
 | Desc + FCFP6 (scaled) | 1.018 | 0.010 | 1.346 | 0.022 | 0.694 | 0.005

MAE, mean absolute error; RMSE, root mean square error.

Table 2. Predictive performance of the machine learning models on the two external test sets.

The two best and worst performing models are highlighted in green and red. For those neural networks where the values were specified as "not available" (“#NA”), the weights could not be optimized properly due to the large value range of the RDKit descriptors, so training failed here.

Model (seed=24) | Train configuration | Novartis MAE | Novartis RMSE | Novartis R2 | AvLiLuMoVe MAE | AvLiLuMoVe RMSE | AvLiLuMoVe R2
Random Forest (n_estimators=1000) | Desc (196 RDKit) | 1.259 | 1.607 | 0.513 | 0.689 | 0.979 | 0.828
 | FCFP6 (4096 bit) | 1.147 | 1.513 | 0.569 | 0.532 | 0.785 | 0.889
 | Desc + FCFP6 | 1.200 | 1.532 | 0.558 | 0.628 | 0.884 | 0.860
 | Desc (196 RDKit) (scaled) | 1.259 | 1.607 | 0.513 | 0.688 | 0.979 | 0.828
 | FCFP6 (4096 bit) (scaled) | 1.147 | 1.513 | 0.569 | 0.532 | 0.785 | 0.889
 | Desc + FCFP6 (scaled) | 1.198 | 1.531 | 0.558 | 0.628 | 0.884 | 0.860
Support Vector Machine | Desc (196 RDKit) | 2.177 | 2.451 | -0.132 | 2.180 | 2.441 | -0.070
 | FCFP6 (4096 bit) | 1.423 | 1.732 | 0.435 | 0.688 | 0.981 | 0.827
 | Desc + FCFP6 | 2.177 | 2.451 | -0.132 | 2.180 | 2.441 | -0.070
 | Desc (196 RDKit) (scaled) | 1.382 | 1.735 | 0.433 | 0.772 | 1.058 | 0.799
 | FCFP6 (4096 bit) (scaled) | 1.771 | 2.035 | 0.219 | 1.115 | 1.422 | 0.637
 | Desc + FCFP6 (scaled) | 1.746 | 2.015 | 0.235 | 1.044 | 1.345 | 0.675
Support Vector Machine (gamma='auto') | Desc (196 RDKit) | 2.162 | 2.428 | -0.111 | 1.921 | 2.242 | 0.097
 | FCFP6 (4096 bit) | 1.686 | 1.932 | 0.297 | 1.429 | 1.670 | 0.499
 | Desc + FCFP6 | 2.161 | 2.442 | -0.124 | 1.611 | 2.004 | 0.279
 | Desc (196 RDKit) (scaled) | 1.378 | 1.732 | 0.435 | 0.766 | 1.049 | 0.802
 | FCFP6 (4096 bit) (scaled) | 1.770 | 2.034 | 0.220 | 1.114 | 1.421 | 0.637
 | Desc + FCFP6 (scaled) | 1.744 | 2.013 | 0.236 | 1.043 | 1.343 | 0.676
Multilayer Perceptron (hidden_layer_sizes=(500, 500)) | Desc (196 RDKit) | #NA | #NA | #NA | #NA | #NA | #NA
 | FCFP6 (4096 bit) | 1.414 | 1.773 | 0.407 | 0.852 | 1.169 | 0.755
 | Desc + FCFP6 | #NA | #NA | #NA | #NA | #NA | #NA
 | Desc (196 RDKit) (scaled) | 1.318 | 1.634 | 0.497 | 0.688 | 0.942 | 0.841
 | FCFP6 (4096 bit) (scaled) | 1.627 | 2.033 | 0.221 | 1.102 | 1.569 | 0.558
 | Desc + FCFP6 (scaled) | 1.542 | 1.941 | 0.290 | 1.001 | 1.427 | 0.634
Multilayer Perceptron (hidden_layer_sizes=(500, 500), early_stopping=True) | Desc (196 RDKit) | #NA | #NA | #NA | #NA | #NA | #NA
 | FCFP6 (4096 bit) | 1.404 | 1.772 | 0.408 | 0.846 | 1.154 | 0.761
 | Desc + FCFP6 | #NA | #NA | #NA | #NA | #NA | #NA
 | Desc (196 RDKit) (scaled) | 1.298 | 1.626 | 0.502 | 0.701 | 0.936 | 0.843
 | FCFP6 (4096 bit) (scaled) | 1.611 | 2.028 | 0.225 | 1.141 | 1.575 | 0.554
 | Desc + FCFP6 (scaled) | 1.605 | 1.998 | 0.248 | 0.987 | 1.365 | 0.665
Multilayer Perceptron (hidden_layer_sizes=(250, 250, 250), early_stopping=True) | Desc (196 RDKit) | #NA | #NA | #NA | #NA | #NA | #NA
 | FCFP6 (4096 bit) | 1.363 | 1.717 | 0.445 | 0.860 | 1.164 | 0.757
 | Desc + FCFP6 | #NA | #NA | #NA | #NA | #NA | #NA
 | Desc (196 RDKit) (scaled) | 1.354 | 1.705 | 0.452 | 0.777 | 1.057 | 0.799
 | FCFP6 (4096 bit) (scaled) | 1.584 | 1.989 | 0.254 | 1.053 | 1.468 | 0.613
 | Desc + FCFP6 (scaled) | 1.581 | 1.963 | 0.274 | 0.953 | 1.352 | 0.672
XGBoost | Desc (196 RDKit) | 1.367 | 1.704 | 0.453 | 0.806 | 1.040 | 0.819
 | FCFP6 (4096 bit) | 1.280 | 1.624 | 0.503 | 0.823 | 0.992 | 0.782
 | Desc + FCFP6 | 1.293 | 1.637 | 0.495 | 0.822 | 0.995 | 0.774
 | Desc (196 RDKit) (scaled) | 1.367 | 1.704 | 0.453 | 0.806 | 1.040 | 0.819
 | FCFP6 (4096 bit) (scaled) | 1.280 | 1.624 | 0.503 | 0.823 | 0.992 | 0.782
 | Desc + FCFP6 (scaled) | 1.293 | 1.637 | 0.495 | 0.822 | 0.995 | 0.774
ChemAxon Marvin (V20.1.0) | | 0.856 | 1.166 | 0.744 | 0.566 | 0.865 | 0.866
OPERA (V2.5)* | | 2.274 | 3.059 | -0.754 | 1.737 | 2.182 | 0.124

*For OPERA, 6 molecules from AvLiLuMoVe and 31 molecules from Novartis were left out because OPERA predicted either two or zero pKa values.

MAE, mean absolute error; RMSE, root mean square error.

Discussion and conclusions

The good performance of Marvin on the Novartis set is interesting to note: the RMSE is almost 0.4 units better than that of our top-performing model. This could be because Marvin's training set is much larger than our own, providing a better foundation for the training of the Marvin model. In contrast, Marvin performs slightly worse than our top model on the LiteratureCompilation. The OPERA tool performs significantly worse than our model on both external test sets. We assume that the addition of 2470 ChEMBL pKa data points to our training set, which are not part of the OPERA training set, explains this difference in predictive performance. In addition, OPERA pre-processes the data differently from our pre-processing procedure.

As a next step in the enhancement and improvement of our pKa prediction model17, we are currently expanding it to multiprotic molecules. We are also investigating the impact of different neural net architectures and types (such as graph neural nets) and the development of individual models for acids and bases. From a chemistry perspective, an analysis of the pKa effects of different functional groups (e.g. by means of matched molecular pair analysis) is an ongoing effort for a future publication.

Data availability

Source data

Zenodo: czodrowskilab/Machine-learning-meets-pKa article17. https://doi.org/10.5281/zenodo.3662245.

The following data sets were used in this study:

  • AvLiLuMoVe.sdf - Manually combined literature pKa data.

  • chembl25.sdf - Experimental pKa data extracted from ChEMBL25.

  • datawarrior.sdf - pKa data shipped with DataWarrior.

  • combined_training_datasets_unique.sdf - Preprocessed and combined data from datasets chembl25.sdf and datawarrior.sdf, used as training dataset.

  • AvLiLuMoVe_cleaned_mono_unique_notraindata.sdf - used as external test set.

  • novartis_cleaned_mono_unique_notraindata.sdf - in-house data set provided by Novartis19, used as external test set.

The data sets are also available at https://github.com/czodrowskilab/Machine-learning-meets-pKa.

License: MIT license.

Software availability

The source code is available at https://github.com/czodrowskilab/Machine-learning-meets-pKa.

Archived source code at time of publication17: https://doi.org/10.5281/zenodo.3662245.

License: MIT license.

How to cite this article
Baltruschat M and Czodrowski P. Machine learning meets pKa [version 1; peer review: 2 approved]. F1000Research 2020, 9(Chem Inf Sci):113 (https://doi.org/10.12688/f1000research.22090.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.
Open Peer Review

Version 1 (published 13 Feb 2020)
Reviewer Report 24 Mar 2020
Johannes Kirchmair, Department of Pharmaceutical Chemistry, Faculty of Life Sciences, University of Vienna, Vienna, Austria 
Approved
Baltruschat and Czodrowski report on the development of a Python-based tool for the prediction of pKa values. The tool is made available open source and will certainly be useful to the scientific community.
  • Could the authors
... Continue reading
How to cite this report
Kirchmair J. Reviewer Report For: Machine learning meets pKa [version 1; peer review: 2 approved]. F1000Research 2020, 9(Chem Inf Sci):113 (https://doi.org/10.5256/f1000research.24362.r61511)
COMMENTS ON THIS REPORT
  • Author Response 27 Apr 2020
    Paul Czodrowski, Faculty of Chemistry and Chemical Biology, TU Dortmund University, Otto-Hahn-Strasse 6, 44227 Dortmund, Germany
    27 Apr 2020
    Author Response
    Could the authors please comment on the structural relationship between the training and the test data (how far apart are these datasets, e.g. measured by the distribution of pairwise Tanimoto ... Continue reading
Reviewer Report 25 Feb 2020
Ruth Brenk, Department of Biomedicine, University of Bergen, Bergen, Norway 
Approved
The article describes a machine learning method for pKa prediction. The presented method and results are scientifically solid. Therefore, I recommend the article for indexing. However, the methods and results should be presented more clearly before the article is indexed.
... Continue reading
How to cite this report
Brenk R. Reviewer Report For: Machine learning meets pKa [version 1; peer review: 2 approved]. F1000Research 2020, 9(Chem Inf Sci):113 (https://doi.org/10.5256/f1000research.24362.r60011)
COMMENTS ON THIS REPORT
  • Author Response 27 Apr 2020
    Paul Czodrowski, Faculty of Chemistry and Chemical Biology, TU Dortmund University, Otto-Hahn-Strasse 6, 44227 Dortmund, Germany
    27 Apr 2020
    Author Response
    1. Past tense should be used consistently to describe the methods and results. We have used past tense throughout the entire document.
    2. The “different experimental methods” should be moved
    ... Continue reading

Comments on this article (2)

Version 2 (published 27 Apr 2020, revised)
Version 1 (published 13 Feb 2020)
  • Author Response 12 Mar 2020
    Paul Czodrowski, Faculty of Chemistry and Chemical Biology, TU Dortmund University, Otto-Hahn-Strasse 6, 44227 Dortmund, Germany
    12 Mar 2020
    Author Response
    Thanks for the comments, we will take a closer look at the distinction between acidic and basic pKa and the different OPERA models in a follow-up publication.
     
    Currently, a check of ... Continue reading
  • Reader Comment 20 Feb 2020
    Kamel Mansouri, ILS, USA
    20 Feb 2020
    Reader Comment
    This is a good attempt to model pKa which is an important parameter to predict. This type of work is also a good addition to the existing free and open-source ... Continue reading