F1000Research

2046-1402

F1000 Research Limited

London, UK

10.12688/f1000research.130936.2

Research Article

Articles

Optimization of binding affinities in chemical space with generative pre-trained transformer and deep reinforcement learning

[version 2; peer review: 3 approved, 2 approved with reservations]

Xiaopeng (affiliations 1, 2; https://orcid.org/0000-0003-2414-7851): Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Project Administration, Resources, Software, Validation, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

Juexiao Zhou (affiliations 1, 2): Conceptualization, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

Chen Zhu (affiliation 3): Conceptualization, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

Qing Zhan (affiliations 1, 2): Conceptualization, Formal Analysis, Writing – Review & Editing

Zhongxiao Li (affiliations 1, 2; https://orcid.org/0000-0003-2480-0750): Conceptualization, Methodology, Writing – Review & Editing

Ruochi Zhang (affiliation 4; https://orcid.org/0000-0001-6541-4050): Conceptualization, Resources, Writing – Review & Editing

Wang (affiliation 4): Conceptualization, Resources, Writing – Review & Editing

Xingyu Liao (affiliations 1, 2): Conceptualization, Methodology, Writing – Review & Editing

Xin Gao (a; affiliations 1, 2): Conceptualization, Funding Acquisition, Supervision, Writing – Review & Editing

1 Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia

2 Computer, Electrical and Mathematical Sciences and Engineering (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia

3 KAUST Catalysis Center (KCC), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia

4 Syneron Technology, Guangzhou, China

a xin.gao@kaust.edu.sa

No competing interests were disclosed.

20 2 2024

2023

757

15 2 2024

2024

This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

The key challenge in drug discovery is to discover novel compounds with desirable properties. Among the properties, binding affinity to a target is one of the prerequisites and usually evaluated by molecular docking or quantitative structure activity relationship (QSAR) models.

Methods

In this study, we developed SGPT-RL, which uses a generative pre-trained transformer (GPT) as the policy network of the reinforcement learning (RL) agent to optimize the binding affinity to a target. SGPT-RL was evaluated on the Moses distribution learning benchmark and two goal-directed generation tasks, with Dopamine Receptor D2 (DRD2) and Angiotensin-Converting Enzyme 2 (ACE2) as the targets. Both QSAR model and molecular docking were implemented as the optimization goals in the tasks. The popular Reinvent method was used as the baseline for comparison.

Results

The results on the Moses benchmark showed that SGPT-RL learned good property distributions and generated molecules with high validity and novelty. On the two goal-directed generation tasks, both SGPT-RL and Reinvent were able to generate valid molecules with improved target scores. The SGPT-RL method achieved better results than Reinvent on the ACE2 task, where molecular docking was used as the optimization goal. Further analysis shows that SGPT-RL learned conserved scaffold patterns during exploration.

Conclusions

The superior performance of SGPT-RL on the ACE2 task indicates that it can be applied to virtual screening, where molecular docking is widely used as the selection criterion. In addition, the scaffold patterns learned by SGPT-RL during exploration can help chemists design and discover novel lead candidates.

Keywords: Drug design, transformers, reinforcement learning, molecular docking, hit discovery

This work was supported by the grants assigned to Prof. Xin Gao from the King Abdullah University of Science and Technology (KAUST) Office of Research Administration (ORA) under Award No FCC/1/1976-44-01, FCC/1/1976-45-01, URF/1/4663-01-01, REI/1/5202-01-01, REI/1/4940-01-01, and RGC/3/4816-01-01.

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Revised Amendments from Version 1

Changes made from version 1 to version 2:

1. The repetitive explanations of abbreviations in the abstract, figure legends, and table legends were removed, as mentioned by the reviewers.

2. Moved the property distribution changes from the Supplementary Figures into Figures 3-4 to make the presentation clearer, as mentioned by the reviewers.

3. Updated the Supplementary Figures accordingly to support the changes in 2.

4. Updated the source data reference to follow the update in 3.

5. Corrected several typos and removed unnecessary sentences to improve readability, as mentioned by the reviewers.

6. Added descriptions to clarify the QSAR processing, as mentioned by a reviewer.

7. Added a citation as suggested by a reviewer.

8. Added descriptions of the formulation of the optimization as an RL problem.

9. Added explanations of abbreviations in the figure and table captions to make them easier to read.

10. Renamed subsection references to use names instead of numbers.

Introduction

The key challenge in drug discovery is to discover new molecules with desirable properties. ¹ In traditional drug discovery campaigns, high-throughput virtual screening, biochemical assays, physicochemical assays, and in vitro profiling of absorption, distribution, metabolism, and excretion (ADME) properties of chemicals are usually conducted. ² However, the chemical space of possible molecules is enormous, with 10²³ to 10⁶⁰ potential drug-like molecules, whereas the number of synthesized molecules is on the order of 10⁸. ³ It is infeasible to screen all of these molecules to select the desirable ones. Many machine learning tools that predict molecular properties, including binding affinity, drug-likeness, synthetic accessibility, and ADME properties, have been integrated into screening pipelines as key components, ⁴ as they are much faster than traditional computational methods and yield rapid and accurate property predictions. ³,⁵ Employing these tools has improved the efficiency of virtually screening chemical libraries, which are generated from available chemical reagents. ⁶,⁷ However, the search is still limited to the molecules in those libraries.

In recent years, de novo molecular design, especially with deep generative models, has progressed rapidly; such models can efficiently explore the chemical space and steer molecular generation towards desired properties. ³,⁸–¹⁰ A pioneering work published in 2018 employed a variational autoencoder (VAE) to learn a continuous representation of the chemical space and used gradient-based optimization to search for functional molecules. ¹¹ Many methods have been developed since, and the most representative classes include recurrent neural networks, autoencoders, generative adversarial networks, and reinforcement learning (RL). ³,⁴ Among them, RL methods have been shown to optimize the generation of molecules towards desirable properties, including target activity, drug-likeness, molecular weight, synthetic accessibility (SA), and similarity to given molecules. ⁴,⁶,¹²,¹³

Transformer ¹⁴ is a prominent deep learning architecture that was first proposed for natural language translation and has had a tremendous impact in many fields, such as language modeling, speech processing, and computer vision. ¹⁵ A decoder-only variant of the transformer, the Generative Pretrained Transformer (GPT), stands out among the many transformer variants. It was trained on a large corpus of unlabeled text and was able to generate news articles that human evaluators found difficult to distinguish from human-written ones. ¹⁶,¹⁷ In addition, a GPT model fine-tuned with reinforcement learning produced better outputs, with reduced toxicity and improved truthfulness. ¹⁸

Several transformer-based methods have been proposed for molecular generation tasks. ⁴,¹⁹–²¹ One study formulated protein-specific molecular generation as a machine translation problem, with amino acid sequences as inputs and the simplified molecular input line entry system (SMILES) representations of molecules as outputs. ¹⁹ The model was pretrained on the amino acid sequences of targets and the corresponding SMILES of the binding molecules, and was able to generate valid molecules with structural novelty and plausible drug-likeness. Another work also formulated molecular generation as a translation problem, but with the goal of optimizing the generated molecules towards desirable properties. ²¹ The model was trained with a desirable property together with the starting molecule as the input and a modified molecule fulfilling that property as the output. The results showed that transformers can generate molecules with desirable properties through modifications that are intuitive to chemists. A decoder-only transformer model, MolGPT, was also proposed for molecular generation. ²⁰ It was trained on molecules with property conditions and was able to generate novel molecules fulfilling the corresponding properties. Another work also used a decoder-only transformer model but targeted multiple properties. ⁴ After a transformer model was pretrained, a gated recurrent unit (GRU) model was used to distill it and initiate an RL agent. This agent was then trained to optimize multiple properties through the Reinvent approach, ¹³ and it can generate novel molecules satisfying multiple property constraints. In summary, these studies showed the advantages of transformers for molecular generation, especially for constrained generation tasks. ⁴,¹²

The activity of a compound, which arises from its binding affinity to a target, is the primary consideration in drug discovery. Three approaches are used to estimate binding affinity: bioassays, quantitative structure activity relationship (QSAR) models, and molecular docking. ²² In vitro bioassay data are reliable but often scarce, so QSAR models and molecular docking are usually used for in silico screening. ²² Because transformers excel at sequence generation and RL is well suited to optimization tasks, an intuitive idea is to combine the two to optimize binding affinity. However, as far as we know, no such studies have been conducted. Two main obstacles may have stopped researchers from conducting them. First, high-end GPUs with large memories are required: during the RL process, a transformer decoder has to generate a batch of molecules, and such generation is very memory intensive. Second, such studies require interdisciplinary knowledge, including computational chemistry and machine learning expertise. For example, molecular docking is commonly used for virtual screening but is not easy for machine learning experts to perform and interpret, while transformers and RL are widely used in the deep learning community but are hard for computational chemists to grasp and implement.

In this study, we proposed the first method that combines GPT and RL for molecular generation. We developed a tool named SGPT-RL, which uses a transformer decoder as the policy network of RL agents. The workflow is shown in Figure 1. First, GPT was trained on lead-like molecules to obtain a prior model that learns the chemical space. This prior model was used to initiate the agent, which shared the same decoder model as the policy network. Then, the agent was trained in an RL fashion to optimize the generation of molecules towards desirable properties, as shown in Figure 1c. The agent was used to generate a batch of molecules; the molecules were scored by scoring functions to obtain the target scores; the scores were combined with the prior likelihoods to calculate the losses; the losses that contain both the target score and prior likelihood information were used to serve as the feedback to the agent. During training, the likelihood of the agent to generate molecules with good target scores is increased and those with poor scores decreased. We evaluated SGPT-RL on the Moses distribution learning benchmark and two goal-directed generation tasks. Results on the Moses benchmark showed that the SGPT-RL prior model was able to learn good property distributions and generate molecules with high novelty. The two goal-directed generation tasks are a Dopamine Receptor D2 (DRD2) task, with QSAR model-based activity as the scoring function, and an Angiotensin-Converting Enzyme 2 (ACE2) task, with molecular docking affinity as the target score. In both tasks, the SGPT-RL agents were able to generate valid molecules with high target activities. In the DRD2 task, the SGPT-RL agent was able to explore more scaffolds than the popular Reinvent method; in the ACE2 task, the SGPT-RL agent generated molecules with significantly better docking scores than Reinvent. 
In addition, we found that the Reinvent agents could not learn effectively after around 100 steps, whereas the SGPT-RL agents kept learning and generated molecules with more ring structures. We also found that the SGPT-RL agents learned some generative patterns, whereas the Reinvent agents explored with strong randomness and showed no clear patterns.

Figure 1. The workflow of SGPT-RL.

a) The main workflow. Simplified molecular input line entry system (SMILES) from the Moses benchmark was used to train a prior model. An agent model was then initiated from the prior and trained in a reinforcement learning (RL) fashion to generate molecules with desirable properties. b) The architecture of the prior model. The agent shares the same architecture. c) The pipeline of the RL approach. The prior model was used to initiate the agent model. During each RL step, the agent model was used to generate a batch of SMILES sequences. The generated sequences were evaluated by the prior model and a scoring function to calculate augmented likelihoods, which serve as the feedback to update the agent. In the Dopamine Receptor D2 (DRD2) task, a quantitative structure activity relationship (QSAR) model was used as the scoring function; in the Angiotensin-Converting Enzyme 2 (ACE2) task, ACE2 docking score was used as the scoring function.

Methods

Datasets

The dataset used to train the prior models was obtained from the Moses benchmark. ²³,⁴² This dataset contains 1.9 million lead-like molecules from the Zinc database. ²⁴ The training and test sets of the Moses benchmark, which contain 1,584,664 and 176,075 molecules respectively, were used for training and testing.

Known active molecules that bind to DRD2 or ACE2 were obtained from ExCAPE-DB. ²⁵,⁴² In total, 8,036 unique molecules known to be active against DRD2 and 56 unique molecules known to be active against ACE2 were retrieved. None of the molecules in these two sets was found in the Moses training dataset.

Model architecture

A brief overview of the framework is illustrated in Figure 1a. A transformer decoder prior model was trained on the Moses dataset. This pretrained prior model was used to initiate the agent. During the RL process, the agent model was used to generate molecules, which were scored by the prior network and a scoring function to provide feedback to update the agent. The agent model trained after the final step was used to generate molecules for property distribution analysis.

The prior network

In SGPT-RL, a generative pre-trained transformer (GPT) ²⁶ was used as the prior model to learn the chemical space. Tokenized SMILES sequences were used to train the model on a next token prediction task.

The GPT model we used is a simplified version of GPT-2, with only ∼6M parameters. The architecture of the model is illustrated in Figure 1b. The model is composed of eight decoder blocks, input and positional embedding before the blocks, a linear layer after the blocks, and a softmax layer before output. Each of the blocks contains a masked multi-head self-attention layer and a fully connected feedforward layer, with residual connections in each of the layers. Layer normalization is conducted in the two layers to normalize the inputs. An embedding size of 256 was used in all layers.

The core of the GPT model is the masked multi-head self-attention layer. In this layer, eight scaled dot-product attention functions help the model capture key information in a sequence. In the attention function, the dot product of the query matrix $Q$ and the key matrix $K$ is computed and divided by the square root of the key dimension, $\sqrt{d_k}$. The result is passed through a softmax function to obtain the attention weights, which are multiplied by the value matrix $V$ to obtain the final attention output. The formula is shown in Equation 1. ¹⁴

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \quad (1)$$
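The masked attention computation for a single head can be sketched in plain Python as follows. This is an illustrative sketch rather than the SGPT-RL code: the function names are hypothetical, matrices are represented as lists of row vectors, and the causal mask is implemented by letting position i see only positions 0 to i.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention(Q, K, V):
    """Causal scaled dot-product attention for one head.

    Q, K, V are lists of row vectors, one row per token position.
    Position i attends only to positions 0..i (the causal mask).
    """
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        # Dot products with the visible keys, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        out.append([sum(w * V[j][c] for j, w in enumerate(weights))
                    for c in range(len(V[0]))])
    return out

# Toy example with two positions and d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = masked_attention(Q, K, V)
```

In this toy example the first position can only attend to itself, so its output equals the first value vector; the second position mixes both value vectors, with more weight on the one whose key matches its query.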

The prior model was trained for ten epochs on the training dataset and evaluated on the testing dataset after each epoch. Cross-entropy loss was used with the AdamW optimizer ²⁷ to update the model, with a learning rate of 0.001. A batch size of 1,024 was used to train the model. To generate the SMILES string of a molecule, a start token was fed to the model to predict the next. The generated token was concatenated with previous tokens to predict the next, until an end token was predicted or a maximum sequence length of 140 was reached.
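The token-by-token sampling procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the SGPT-RL implementation: `next_token_probs`, the token names `<s>` and `</s>`, and `toy_model` are hypothetical stand-ins for the trained prior and its vocabulary, and counting the 140-token limit without the start token is also an assumption.

```python
import random

MAX_LEN = 140  # maximum sequence length stated in the text

def sample_smiles(next_token_probs, start="<s>", end="</s>", rng=random):
    """Sample one sequence from an autoregressive model.

    next_token_probs(tokens) returns a dict that maps each candidate
    next token to its probability, given the prefix `tokens`.
    """
    tokens = [start]
    while len(tokens) - 1 < MAX_LEN:
        probs = next_token_probs(tokens)
        choices, weights = zip(*probs.items())
        nxt = rng.choices(choices, weights=weights, k=1)[0]
        if nxt == end:
            break  # stop when the end token is sampled
        tokens.append(nxt)
    return "".join(tokens[1:])  # drop the start token

def toy_model(tokens):
    """Hypothetical stand-in for the prior: emits C, C, O, then stops."""
    script = ["C", "C", "O", "</s>"]
    return {script[len(tokens) - 1]: 1.0}

smiles = sample_smiles(toy_model)
capped = sample_smiles(lambda tokens: {"C": 1.0})  # never emits the end token
```

With the toy model the loop yields the ethanol SMILES `CCO`; the second call shows that generation stops at the length limit when no end token is ever produced.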

Training the agent

The process of generating molecules with desirable properties was framed as an RL problem, and the Reinvent approach was used, as described below. ¹² In the RL formulation, the state is the sequence generated so far, the action is the next token to add, and the reward is an augmented likelihood calculated from the prior likelihood and property scores. The GPT model described in Subsection “The prior network” was used for both the prior and the agent, and customized scoring functions for the target properties were used in each of the two tasks.

The loss function used to update the agent model is defined in Equations 2–3. First, a SMILES sequence $A$ was sampled from the agent model with log-likelihood $\log p(A)_{\mathrm{agent}}$. The sequence was then passed to the prior model to calculate a prior log-likelihood $\log p(A)_{\mathrm{prior}}$ and evaluated with the scoring function of the desirable property to obtain a score $S(A)$. The score was multiplied by a coefficient $\sigma$ and added to the prior log-likelihood to obtain an augmented log-likelihood $\log p(A)_{\mathrm{aug}}$, as shown in Equation 2. The idea behind this equation is that the prior log-likelihood preserves the rules learnt from the SMILES sequences of molecules, while the score biases the model towards generating SMILES with desirable properties.

$$\log p(A)_{\mathrm{aug}} = \log p(A)_{\mathrm{prior}} + \sigma S(A) \quad (2)$$

Finally, the squared error between the augmented log-likelihood and the agent log-likelihood was used as the loss to update the agent model, as shown in Equation 3.

$$\mathrm{Loss} = \left[\log p(A)_{\mathrm{aug}} - \log p(A)_{\mathrm{agent}}\right]^{2} \quad (3)$$
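The augmented log-likelihood and the squared-error loss can be illustrated numerically with the short sketch below. The function names are hypothetical, the default sigma of 60 is the value reported for the DRD2 agent, and averaging the per-sequence losses over a batch is an assumption about how a batch update might be formed.

```python
def augmented_log_likelihood(prior_ll, score, sigma):
    """Prior log-likelihood plus sigma times the property score."""
    return prior_ll + sigma * score

def agent_loss(agent_ll, prior_ll, score, sigma=60.0):
    """Squared error between augmented and agent log-likelihoods."""
    aug = augmented_log_likelihood(prior_ll, score, sigma)
    return (aug - agent_ll) ** 2

def batch_loss(agent_lls, prior_lls, scores, sigma=60.0):
    """Mean per-sequence loss over a batch (averaging is an assumption)."""
    losses = [agent_loss(a, p, s, sigma)
              for a, p, s in zip(agent_lls, prior_lls, scores)]
    return sum(losses) / len(losses)

# One sequence: prior log-likelihood -40, agent log-likelihood -30, score 0.5.
# Augmented log-likelihood = -40 + 60 * 0.5 = -10, so the loss is (-10 + 30)^2.
example = agent_loss(agent_ll=-30.0, prior_ll=-40.0, score=0.5)
```

A sequence with a score of zero and identical prior and agent log-likelihoods gives a loss of zero, so the agent is only pushed away from the prior where the scoring function rewards it.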

Evaluation metrics

Five metrics from the Moses benchmark were used to evaluate the models: validity, uniqueness, novelty, similarity to a nearest neighbor (SNN), and internal diversity (IntDiv). The definitions of the metrics are given below. The generated SMILES sequences to be evaluated are denoted by $G$, the training dataset is denoted by $T$, and $n$ is the total number of generated sequences.

• Validity: the fraction of valid sequences among 10,000 generated sequences.

• Uniqueness: the fraction of unique sequences among 10,000 valid generated sequences.

• Novelty: the fraction of unique sequences in $G$ that are not in $T$.

• Similarity to a nearest neighbor (SNN): evaluates the similarity of the generated molecules to the training molecules. It is the average Tanimoto similarity $T(m_G, m_T)$ between the fingerprint of a molecule $m_G$ from the generated set $G$ and that of its nearest neighbor molecule $m_T$ in the training dataset.

$$\mathrm{SNN}(G, T) = \frac{1}{n} \sum_{m_G \in G} \max_{m_T \in T} T(m_G, m_T) \quad (4)$$

• Internal diversity (IntDiv): assesses the diversity within $G$. It is defined as one minus the average Tanimoto similarity over all pairs of molecules $m_1, m_2$ in the generated sequences $G$.

$$\mathrm{IntDiv}(G) = 1 - \frac{1}{n^{2}} \sum_{m_1, m_2 \in G} T(m_1, m_2) \quad (5)$$
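The set-based metrics and the two Tanimoto-based metrics can be sketched in a few lines of Python. This is an illustration only, not the Moses implementation: fingerprints are represented here as hypothetical sets of on-bit indices, and the function names are illustrative. As in Equation 5, the IntDiv sum runs over all ordered pairs, including each molecule paired with itself.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def snn(generated, training):
    """Mean similarity of each generated molecule to its nearest training neighbor."""
    return sum(max(tanimoto(g, t) for t in training)
               for g in generated) / len(generated)

def int_div(generated):
    """One minus the mean pairwise Tanimoto similarity over all n^2 pairs."""
    n = len(generated)
    total = sum(tanimoto(a, b) for a in generated for b in generated)
    return 1.0 - total / (n * n)

def uniqueness(valid_smiles):
    """Fraction of distinct strings among the valid generated strings."""
    return len(set(valid_smiles)) / len(valid_smiles)

def novelty(valid_smiles, training_smiles):
    """Fraction of distinct generated strings that are absent from training."""
    unique = set(valid_smiles)
    return len(unique - set(training_smiles)) / len(unique)

# Toy fingerprints: two identical generated molecules, one training molecule.
G = [{1, 2}, {1, 2}]
T = [{1, 2, 3, 4}]
snn_value = snn(G, T)          # 2 shared bits out of 4 in the union
int_div_value = int_div(G)     # identical molecules have no diversity
```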

Evaluated molecular properties

In our experiments, seven molecular properties were calculated to evaluate the property distributions and used as the optimization goals. All these properties were used to compare the property distributions of molecules. DRD2 activity and ACE2 docking score were used as the scoring functions of the DRD2 and ACE2 tasks, respectively.

DRD2 activity was evaluated with a QSAR model. ¹² This model is a support vector machine (SVM) classifier with a Gaussian kernel trained on active and inactive molecules. In the modeling, a SMILES string is converted into a molecule, from which Morgan fingerprints are computed using RDKit 2017.09.1. ²⁸ The fingerprints were used as the features to build the SVM classifier. The model predicts a probability score ranging from zero to one; the closer the score is to one, the higher the predicted DRD2 activity.

ACE2 affinity was calculated using molecular docking as described in Subsection “Task 2: structure-based generation with ACE2 as the target”.

The quantitative estimate of drug-likeness (QED) quantifies the drug-likeness of a molecule using molecular properties as inputs. ²⁹ It was calculated with RDKit (2017.09.1) ³⁰ and ranges from zero to one; values closer to one are more favorable.

The synthetic accessibility score (SAscore) measures the difficulty of synthesizing a molecule. ²⁸ A predictive model built by Blaschke et al. ¹³ was used, in which molecular weight was combined with the raw score, ²⁸ which ranges from one to 10, as features to predict the probability of synthetic accessibility. The model gives a probability score ranging from zero to one; the closer the score is to one, the easier the molecule is to synthesize.

Molecular weight and the log of partition coefficient (LogP) were calculated using RDKit. ³⁰ Length of the SMILES string was also calculated for the molecules.

Evaluation settings

The SGPT-RL model was evaluated on a distribution learning benchmark and two goal-directed generation tasks. The Moses benchmark was used for distribution evaluation. DRD2 activity and ACE2 affinity were used as the scoring functions in the two goal-directed generation tasks, respectively.

Benchmarking on distribution learning

To evaluate on the Moses distribution learning benchmark, the SGPT-RL prior model was trained on Moses training dataset. The model after the final epoch was used to generate 10,000 molecules to evaluate on this benchmark. Five metrics were used for comparison, including validity, uniqueness, novelty, SNN and intDiv. The baseline models from the Moses benchmark were run with default parameters for comparison. MCMG (multi-constraints molecular generation) and MolGPT were also run with default parameters to generate 10,000 molecules for comparison.

Task 1: goal-directed generation with DRD2 as the target

In the DRD2 task, we aimed to generate molecules that are active against DRD2. The DRD2 activity predicted by a QSAR model ¹² was used as the target. The prior model trained from the Moses dataset was used to initiate the agent on this task. The agent was trained for 2,000 steps and the model after the final step was used to sample 10,000 molecules for property distribution analysis.

The Reinvent model ¹² was used as the baseline for comparison. In this agent, a three-layer GRU was used as the policy model. The default hyper-parameters of Reinvent were used. The prior model was trained for five epochs with a batch size of 128. The Adam optimizer was used with a learning rate of 0.001. For a fair comparison, the Reinvent agent was trained with the same scoring function as the SGPT-RL agent. It was trained for 3,000 steps with a batch size of 64, a learning rate of 0.0005, and a sigma of 60.

Task 2: structure-based generation with ACE2 as the target

In the ACE2 task, we trained the SGPT-RL agent with ACE2 affinity as the desirable property. ACE2 affinity was evaluated by ligand-receptor docking experiments. The 3D structure of the human ACE2 receptor (PDB ID 1R4L) was downloaded from the Protein Data Bank. It was processed with PyMol (2.5.4) ³¹ to remove water molecules and the original ligands; an open-source version of PyMol is also available. The structure was then processed with MGLTools (1.5.7) ³² to add polar hydrogens and obtain the docking grid. The pocket where XX5 is located was used for docking the generated molecules. The SMILES strings of generated molecules were converted to 3D ligand structures using RDKit (2017.09.1). ³⁰ The 3D ligand structures were processed with OpenBabel (3.0.0) ³³ to assign Gasteiger partial charges and convert them to pdbqt format. The final docking was performed using AutoDock Vina (1.1.2) ³⁴ with eight poses for each ligand. The lowest docking score among the eight poses was used as the docking score of the ligand.

To calculate the augmented log-likelihoods during agent training, the affinity score needs to lie in the range of zero to one. The docking score was therefore transformed into this range using the reverse sigmoid function shown in Equation 6, where $l$, $h$, and $k$ are constants set to -12, -8, and 0.25, respectively.

$$\mathrm{Rsigmoid}(x) = \frac{1}{1 + 10^{\,k \cdot \frac{x - \frac{h + l}{2}}{h - l}}} \quad (6)$$
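The effect of the reverse sigmoid transform can be illustrated with the sketch below. It assumes that the exponent in Equation 6 reads k times (x minus the midpoint of h and l) divided by (h minus l); the exact scaling of the exponent may differ in the original implementation, and the function name is hypothetical.

```python
def reverse_sigmoid(x, low=-12.0, high=-8.0, k=0.25):
    """Map a docking score to (0, 1); lower (better) scores map to higher values.

    Assumed form: 1 / (1 + 10 ** (k * (x - (high + low) / 2) / (high - low))).
    """
    mid = (high + low) / 2.0
    return 1.0 / (1.0 + 10.0 ** (k * (x - mid) / (high - low)))

midpoint = reverse_sigmoid(-10.0)   # exponent is zero at the midpoint
strong = reverse_sigmoid(-12.0)     # more negative docking score
weak = reverse_sigmoid(-8.0)        # less negative docking score
```

Under this reading, a docking score at the midpoint of l and h maps to 0.5, and more negative docking scores receive larger transformed scores, which is the direction the agent is rewarded for.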

The Moses pretrained prior model was also used to initiate the agent on this task. The agent was trained for 1,000 steps and 64 molecules were sampled and scored during each step. 10,000 molecules were sampled from the agent model after the final step for property distribution analysis.

The Reinvent model ¹² was also used as the baseline on this task. The default hyper-parameters of Reinvent were used and the same scoring function of the SGPT-RL agent was used for comparison. This model was trained for 1,000 steps with 64 molecules generated during each step.

Scaffold analysis

To analyze the scaffold overlaps of the prior models, we clustered the scaffolds of the generated molecules and the training reference using the Butina method in RDKit. ³⁰,³⁵ The molecules from the different sources were merged, and invalid and duplicated molecules were removed. Murcko scaffolds were obtained using RDKit and clustered using Morgan fingerprints as inputs. A minimum distance of 0.2 was used during clustering. A Venn diagram was used to visualize the numbers of overlapping and unique clusters. Examples of molecules were visualized using ChemDraw 20.1. ³⁶ Open-source alternatives to ChemDraw are also available.
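As an illustration of the clustering step, the following pure-Python sketch implements the Taylor-Butina procedure on toy fingerprints. It is not the RDKit routine used in the study: fingerprints are represented as hypothetical sets of on-bit indices, the function name `butina` is illustrative, and treating the value of 0.2 as an inclusive Tanimoto distance cutoff is an assumption.

```python
def butina(fps, cutoff=0.2):
    """Taylor-Butina clustering of fingerprints given as sets of on-bits.

    Two items are neighbors when their Tanimoto distance is <= cutoff.
    Returns a list of clusters; each cluster lists indices, centroid first.
    """
    n = len(fps)

    def dist(i, j):
        union = len(fps[i] | fps[j])
        sim = len(fps[i] & fps[j]) / union if union else 0.0
        return 1.0 - sim

    # Neighbor list of every item within the distance cutoff.
    neighbors = [[j for j in range(n) if j != i and dist(i, j) <= cutoff]
                 for i in range(n)]
    # Items with the most neighbors become centroids first.
    order = sorted(range(n), key=lambda i: len(neighbors[i]), reverse=True)
    assigned = set()
    clusters = []
    for i in order:
        if i in assigned:
            continue
        members = [i] + [j for j in neighbors[i] if j not in assigned]
        assigned.update(members)
        clusters.append(members)
    return clusters

# Toy data: three similar fingerprints and a separate identical pair.
fps = [{1, 2, 3, 4, 5}, {1, 2, 3, 4, 5, 6}, {1, 2, 3, 4, 5, 7},
       {10, 11}, {10, 11}]
clusters = butina(fps)
```

In the toy data, item 0 lies within the cutoff of items 1 and 2 and therefore becomes the centroid of the first cluster, even though items 1 and 2 are farther than the cutoff from each other.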

To analyze the average number of rings and the number of explored scaffolds in Figures 3 and 4, RDKit ³⁰ was used to obtain the Murcko Scaffold and calculate the number of rings for each generated molecule. The duplicated scaffolds were removed before counting the scaffolds.

Figure 2. Scaffold overlaps of the prior models.

a) The scaffold overlaps between the training reference and molecules generated by the SGPT-RL and Reinvent prior models. Both SGPT-RL and Reinvent were able to generate molecules with novel scaffolds that did not appear in the training reference. b) Representative molecules with unique scaffolds from the three sources. The three rows correspond to training reference only (TR), SGPT-RL prior only (SP), and Reinvent prior only (RP) molecules, respectively.

Figure 3. Comparison of SGPT-RL and Reinvent on the DRD2 task.

a-b) Improvements of validity and DRD2 activity during the RL process. SGPT-RL was slower than Reinvent in generating molecules with good validity and DRD2 activity. c) Average number of rings in the generated molecules across the RL steps. SGPT-RL gradually increased the number of rings in the generated molecules during the RL process. It generated molecules with fewer rings than Reinvent in the beginning, but with more rings in the end. d) Accumulated number of unique scaffolds in the generated molecules during the RL process. SGPT-RL explored more scaffolds than Reinvent. e) The distribution of predicted DRD2 activities. Both SGPT-RL and Reinvent agents were able to generate molecules with high DRD2 activities. f) The distribution of synthetic accessibility scores (SAscore). 10,000 molecules were sampled from the training dataset as the reference (Training ref.).

Results

Learning the chemical space with a GPT prior model

The first step of our workflow is to train a prior model to learn the chemical space. For this purpose, the dataset from the Moses benchmark ²³ was used to train the prior model. We used the Moses dataset because its molecules are lead-like and have good chemical properties. A GPT model with ∼6M parameters was used as the prior model, details of which are described in Subsection “The prior network”. The Reinvent prior model ¹² (GRU) was trained on the same dataset for comparison. 10,000 molecules were randomly sampled from the training dataset to be used as the training reference.

A comparison of different models on the Moses distribution learning benchmark ²³ is shown in Supplementary Table 1 in Extended data. ⁴² Five Moses metrics, including validity, uniqueness, similarity to the nearest neighbor (SNN), internal diversity (IntDiv), and novelty, were selected for comparison. From the table, we found that the SGPT-RL prior model achieved a relatively good validity (0.936), uniqueness (0.997), and novelty (0.946). Though the Reinvent prior model achieved a better validity (0.986) and uniqueness (1.000), it obtained a poor novelty (0.783). The other two transformer-based methods, MCMG and MolGPT, also achieved a good novelty (0.983 and 0.931 respectively).

The property distributions of the training reference and of the molecules sampled from the SGPT-RL and Reinvent prior models are shown in Supplementary Figure 1 in Extended data. ⁴² Six selected properties, including DRD2 activity, ACE2 docking score, QED, synthetic accessibility score (SAscore), length of SMILES strings, and molecular weight, were used for comparison. Details on the calculation of these properties are described in Subsection “Evaluated molecular properties”. From this figure, we can see that both prior models learned property distributions similar to those of the training reference. For molecular weight, the distribution curve of the SGPT-RL prior is closer to the training reference than that of the Reinvent prior.

To compare the generative preferences of the SGPT-RL and the Reinvent prior models, we analyzed the scaffolds of the generated molecules. The overlapping scaffolds and unique scaffolds from each source were visualized using a Venn diagram as shown in Figure 2a. From this diagram, we found that both the SGPT-RL and the Reinvent prior models were able to recall scaffolds from the training reference and generate many molecules with novel scaffolds. Several examples of the generated molecules and training samples are shown in Figure 2b.

Optimizing the scores of a QSAR model through RL

In our experiments, we evaluated SGPT-RL for goal-directed generation with two tasks, a DRD2 task, which used a quantitative structure-activity relationship (QSAR) model ¹² as the scoring function, and an ACE2 task, which used a docking score calculated from AutoDock Vina ³⁴ as the scoring function.

DRD2 is one of the most well-studied drug targets, with many active chemicals reported. ²⁵,³⁷ A QSAR model has been proposed for DRD2 activity prediction. ¹² In this task, the SGPT-RL prior model pretrained on the Moses dataset was used to initiate the agent, and the agent was trained via RL to optimize the generation of molecules towards good DRD2 activities. The Reinvent model was trained with default hyper-parameters for comparison. ¹² Details on the training of the agents are given in Subsection “Training the agent”. The sigma hyper-parameter of SGPT-RL was tuned as shown in Supplementary Results in Extended data, ⁴² and a value of 60 was chosen for this agent.

The learning curves of the agent models on the DRD2 task are shown in Figure 3. Figures 3a-b show that both agents reached good validity and DRD2 activity after 200 steps. The Reinvent agent took fewer steps to obtain good DRD2 activity than the SGPT-RL agent. Figures 3c-d show that the SGPT-RL agent gradually increased the number of rings during generation and explored more scaffolds within the first 200 steps. The main difference in scaffold exploration between the two agents occurs between steps 100 and 200. The Reinvent agent showed little further improvement in the goal after around 100 steps, whereas the SGPT-RL agent continued to learn and improve after that.

The agent models obtained after the final training step were also evaluated on the Moses benchmark, as shown in Table 1. The Moses metrics of MCMG were taken from the original paper for comparison. ⁴ We found that the SGPT-RL agent achieved better validity and novelty, while the Reinvent model obtained better internal diversity.

Table 1. Moses metrics of the agent models on the DRD2 task.

SGPT-RL generated molecules with good validity and novelty. SNN, similarity to a nearest neighbor; IntDiv, internal diversity; MCMG, multi-constraints molecular generation.

Model	Validity	Uniqueness	SNN	IntDiv	Novelty
Reinvent	0.997	0.880	0.508	0.709	0.992
MCMG	-	0.972	0.541	0.709	0.992
SGPT-RL	0.998	0.933	0.515	0.683	0.995
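For orientation, the sketch below shows how validity, uniqueness, and novelty are commonly computed from lists of SMILES strings. It is a simplified illustration, not the Moses implementation: it assumes that uniqueness is measured among valid molecules and novelty among unique valid molecules, that strings are already canonical so that string equality means molecular identity, and that a validity check is supplied by the caller. The toy `is_valid` function used here is a hypothetical stand-in for a real SMILES parser.

```python
from typing import Callable, Dict, Iterable


def generation_metrics(
    generated: Iterable[str],
    training: Iterable[str],
    is_valid: Callable[[str], bool],
) -> Dict[str, float]:
    """Validity, uniqueness, and novelty of generated SMILES (simplified sketch)."""
    generated = list(generated)
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


def toy_is_valid(smiles: str) -> bool:
    """Hypothetical stand-in: only checks that parentheses are balanced."""
    return smiles.count("(") == smiles.count(")")


metrics = generation_metrics(
    generated=["CCO", "CCO", "c1ccccc1", "C(", "CCN"],
    training=["CCO"],
    is_valid=toy_is_valid,
)
```

In practice the validity check would be a cheminformatics parser, and the strings would be canonicalized before comparison.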

The property distributions of the training reference and molecules sampled from the final SGPT-RL and Reinvent agents were also compared in this task, as shown in Figure 3e. ⁴² The properties analyzed include DRD2 activity, QED, SAscore, LogP, length of SMILES strings, and molecular weight. We found that both SGPT-RL and Reinvent could generate molecules with good DRD2 activities after the final steps, whereas the molecules in training reference have poor DRD2 activities. The property distributions of the molecules generated by the SGPT-RL and Reinvent agents are similar. Figure 3f shows that both agents shifted the SAscore distributions to the left, which means generating molecules that are relatively harder to synthesize than the molecules in the training reference.

Generating molecules to optimize docking scores

In this task, we aimed to generate novel molecules targeting ACE2, a receptor protein to which SARS-CoV and SARS-CoV-2 bind to enter a cell. ³⁸, ³⁹ Only 56 unique molecules were reported to be active against ACE2 in ExCAPE-DB. ²⁵ For such targets, where few known active molecules are available, it is not possible to build a reliable QSAR model to predict activity. To find molecules that bind targets like ACE2, structure-based docking methods are widely used to evaluate affinities. In this study, the ACE2 affinity of a molecule was evaluated as the minimum binding free energy calculated by AutoDock Vina. ³⁴ Details on the calculation of ACE2 affinity can be found in Subsection “Evaluated molecular properties”. The pocket in which XX5 is located in the 3D structure of the human ACE2 receptor (PDB ID 1R4L ⁴⁰) was used for ligand docking. The prior model trained on the Moses dataset ²³ was also used to initiate this agent, and the agent was trained for 1,000 steps. The Reinvent model was also trained on this task for a fair comparison.
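As an illustration of taking the minimum binding free energy over docked poses, the sketch below parses the `REMARK VINA RESULT:` lines that AutoDock Vina writes into its output PDBQT file, one per pose, and returns the lowest energy. This is an assumption about how the score could be extracted, not the authors' pipeline; the function name and the example energies are hypothetical.

```python
from typing import Optional


def min_vina_energy(pdbqt_text: str) -> Optional[float]:
    """Return the lowest predicted binding energy (kcal/mol) over all poses, or None."""
    energies = []
    for line in pdbqt_text.splitlines():
        if line.startswith("REMARK VINA RESULT:"):
            # The first number after the tag is the predicted binding energy.
            energies.append(float(line.split(":", 1)[1].split()[0]))
    return min(energies) if energies else None


# Made-up two-pose output, for illustration only.
example_output = """MODEL 1
REMARK VINA RESULT:      -8.1      0.000      0.000
ENDMDL
MODEL 2
REMARK VINA RESULT:      -7.6      1.842      2.913
ENDMDL
"""
best_energy = min_vina_energy(example_output)
```

Because more negative energies indicate stronger predicted binding, a better docking score corresponds to a lower value.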

The learning curves of the agent models are shown in Figure 4. The SGPT-RL agent was able to generate valid molecules with good ACE2 docking scores after 200 steps. As in the DRD2 task, the Reinvent model did not learn efficiently after around 100 steps in the ACE2 task, and the docking scores of its generated molecules did not clearly improve after that. We also observed that SGPT-RL gradually increased the number of rings in the exploration process, as shown in Figure 4c. Examples of molecules generated by SGPT-RL during the initial exploration steps are shown in Figure 5. The SGPT-RL agent generated molecules with few rings in the first step and gradually increased the number of rings. The Reinvent agent explored molecules randomly, and no clear pattern can be observed, as shown in Supplementary Figure 7 in Extended data. ⁴²

Figure 4. Comparison of SGPT-RL and Reinvent on the ACE2 task.

a-b) Improvements of validity and ACE2 docking scores during the RL process. SGPT-RL generated molecules with better validity and ACE2 docking scores than Reinvent after 200 steps. c) Averaged number of rings in the generated molecules in the RL steps. SGPT-RL gradually increased the number of chemical rings of the molecules. The curve difference in c is highly correlated with the curve difference in b (Pearson’s r = 0.87). d) Accumulated number of unique scaffolds in the generated molecules during the RL process. Both SGPT-RL and Reinvent generated new scaffolds with increasing steps. e) The distribution of ACE2 docking scores. SGPT-RL shifted the distribution towards better docking scores. f) The distribution of SAscore.
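The correlation quoted in the caption can be reproduced in principle by taking the per-step difference between the two agents' curves in each panel and computing Pearson's r between the two difference series. The sketch below assumes this reading of “curve difference”; the helper names and the per-step values are made up for illustration and are not the study's data.

```python
from math import sqrt


def curve_difference(curve_a, curve_b):
    """Per-step difference between two learning curves of equal length."""
    return [a - b for a, b in zip(curve_a, curve_b)]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Made-up per-step values (ring counts and score improvements), for illustration only.
rings_diff = curve_difference([1.0, 2.0, 3.5, 4.0], [1.0, 1.5, 2.0, 2.0])
score_diff = curve_difference([0.1, 0.6, 1.4, 2.1], [0.1, 0.2, 0.3, 0.3])
r = pearson_r(rings_diff, score_diff)
```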

Figure 5. Examples of scaffolds explored by SGPT-RL in the initial steps of the ACE2 task.

The SGPT-RL agent generated molecules with few rings in the beginning, and gradually increased the number of rings. DS, docking score.

The final agents were evaluated on the Moses metrics, as shown in Table 2. The SGPT-RL agent achieved good validity (0.990) and novelty (1.000), while Reinvent was better on SNN and internal diversity. The property distributions were also plotted for the two agents. Six properties (ACE2 docking score, QED, SAscore, LogP, length of SMILES string, and molecular weight) were analyzed, as shown in Supplementary Figure 8 in Extended data. ⁴² Calculations of these properties are described in Subsection “Evaluated molecular properties”. Figure 4e ⁴² shows that the SGPT-RL agent was able to generate molecules with good docking scores and clearly shifted the distribution curve to the left. The ACE2 docking scores of the SGPT-RL generated molecules were better than those of the training reference and of the Reinvent generated molecules. Supplementary Figure 9 in Extended data ⁴² shows some examples of molecules generated by the agents in the last step. The SGPT-RL generated molecules are more similar to each other than the Reinvent generated molecules are. These molecules suggest that SGPT-RL develops certain structural preferences in this task, such as a naphthalene structure at one end of the molecule.

Table 2. Moses metrics of the agents on the ACE2 task.

SNN, similarity to a nearest neighbor; IntDiv, internal diversity.

Model	Validity	Uniqueness	SNN	IntDiv	Novelty
Reinvent	0.875	0.987	0.560	0.816	0.976
SGPT-RL	0.990	0.986	0.466	0.797	1.000
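The two similarity-based metrics in the table can be sketched with Tanimoto similarity on fingerprint bit sets. This is a simplified illustration under stated assumptions, not the Moses implementation: fingerprints are represented as sets of on-bit indices, SNN is taken as the mean over generated molecules of the highest similarity to any reference molecule, and internal diversity is taken as one minus the mean similarity over distinct pairs of generated molecules (implementations differ on whether self-pairs are included). The fingerprints below are placeholders.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def snn(generated, reference) -> float:
    """Mean similarity of each generated molecule to its nearest reference neighbor."""
    return sum(max(tanimoto(g, r) for r in reference) for g in generated) / len(generated)


def internal_diversity(generated) -> float:
    """One minus the mean pairwise similarity over distinct pairs."""
    pairs = list(combinations(generated, 2))
    return 1.0 - sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Placeholder fingerprints, for illustration only.
gen_fps = [frozenset({1, 2, 3}), frozenset({2, 3, 4})]
ref_fps = [frozenset({1, 2, 3}), frozenset({7, 8})]
snn_value = snn(gen_fps, ref_fps)
intdiv_value = internal_diversity(gen_fps)
```

Read this way, a lower SNN means the generated molecules lie further from the reference set, and a lower internal diversity means they are more similar to each other.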

The six molecules with the best docking scores generated by each agent are shown in Figure 6. The SGPT-RL agent was able to generate more molecules with high docking affinities than the Reinvent agent. In addition, five of the top six molecules generated by SGPT-RL contain a naphthalene structure at one end. Given that the same pattern appears in the molecules generated by SGPT-RL in the last step, we hypothesize that the agent learned this pattern during the exploration process. In contrast, the top-scoring molecules generated by the Reinvent agent appear largely random, and no clear scaffold pattern can be observed.

Figure 6. Top scoring molecules generated in the ACE2 task.

The SGPT-RL generated molecules are more similar to each other in comparison with the Reinvent generated molecules. DS, docking score.

Discussion

In this study, we developed a tool named SGPT-RL for de novo molecular generation, which uses a transformer decoder as the policy network of the reinforcement learning (RL) agent. A workstation with two A100 GPUs was used for our experiments. The docking score was used as a scoring function in addition to a QSAR-based scoring function. This enabled us to explore not only a target with many known active molecules but also a new target with few known actives.

We evaluated SGPT-RL on two goal-directed generation tasks, a DRD2 task and an ACE2 task. As many known DRD2 actives are available, it is possible to build a reliable QSAR model to be used as the scoring function in the DRD2 task. However, few known actives were reported for ACE2, so Vina docking scores had to be used as the optimization goal in the ACE2 task. Our experiments showed that both SGPT-RL (which uses GPT as the policy network) and Reinvent (which uses GRU as the policy network) were able to propose molecules with improved scores on the two tasks. However, the SGPT-RL generated molecules showed significantly better scores on the ACE2 task than the Reinvent generated ones (p-value: 0.0). Because molecular docking is widely used in virtual screening, we believe that the superior performance of SGPT-RL on the ACE2 task indicates its applicability to practical molecular design.

We also found three generative differences between the SGPT-RL and Reinvent agents during the exploration steps. First, Reinvent generally explored with strong randomness in both tasks, whereas SGPT-RL explored scaffolds gradually during generation. In the initial steps, SGPT-RL generated molecules with few rings and gradually increased the number of rings; in the late steps, it generated molecules with some conserved scaffold patterns, such as double-ring structures in the ACE2 task. Second, Reinvent did not clearly improve the goal after around 100 steps, while SGPT-RL continued to optimize the scores even after 400 steps. We believe that this difference is mainly caused by the difference in policy networks: it is not easy for a GRU to learn ring patterns, which are represented in SMILES by ring-closure digits that can be far apart in the string, whereas GPT can learn long-range dependencies and thus remember the ring patterns that improved scores in previous steps. Third, the SGPT-RL agent could generate molecules with more rings than the Reinvent agent in the ACE2 task (Figure 4c). A diverse number of rings indicates a variety of scaffold structures. Considering the importance of appropriate scaffolds in lead identification, ⁴¹ we believe that using GPT as the policy network in RL agents might be useful for discovering lead candidates with novel scaffolds.
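The long-range nature of ring notation in SMILES can be made concrete with a small example. In the sketch below, each ring bond is written as a pair of matching ring-closure labels, and the distance between the opening and closing label grows with ring size and with fused rings. This is an illustrative simplification: the function name is hypothetical, positions are counted in characters rather than model tokens, and only bracket atoms and two-digit `%nn` labels are given special handling.

```python
def ring_closure_spans(smiles: str):
    """Return (label, open_index, close_index) for each ring-closure pair in a SMILES string."""
    open_at = {}
    spans = []
    in_bracket = False
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            in_bracket = True  # digits inside brackets are isotopes, counts, or charges
        elif ch == "]":
            in_bracket = False
        elif not in_bracket:
            label, width = None, 1
            two_digit = smiles[i + 1:i + 3]
            if ch == "%" and len(two_digit) == 2 and two_digit.isdigit():
                label, width = two_digit, 3
            elif ch.isdigit():
                label = ch
            if label is not None:
                if label in open_at:
                    spans.append((label, open_at.pop(label), i))
                else:
                    open_at[label] = i
                i += width
                continue
        i += 1
    return spans


# Naphthalene: ring label 1 opens at index 1 and only closes at index 13.
naphthalene_spans = ring_closure_spans("c1ccc2ccccc2c1")
```

For naphthalene, the label `1` is opened near the start of the string and closed at its very end, so a model must carry that open ring across every character in between to produce a valid molecule.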

While the results of our work are noteworthy, there are two limitations to consider. First, the dataset used to train the prior models limits the generative results. All the prior models were pretrained on the Moses dataset. ²³ As the Moses dataset was collected from the Zinc database, ²⁴ which mainly consists of lead-like molecules, the prior distribution cannot represent the entire chemical space. The prior models were used to guide the agents in the two optimization tasks, and bias in the prior models might contribute to bias in the agent models. Such bias might be beneficial, because it helps to generate molecules with lead-like properties, such as good synthetic accessibility and drug-likeness; however, it might also be undesirable, as it limits the chemical space that the agents explore. In tasks that aim to explore beyond the space of lead-like molecules, other training data should be used to train the prior models. Second, the settings of the docking experiments are also a limitation. We analyzed only ACE2 for docking; docking experiments on additional targets would further confirm the observations in our study.

As molecular docking is widely used for virtual screening, generative models combined with molecular docking provide another solution for the virtual screening process. The superior performance of SGPT-RL on the ACE2 task indicates that it can be applied to this practical molecular design process to propose novel molecules with good target-binding capabilities. In addition, SGPT-RL explored the chemical space with certain scaffold patterns. The patterns learned by SGPT-RL can provide intuitions for chemists to explore, thus aiding molecular design.

Data availability

Underlying data

Protein Data Bank: 3D structure of the human ACE2 receptor. Accession number 1R4L; https://www.rcsb.org/structure/1R4L.

The dataset to train the prior models was obtained from the Moses benchmark. ²³ This dataset contains 1.9 million lead-like molecules from the Zinc database, and is available to readers here: https://github.com/molecularsets/moses. The train and test dataset in the Moses benchmark, used here for training and testing, contains 1,584,664 and 176,075 molecules respectively. Moses is licensed under MIT license (redistribution permitted).

The 8,036 unique molecules that are known to be active against DRD2 and the 56 unique molecules that are active against ACE2 were downloaded from ExCAPE-DB, ²⁵ which is licensed under the Creative Commons Attribution 4.0 International License (redistribution permitted).

The specific underlying data used in this study have been uploaded by the authors to Zenodo (see below).

Zenodo: Optimization of binding affinities in chemical space with transformer and deep reinforcement learning -- source data. https://doi.org/10.5281/zenodo.10654313. ⁴²

This project contains the following underlying data:

Data.zip (the Moses dataset, the DRD2 and ACE2 active molecules, the pretrained models, and the source data underlying Figures 3–4).

Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).

Extended data

Zenodo: Optimization of binding affinities in chemical space with transformer and deep reinforcement learning -- source data https://doi.org/10.5281/zenodo.10654313. ⁴²

This project contains the following extended data:

SGPT_SI.pdf (supplementary results, tables, and figures).

Sgpt-rl.png (the workflow of SGPT-RL).

Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).

Software availability

Source code available from: https://github.com/charlesxu90/sgpt

Archived source code at time of publication: https://doi.org/10.5281/zenodo.7612354. ⁴³

License: MIT

References

1. Nicolaou, Brown: Multi-objective optimization methods in drug design. Drug Discov. Today Technol. 2013;10(3):e427–e435. doi: 10.1016/j.ddtec.2013.02.001

2. Hughes, Stephen Rees, Kalindjian: Principles of early drug discovery. Br. J. Pharmacol. 2011;162(6):1239–1249. PMID: 21091654; doi: 10.1111/j.1476-5381.2010.01127.x; PMCID: PMC3058157

3. Elton, Boukouvalas, Fuge: Deep learning for molecular design—a review of the state of the art. Molecular Systems Design & Engineering. 2019;4(4):828–849. doi: 10.1039/C9ME00039A

4. Wang, Hsieh C-Y, Wang: Multi-constraint molecular generation based on conditional transformer, knowledge distillation and reinforcement learning. Nat. Mach. Intell. 2021;3(10):914–922. doi: 10.1038/s42256-021-00403-1

5. Butler, Davies, Cartwright: Machine learning for molecular and materials science. Nature. 2018;559(7715):547–555. doi: 10.1038/s41586-018-0337-2

6. Ståhl, Falkman, Karlsson: Deep reinforcement learning for multiparameter optimization in de novo drug design. J. Chem. Inf. Model. 2019;59(7):3166–3176. doi: 10.1021/acs.jcim.9b00325

7. Hoffmann, Gastreich: The next level in chemical space navigation: going far beyond enumerable compound libraries. Drug Discov. Today. 2019;24(5):1148–1156. PMID: 30851414; doi: 10.1016/j.drudis.2019.02.013

8. Xia, Jianxing, Wang: Graph-based generative models for de novo drug design. Drug Discov. Today Technol. 2019;32:45–53.

9. Vanhaelen, Lin Y-C, Zhavoronkov: The advent of generative chemistry. ACS Med. Chem. Lett. 2020;11(8):1496–1505. PMID: 32832015; doi: 10.1021/acsmedchemlett.0c00088; PMCID: PMC7429972

10. Bai, Liu, Tian: Application advances of deep learning methods for de novo drug design and molecular dynamics simulation. Wiley Interdiscip. Rev. Comput. Mol. Sci. 2022;12(3):e1581. doi: 10.1002/wcms.1581

11. Gómez-Bombarelli, Wei, Duvenaud: Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science. 2018;4(2):268–276. PMID: 29532027; doi: 10.1021/acscentsci.7b00572; PMCID: PMC5833007

12. Olivecrona, Blaschke, Engkvist: Molecular de-novo design through deep reinforcement learning. J. Cheminform. 2017;9(1):1–14. doi: 10.1186/s13321-017-0235-x

13. Blaschke, Arús-Pous, Chen: Reinvent 2.0: an AI tool for de novo drug design. J. Chem. Inf. Model. 2020;60(12):5918–5922. doi: 10.1021/acs.jcim.0c00915

14. Vaswani, Shazeer, Parmar: Attention is all you need. Adv. Neural Inf. Proces. Syst. 2017;30.

15. Lin, Wang, Liu: A survey of transformers. arXiv preprint arXiv:2106.04554. 2021.

16. Radford, Narasimhan, Salimans: Improving language understanding by generative pre-training. arXiv preprint. 2018.

17. Brown, Mann, Ryder: Language models are few-shot learners. Adv. Neural Inf. Proces. Syst. 2020;33:1877–1901.

18. Ouyang, Jiang: Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155. 2022.

19. Grechishnikova: Transformer neural network for protein-specific de novo drug generation as a machine translation problem. Sci. Rep. 2021;11(1):1–13. doi: 10.1038/s41598-020-79682-4

20. Bagal, Aggarwal, Vinod: MolGPT: molecular generation using a transformer-decoder model. J. Chem. Inf. Model. 2021;62(9):2064–2076. PMID: 34694798; doi: 10.1021/acs.jcim.1c00600

21. You, Sandström: Molecular optimization by capturing chemist’s intuition using deep neural networks. J. Cheminform. 2021;13(1):1–17. doi: 10.1186/s13321-021-00497-0

22. Boitreaud, Mallet, Oliver: OptiMol: optimization of binding affinities in chemical space for drug discovery. J. Chem. Inf. Model. 2020;60(12):5658–5666. PMID: 32986426; doi: 10.1021/acs.jcim.0c00833

23. Polykovskiy, Zhebrak, Sanchez-Lengeling: Molecular sets (MOSES): a benchmarking platform for molecular generation models. Front. Pharmacol. 2020;11:1931.

24. Irwin, Shoichet: ZINC - a free database of commercially available compounds for virtual screening. J. Chem. Inf. Model. 2005;45(1):177–182. PMID: 15667143; doi: 10.1021/ci049714+; PMCID: PMC1360656

25. Sun, Jeliazkova, Chupakhin: ExCAPE-DB: an integrated large scale dataset facilitating big data analysis in chemogenomics. J. Cheminform. 2017;9(1):1–9.

26. Radford, Jeffrey, Child: Language models are unsupervised multitask learners. OpenAI blog. 2019;1(8):9.

27. Loshchilov, Hutter: Decoupled weight decay regularization. International Conference on Learning Representations. 2019.

28. Ertl, Schuffenhauer: Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J. Cheminform. 2009;1(1):1–11. doi: 10.1186/1758-2946-1-8

29. Richard Bickerton, Paolini, Besnard: Quantifying the chemical beauty of drugs. Nat. Chem. 2012;4(2):90–98. PMID: 22270643; doi: 10.1038/nchem.1243; PMCID: PMC3524573

30. Landrum: RDKit: a software suite for cheminformatics, computational chemistry, and predictive modeling. 2013.

31. DeLano: PyMOL: an open-source molecular graphics tool. CCP4 Newsl. Protein Crystallogr. 2002;40(1):82–92.

32. Morris, Huey, Lindstrom: AutoDock4 and AutoDockTools4: automated docking with selective receptor flexibility. J. Comput. Chem. 2009;30(16):2785–2791. PMID: 19399780; doi: 10.1002/jcc.21256; PMCID: PMC2760638

33. O’Boyle, Banck, James: Open Babel: an open chemical toolbox. J. Cheminform. 2011;3(1):1–14.

34. Trott, Olson: AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J. Comput. Chem. 2010;31(2):455–461.

35. Butina: Unsupervised data base clustering based on Daylight’s fingerprint and Tanimoto similarity: a fast and automated way to cluster small and large data sets. J. Chem. Inf. Comput. Sci. 1999;39(4):747–750. doi: 10.1021/ci9803381

36. Mills: ChemDraw Ultra 10.0 CambridgeSoft, 100 CambridgePark Drive, Cambridge, MA 02140. 2006. Commercial price: $1910 for download, $2150 for CD-ROM; academic price: $710 for download, $800 for CD-ROM.

37. GeneCards: DRD2 Gene - Dopamine Receptor D2. 2022.

38. Zhou, Yang X-L, Wang X-G: A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature. 2020;579(7798):270–273. PMID: 32015507; doi: 10.1038/s41586-020-2012-7; PMCID: PMC7095418

39. Napolitano, Xiaopeng, Gao: Impact of computational approaches in the fight against COVID-19: an AI guided review of 17 000 studies. Brief. Bioinform. 2022;23(1):bbab456. PMID: 34788381; doi: 10.1093/bib/bbab456; PMCID: PMC8689952

40. Towler, Staker, Prasad: ACE2 X-ray structures reveal a large hinge-bending motion important for inhibitor binding and catalysis. J. Biol. Chem. 2004;279(17):17996–18007. PMID: 14754895; doi: 10.1074/jbc.M311191200; PMCID: PMC7980034

41. Zhao: Scaffold selection and scaffold hopping in lead generation: a medicinal chemistry perspective. Drug Discov. Today. 2007;12(3-4):149–155. PMID: 17275735; doi: 10.1016/j.drudis.2006.12.003

42. Zhou, Zhu: Optimization of binding affinities in chemical space with generative pre-trained transformer and deep reinforcement learning -- source data (v1.2.4). Zenodo. 2023. doi: 10.5281/zenodo.10654313

43. Zhou, Zhu: Optimization of binding affinities in chemical space with generative pre-trained transformer and deep reinforcement learning -- source code (v1.2.0). Zenodo. 2023. doi: 10.5281/zenodo.7612354

10.5256/f1000research.162639.r248667

Reviewer response for version 2

Wang

Jianmin

1 Referee https://orcid.org/0000-0001-8910-0929 1Yonsei University, Seodaemun-gu, Seoul, South Korea

Competing interests: No competing interests were disclosed.

29 2 2024

2024

This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

recommendation

approve

No further comment. Thank you for your kind responses.

Is the work clearly and accurately presented and does it cite the current literature?

Yes

If applicable, is the statistical analysis and its interpretation appropriate?

Yes

Are all the source data underlying the results available to ensure full reproducibility?

Yes

Is the study design appropriate and is the work technically sound?

Yes

Are the conclusions drawn adequately supported by the results?

Yes

Are sufficient details of methods and analysis provided to allow replication by others?

Yes

Reviewer Expertise:

drug design, deep learning

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

10.5256/f1000research.162639.r248666

Reviewer response for version 2

Bai

Qifeng

1 Referee https://orcid.org/0000-0001-7296-6187 1Lanzhou University, Lanzhou, Gansu, China

Competing interests: No competing interests were disclosed.

23 2 2024

2024

recommendation

approve

Good work. Please accept it.

Is the work clearly and accurately presented and does it cite the current literature?

Yes

If applicable, is the statistical analysis and its interpretation appropriate?

Yes

Are all the source data underlying the results available to ensure full reproducibility?

Yes

Is the study design appropriate and is the work technically sound?

Partly

Are the conclusions drawn adequately supported by the results?

Yes

Are sufficient details of methods and analysis provided to allow replication by others?

Yes

Reviewer Expertise:

deep learning, binding affinity and drug design,

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

10.5256/f1000research.162639.r248664

Reviewer response for version 2

Wong

Ka-Chun

1 Referee https://orcid.org/0000-0001-6062-733X 1Department of Computer Science, City University of Hong Kong, Kowloon Tong, Kowloon, Hong Kong

Competing interests: No competing interests were disclosed.

23 2 2024

2024

recommendation

approve

The authors have responded.

Is the work clearly and accurately presented and does it cite the current literature?

Yes

If applicable, is the statistical analysis and its interpretation appropriate?

Yes

Are all the source data underlying the results available to ensure full reproducibility?

Yes

Is the study design appropriate and is the work technically sound?

Yes

Are the conclusions drawn adequately supported by the results?

Yes

Are sufficient details of methods and analysis provided to allow replication by others?

Yes

Reviewer Expertise:

Bioinformatics

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

10.5256/f1000research.143734.r188009

Reviewer response for version 1

Tanrıverdi

Aslıhan Aycan

1 Referee https://orcid.org/0000-0001-5811-8253 1Kafkas University, Kars Merkez, Turkey

Competing interests: No competing interests were disclosed.

12 12 2023

2023

recommendation

approve-with-reservations

The authors published the paper entitled "Optimization of binding affinities in chemical space with generative pre-trained transformer and deep reinforcement learning [version 1; peer review: 1 approved with reservations]." The work is very comprehensive. An innovative article worth publishing. I want to congratulate the authors. There's just one point where I'm stuck.

***QSAR processing methodology should be given step by step in the methods section.

Is the work clearly and accurately presented and does it cite the current literature?

Yes

If applicable, is the statistical analysis and its interpretation appropriate?

Yes

Are all the source data underlying the results available to ensure full reproducibility?

Partly

Is the study design appropriate and is the work technically sound?

Yes

Are the conclusions drawn adequately supported by the results?

Yes

Are sufficient details of methods and analysis provided to allow replication by others?

Partly

Reviewer Expertise:

Polymer Synthesis and Characterization; Monomer Synthesis and Ch.; Quantum Chemistry; Molecular Modelling; Molecular Dynamic; Drug Design; Density Functional Theory; Atom in Molecules Analysis; Film Formation; Gel Formation

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Xiaopeng

King Abdullah University of Science and Technology, Saudi Arabia

Competing interests: The authors declare no conflicts of interest.

14 2 2024

We thank the Reviewer for sharing our aims and appreciating our efforts. Our response to the issue raised is given below.

***QSAR processing methodology should be given step by step in the methods section.

We thank the Reviewer for pointing out this issue. We added descriptions to explain the QSAR processing part, as presented below.

"In the modeling, a SMILES is converted into molecules to obtain the Morgan fingerprints using RDKit. The fingerprints were used as the features to build the SVM classifier. It predicts a probability score range from zero to one, with the closer to one the higher DRD2 activity. A molecule that cannot obtain valid fingerprints was assigned with a score of zero."

10.5256/f1000research.143734.r214522

Reviewer response for version 1

Bai

Qifeng

1 Referee https://orcid.org/0000-0001-7296-6187 1Lanzhou University, Lanzhou, Gansu, China

Competing interests: No competing interests were disclosed.

12 12 2023

2023

recommendation

approve-with-reservations

In this study, authors use generative pre-trained transformer and deep reinforcement learning to optimize the binding affinities in chemical space. I have some comments as follows:

1. I have checked the source codes https://github.com/charlesxu90/sgpt. The authors give a nice description for their models. I have an install question. Why do authors repeat to install “openbabel” by command: “sudo apt-get install -y openbabel” even though Conda can install openbabel?

2. Please check equation 1. There are some kinds of attention formulas. Do authors describe the correct attention formulas for their used pre-trained models?

3. To make the affinity introduction richer, authors can add more references about binding affinities with deep learning methods such as “Bai, Q, Liu, S, Tian, Y, Xu, T, Banegas-Luna, AJ, Pérez-Sánchez, H, et al. Application advances of deep learning methods for de novo drug design and molecular dynamics simulation. WIREs Comput Mol Sci. 2022; 12:e1581. https://doi.org/10.1002/wcms.1581 “

Is the work clearly and accurately presented and does it cite the current literature?

Yes

If applicable, is the statistical analysis and its interpretation appropriate?

Yes

Are all the source data underlying the results available to ensure full reproducibility?

Yes

Is the study design appropriate and is the work technically sound?

Partly

Are the conclusions drawn adequately supported by the results?

Yes

Are sufficient details of methods and analysis provided to allow replication by others?

Yes

Reviewer Expertise:

deep learning, binding affinity and drug design,

Xiaopeng

King Abdullah University of Science and Technology, Saudi Arabia

Competing interests: The authors declare no conflicts of interest.

14 2 2024

We thank the Reviewer for the summary and the helpful comments. Point by point responses to the issues are as follows.

We thank the Reviewer for looking into the code and pointing out this issue. We would also prefer to use the openbabel package from Conda; however, in our experiments we found that the default openbabel in Conda did not provide the functionality required, whereas the default openbabel installed in our system worked well. We believe this issue was due to the version of openbabel distributed in Conda at the time of our experiments.

2. Please check equation 1. There are some kinds of attention formulas. Do authors describe the correct attention formulas for their used pre-trained models?

We thank the Reviewer for pointing out this issue. We trained a generative pre-trained transformer (GPT) from scratch to learn the prior knowledge of molecular distributions. A GPT-2 model with the multi-head self-attention mechanism was used in our model. Equation 1 describes the attention mechanism, which is the core element of it.
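For readers without Equation 1 at hand, the sketch below assumes that it is the standard scaled dot-product attention of Vaswani et al., ¹⁴ softmax(QKᵀ/√d_k)V, and adds the causal mask typical of GPT-style decoders so that each position attends only to itself and earlier positions. It shows a single head in plain Python for clarity; the function name and the input values are hypothetical, and multi-head projection is omitted.

```python
import math


def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask (lists of row vectors)."""
    d_k = len(k[0])
    outputs = []
    for i, q_i in enumerate(q):
        # Position i may only attend to positions 0..i.
        scores = [
            sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        max_score = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - max_score) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append(
            [sum(weights[j] * v[j][c] for j in range(i + 1)) for c in range(len(v[0]))]
        )
    return outputs


# Toy two-token example, for illustration only.
out = causal_attention(
    q=[[1.0, 0.0], [0.0, 1.0]],
    k=[[1.0, 0.0], [0.0, 1.0]],
    v=[[1.0, 2.0], [3.0, 4.0]],
)
```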

We added this citation as suggested by the Reviewer.

10.5256/f1000research.143734.r214547

Reviewer response for version 1

Wang

Jianmin

1 Referee https://orcid.org/0000-0001-8910-0929 1Yonsei University, Seodaemun-gu, Seoul, South Korea

Competing interests: No competing interests were disclosed.

1 11 2023

2023

recommendation

approve-with-reservations

This paper introduces a method called SGPT-RL, which utilizes GPT as the policy network in the Reinvent approach to improve the optimization of binding affinities, such as DRD2 QSAR score and ACE2 docking score. The findings of the study indicate that GPT effectively learns about the chemical space and generates compounds that are both novel and valid, which is consistent with previous research. Furthermore, GPT proves to be proficient in learning ring patterns and successfully explores various scaffolds during the exploration process in both optimization tasks. Particularly in the ACE2 task, SGPT-RL outperforms by achieving superior docking scores and identifying specific patterns, like the presence of double-ring structures.

The study shows promise overall, with GPT being a robust generative model and drug design being an important area of application for generative AI. The manuscript is well composed, but certain improvements are necessary to address a few issues.

Major issues:

In this study, the authors compared MCMG in the DRD2 task but chose not to include it in the ACE2 task. It seems more logical to compare MCMG in both tasks. However, what might be the rationale behind excluding it from the ACE2 task comparison?

The clarity of the presented results is insufficient. In my opinion, Supplementary Figure 8 effectively demonstrates the distributions and should be included in the main content for clear comprehension. Figure 5, on the other hand, would be more suitable to be relocated to the supplementary material.

Minor issues:

The manuscript is burdened with too many explanations for common abbreviations, making it a tedious read. For example, the abbreviation "SGPT-RL" is explained repeatedly in each figure, and abbreviations like "RL," "DRD2," and "ACE2" are needlessly reiterated in the captions. It would be more efficient to provide explanations for these abbreviations only when they first appear in the captions, thus avoiding unnecessary repetition.

The authors should carefully review the paper to avoid any typos and grammatical errors. Specifically, 'the' should be included before 'Moses benchmark'; 'Similarity to the nearest neighbor (SNN)' in the Subsection 'Evaluation metrics' needs to be in lower case; “range in ^{1, 10}” should be “range in [1, 10]”.

Several spots are not fluent to read. For example, the first sentence in the “Model architecture” Subsection does not fit with the context and should be tuned. “see also Underlying data” doesn’t fit with the context as well.

Is the work clearly and accurately presented and does it cite the current literature?

Yes

If applicable, is the statistical analysis and its interpretation appropriate?

Yes

Are all the source data underlying the results available to ensure full reproducibility?

Yes

Is the study design appropriate and is the work technically sound?

Yes

Are the conclusions drawn adequately supported by the results?

Yes

Are sufficient details of methods and analysis provided to allow replication by others?

Yes

Reviewer Expertise:

drug design, deep learning

Xiaopeng

King Abdullah University of Science and Technology, Saudi Arabia

Competing interests: The authors declare no conflicts of interest

14 2 2024

We thank the Reviewer for sharing our aims and appreciating our efforts. Point by point responses to the issues are as follows.

Major issues:

1. In this study, the authors compared MCMG in the DRD2 task but chose not to include it in the ACE2 task. It seems more logical to compare MCMG in both tasks. However, what might be the rationale behind excluding it from the ACE2 task comparison?

We thank the Reviewer for pointing out this issue. Initially, we also wanted to compare against MCMG in both tasks. However, after careful investigation, we found that this was not feasible. MCMG relies on a Transformer decoder, trained on known binding molecules, whose knowledge is then distilled into a GRU. In the ACE2 task, however, we addressed a setting in which not enough known binding molecules exist. MCMG was not designed for such tasks and cannot be applied to this problem.

2. The clarity of the presented results is insufficient. In my opinion, Supplementary Figure 8 effectively demonstrates the distributions and should be included in the main content for clear comprehension. Figure 5, on the other hand, would be more suitable to be relocated to the supplementary material.

We thank the Reviewer for the kind advice. We included the main subfigures from Supplementary Figure 8 in the main text to showcase the improvement of properties during the optimization process. Figure 5 illustrates the increasing number of rings in the molecules generated in the first several steps. We think it is one of the most important findings in the results, so we would like to keep it in the main text.

Minor issues:

1. The manuscript is burdened with too many explanations for common abbreviations, making it a tedious read. For example, the abbreviation "SGPT-RL" is explained repeatedly in each figure, and abbreviations like "RL," "DRD2," and "ACE2" are needlessly reiterated in the captions. It would be more efficient to provide explanations for these abbreviations only when they first appear in the captions, thus avoiding unnecessary repetition.

We thank the Reviewer for pointing out this issue. We updated the paragraphs and captions and removed the repeated explanations to make the sentences more fluent to read.

2. The authors should carefully review the paper to avoid any typos and grammatical errors. Specifically, 'the' should be included before 'Moses benchmark'; 'Similarity to the nearest neighbor (SNN)' in the Subsection 'Evaluation metrics' needs to be in lower case; “range in1, 10” should be “range in [1, 10]”.

We thank the Reviewer for pointing out these typos and errors. We meticulously reviewed this article again, and fixed the errors and typos pointed out.

3. Several spots are not fluent to read. For example, the first sentence in the “Model architecture” Subsection does not fit with the context and should be tuned. “see also Underlying data” doesn’t fit with the context as well.

We thank the Reviewer for pointing out these spots. We updated the sentences to make them more fluent to read. Specifically, we removed the sentence “Please note that all code associated with this article is available in the Software availability section” and “see also Underlying data” within the paragraphs.

10.5256/f1000research.143734.r188001

Reviewer response for version 1

Wang

Guohua

1 Referee 1Key Laboratory of Tree Genetics and Breeding, Northeast Forestry University, Harbin, Heilongjiang, China

Competing interests: No competing interests were disclosed.

1 11 2023

2023

recommendation

approve-with-reservations

In this paper, the authors proposed SGPT-RL, a method that utilizes GPT as the policy network within the Reinvent approach to enhance the optimization of binding affinities, including DRD2 QSAR score and ACE2 docking score. The results of their study demonstrate that GPT effectively learns the chemical space, generating compounds with high novelty and validity, consistent with previous research. Notably, in both optimization tasks, GPT exhibits proficiency in learning ring patterns and successfully explores a wide range of scaffolds during the exploration process. Importantly, SGPT-RL outperforms in the ACE2 task by obtaining superior docking scores and identifying specific patterns, such as the presence of double ring structures.

Overall, this study is interesting, as GPT is the current hotspot in AI research and de novo drug design is one of the most successful cases in AI for science. The manuscript is also well written and easy to understand. But there are several issues which should be improved.

Firstly, there are an excessive number of explanations for common abbreviations in this manuscript, which makes it tedious to read. For instance, the abbreviation "SGPT-RL" is repeatedly explained in each of the figures. Similarly, abbreviations like "RL," "DRD2," and "ACE2" are unnecessarily reiterated many times in the captions. I believe it would be more effective to provide explanations for these abbreviations only during their initial occurrence in the captions, thereby avoiding repetitive explanations.

Secondly, in this study, the authors compared MCMG in the DRD2 task, but not in the ACE2 task. Wouldn't it be more natural to compare it in both tasks? What is the reason for excluding it from the comparison in the ACE2 task?

Thirdly, while going through the supplementary information, I came across Supplementary Figure 8, which serves as a clear illustration. I believe the author should incorporate it into the main content as it provides a clear explanation of the resulting distributions. Figure 5 should be relocated to the supplementary material instead.

Furthermore, the author should thoroughly proofread the paper for any typos and formatting errors. For instance, 'the' should be added before 'Moses benchmark'. In the first paragraph of Subsection 'Evaluation metrics', 'Similarity to a nearest neighbor (SNN)' should be corrected to 'similarity to a nearest neighbor (SNN)'.

Is the work clearly and accurately presented and does it cite the current literature?

Yes

If applicable, is the statistical analysis and its interpretation appropriate?

Yes

Are all the source data underlying the results available to ensure full reproducibility?

Yes

Is the study design appropriate and is the work technically sound?

Yes

Are the conclusions drawn adequately supported by the results?

Yes

Are sufficient details of methods and analysis provided to allow replication by others?

Yes

Reviewer Expertise:

Artificial intelligence in bioinformatics

Xiaopeng

King Abdullah University of Science and Technology, Saudi Arabia

Competing interests: The authors declare no conflicts of interest

14 2 2024

We thank the Reviewer for the summary, the acknowledgement of our novelty, and for the helpful comments. Point by point responses to the issues are as follows.

We thank the Reviewer for pointing out this issue. We updated the paragraphs and captions and removed the duplicated explanations to make the sentences more fluent to read.

We thank the Reviewer for pointing out these typos and errors. We meticulously reviewed this article again, and fixed the errors and typos pointed out.

10.5256/f1000research.143734.r188006

Reviewer response for version 1

Wong

Ka-Chun

1 Referee https://orcid.org/0000-0001-6062-733X 1Department of Computer Science, City University of Hong Kong, Kowloon Tong, Kowloon, Hong Kong

Competing interests: No competing interests were disclosed.

20 7 2023

2023

recommendation

approve-with-reservations

The authors proposed a method, SGPT-RL, to optimize SMILES sequences to improve binding affinities by incorporating GPT into a reinforcement learning (RL) framework. The authors trained a GPT model as a prior model to learn the chemical space by pretraining on Moses SMILES, and then trained two RL models, one for DRD2 QSAR scores and the other for ACE2 docking scores, to generate SMILES with good binding affinities. The results show that the GPT prior model learned a good distribution of the chemical space. The RL models were able to generate SMILES sequences with good binding affinities. In addition, SGPT-RL generated sequences with better docking scores than Reinvent and was able to learn certain patterns during the RL process. There are a few considerations that could be addressed:

Major issues:

1. The manuscript includes repetitive explanations of abbreviations, such as SGPT-RL, DRD2, ACE2, and SMILES, throughout the passages and captions. This not only hinders the flow of reading but also makes it tedious to navigate through. To enhance readability, I suggest minimizing the frequency of explanations and providing them only when necessary, particularly upon their initial mention.

2. GPT is mainly learning distributions, and RL is introducing inductive biases to steer the distributions towards desirable properties. Therefore, I think it is crucial to include a figure that demonstrates how the distribution of these properties evolves. Supplementary Figure 8 addresses this aspect effectively, and I recommend incorporating it into the main context to provide a clear illustration.

3. The supplementary information should have the same name as the main article, i.e. ‘transformer’ should be ‘generative pre-trained transformer’.

Minor issues:

1. In the Subsection that explains “SAscore”, the sentence “which ranges in1, 10” should be “which ranges in [1, 10]”.

2. The authors stated that SGPT-RL outperformed Reinvent on the ACE2 task with a significant p-value. However, the p-value is reported as 0.0, which appears as a numerical zero. To accurately represent this score, it would be preferable to present it in 2-digit scientific notation.

Is the work clearly and accurately presented and does it cite the current literature?

Yes

If applicable, is the statistical analysis and its interpretation appropriate?

Yes

Are all the source data underlying the results available to ensure full reproducibility?

Yes

Is the study design appropriate and is the work technically sound?

Yes

Are the conclusions drawn adequately supported by the results?

Yes

Are sufficient details of methods and analysis provided to allow replication by others?

Yes

Reviewer Expertise:

Bioinformatics

Xiaopeng

King Abdullah University of Science and Technology, Saudi Arabia

Competing interests: The authors declare no conflicts of interest

14 2 2024

We thank the Reviewer for sharing our aims and appreciating our efforts. Point by point responses to the issues are as follows.

Major issues:

We thank the Reviewer for pointing out this issue. We updated the paragraphs and captions and removed the duplicated explanations to make the sentences more fluent to read.

We thank the Reviewer for the kind advice. We included the main results from Supplementary Figure 8 in the main text to showcase the improvement of the core properties during the optimization process.

3. The supplementary information should have the same name as the main article, i.e. ‘transformer’ should be ‘generative pre-trained transformer’.

We thank the Reviewer for pointing out this issue. We updated the title of the supplementary information to fix this issue.

Minor issues:

1. In the Subsection that explains “SAscore”, the sentence “which ranges in1, 10” should be “which ranges in [1, 10]”.

We thank the Reviewer for pointing out this typesetting issue. The sentence has been updated to fix it, as shown below.

"A predictive model built by Blaschke et al. was used, where molecular weight was combined with raw score, which ranges from 1 to 10, as features to predict the probability of synthetic accessibility."

We thank the Reviewer for pointing out this issue. Ideally, we would also report the p-value in 2-digit scientific notation; however, the computed result is exactly zero, so no significant digits are available. We think this is because the two distributions are so distinct. We therefore use “(p-value <0.01)” instead.
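The behaviour described here, a p-value that evaluates to exactly zero, is typical of double-precision underflow when two distributions are very far apart. The sketch below illustrates one way such a value could be detected and reported as an upper bound rather than as 0.0. The statistical test used in the article is not stated in this response, so the two-sided z-test, the function names, and the example z values are all assumptions made purely for illustration.

```python
import math
import sys


def two_sided_p_from_z(z):
    """Two-sided p-value of a standard normal test statistic (illustrative test)."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def format_p(p):
    """Format a p-value in 2-digit scientific notation.

    If the value has underflowed to exactly 0.0, report an upper bound
    (the smallest positive normal double) instead of a numerical zero.
    """
    if p == 0.0:
        return f"< {sys.float_info.min:.1e}"
    return f"{p:.1e}"


if __name__ == "__main__":
    # Hypothetical test statistics: a moderate one and an extreme one.
    for z in (5.0, 50.0):
        print(z, format_p(two_sided_p_from_z(z)))
```

Reporting a bound of this kind, or a conservative threshold such as the one adopted in the response, avoids presenting an underflowed result as a literal zero.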