Ensemble machine learning modeling for the prediction of artemisinin resistance in malaria

Resistance in malaria is a growing concern affecting many areas of Sub-Saharan Africa and Southeast Asia. Since the emergence of artemisinin resistance in the late 2000s in Cambodia, research into the underlying mechanisms has been underway. The 2019 Malaria Challenge posed the task of developing computational models that address important problems in advancing the fight against malaria. The first goal was to accurately predict the artemisinin drug resistance levels of Plasmodium falciparum isolates, as quantified by the IC50. The second goal was to predict the parasite clearance rate of malaria parasite isolates based on in vitro transcriptional profiles. In this work, we develop machine learning models using novel methods for transforming isolate data and for handling the tens of thousands of variables that result from these data transformations. We demonstrate this by using massively parallel processing to vectorize the data for use in scalable machine learning. In addition, we show the utility of ensemble machine learning modeling for highly effective predictions of both goals of this challenge, combining multiple machine learning algorithms with various scaling and normalization preprocessing steps. Finally, a voting ensemble combines multiple models to generate the final model prediction.

Artemisinin-based therapies are among the best treatment options for malaria caused by P. falciparum 2 . The use of artemisinin in combination with other drugs, called artemisinin combination therapies, is the best treatment option today against malaria infections.
However, the emergence of artemisinin resistance in Thailand and Cambodia in 2007 has spurred research into its causes 3 . While polymorphisms in the kelch domain-carrying protein K13 of P. falciparum are known to be associated with artemisinin resistance, the underlying molecular mechanism that confers resistance remains unknown 4 . In early 2020, Birnbaum et al. discovered that the highly-conserved gene kelch13 is involved in a molecular mechanism that allows the parasite to feed on host erythrocytes by endocytosis of hemoglobin 5 . Given that artemisinin is activated by hemoglobin degradation products, these mutations can confer resistance to artemisinin.
The established pharmacodynamic benchmark for P. falciparum sensitivity to artemisinin-based therapy is the parasite clearance rate 6,7 . Resistance to artemisinin-based therapy is considered present when the parasite clearance half-life is greater than five hours 8 . By understanding the genetic factors that affect resistance in malaria, targeted development can occur in an effort to abate further resistance or infections by resistant strains. Previous research has shown success in applying similar machine learning methods to the explanation of genetic differences in plants 9 , fungi 10 , and even humans 11 . Previous work in machine learning-based tropical disease research, including malaria and other diseases, has proven effective in drug discovery 12,13 and in the understanding of degradomes 14 . Other machine learning work in malaria has focused on the identification and diagnosis of malaria using image classification [15][16][17] .
In this work, we create multiple machine learning-based models to address these issues around artemisinin resistance and parasite clearance. Given that the manual interpretation and analysis of many genes and their effects on resistance would be tedious, machine learning allows for a more powerful investigation of this relationship. In addition, we employ model explainability methods to help rank particular genes of interest in the malaria genome.

Prediction of artemisinin IC50
First, we created a machine learning model to predict the IC50 of malaria parasites based on transcription profiles of experimentally-tested isolates. The IC50, also known as the half maximal inhibitory concentration, is the drug concentration at which 50% of parasites die. This value indicates a parasite population's ability to withstand various doses of antimalarial drugs, such as artemisinin.

Methods
Training data was obtained from the 2019 DREAM Malaria Challenge 18,19 . The training data consists of gene expression data of 5,540 genes of 30 isolates from the malaria parasite, Plasmodium falciparum. For each malaria parasite isolate, transcription data was collected at two time points [6 hours post invasion (hpi) and 24 hpi], with and without treatment of dihydroartemisinin (the metabolically active form of artemisinin), each with a biological replicate. This yields a total of eight data points for each isolate. The initial form of the training dataset contains 272 rows and 5,546 columns, as shown in Table 1.

Amendments from Version 4
In this revision, we have addressed the latest reviewer's comments around the applicability of this work to the broader field of parasitology and have also included some new work from Birnbaum et al. 2020. In addition, we discuss the need for lab-based (in vitro) validation of these in silico findings, though this work helps to highlight the most probable/important things to test first.
It should be noted that there is some specific information that reviewers are asking about the input data that we do not have yet as this is part of a larger DREAM Challenge. Once this information is public, we will likely add it to this work as well.
Any further responses from the reviewers can be found at the end of the article.

The transcription data was collected as described in Table 2. The transcription data set consists of 92 non-coding RNAs (denoted by gene IDs that begin with 'MAL'), while the rest are protein-coding genes (denoted by gene IDs that start with 'PF3D7'). The feature to predict is DHA_IC50.

Data preparation
We used Apache Spark 20 to pivot the dataset such that each isolate was its own row and each of the transcription values for each gene and attributes (i.e. timepoint, treatment, biological replicate) combination was its own column. This exercise transformed the training dataset from 272 rows and 5,546 columns to 30 rows and 44,343 columns, as shown in Table 3. We completed this pivot by slicing the data by each of the eight combinations of timepoint, treatment, and biological replicate, dynamically renaming the variables (genes) for each slice, and then joining all eight slices back together.
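The pivot described above can be illustrated in miniature. The following is a sketch using pandas rather than Spark, with toy data; the column names (Isolate, Timepoint, Treatment, BioRep, GeneA, GeneB) are hypothetical stand-ins for the challenge's attributes:

```python
import pandas as pd

# Toy long-format data: one row per isolate/timepoint/treatment/replicate
# combination, one column per gene (names here are illustrative).
df = pd.DataFrame({
    "Isolate":   ["iso1"] * 4 + ["iso2"] * 4,
    "Timepoint": [6, 6, 24, 24] * 2,
    "Treatment": ["DHA", "UT"] * 4,
    "BioRep":    [1] * 8,
    "GeneA":     [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    "GeneB":     [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8],
})

# Pivot so each isolate becomes one row and each
# gene/timepoint/treatment/replicate combination becomes one column,
# mirroring the slice-rename-join exercise described above.
wide = df.pivot_table(
    index="Isolate",
    columns=["Timepoint", "Treatment", "BioRep"],
    values=["GeneA", "GeneB"],
)
wide.columns = ["_".join(map(str, c)) for c in wide.columns]
print(wide.shape)  # (2 isolates, 2 genes x 4 condition combinations) -> (2, 8)
```

The same column explosion, applied to 5,540 genes across eight condition combinations plus the retained attributes, produces the 44,343-column matrix described above.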
By using the massively parallel architecture of Spark, this transformation can be completed in a minimal amount of time on a relatively small cluster environment (e.g., <10 minutes using an 8-worker/36-core cluster with PySpark on Apache Spark 2.4.3).
Lastly, the dataset is vectorized using the Spark VectorAssembler and converted into a NumPy 21 -compatible array. Vectorization allows for highly scalable parallelization of the machine learning modeling in the next step.

Machine learning
We used the Microsoft Azure Machine Learning Service 23 as the tracking platform for retaining model performance metrics as the various models were generated. For this use case, 498 machine learning models were trained using various combinations of scaling techniques and algorithms; the scaling and normalization methods, along with the algorithms used, are shown in Table 14. We then created two ensemble models of the individual models using the Stack Ensemble and Voting Ensemble methods. All of the machine learning algorithms are from the scikit-learn package 25 except for LightGBM, which is from the LightGBM package 26 . The settings for the model sweep are defined in Table 4. The 'Preprocess Data?' parameter enables the scaling and imputation of the features in the data. Note that these models were evaluated using random sampling of the input training dataset provided by the DREAM Challenge, though the evaluation within the challenge was performed on an unlabelled testing dataset. The metrics in the Results section below reflect the evaluation on the sampled training data.
Once the 498 individual models were trained, two ensemble models (voting ensemble and stack ensemble) were then created and tested. The voting ensemble method makes a prediction based on the weighted average of the previous models' predicted regression outputs whereas the stacking ensemble method combines the previous models and trains a meta-model using the elastic net algorithm based on the output from the previous models. The model selection method used was the Caruana ensemble selection algorithm 27 .
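The two ensemble types can be sketched with scikit-learn's VotingRegressor and StackingRegressor. This is an illustration only: the base learners, weights, and data below are toy stand-ins rather than the 498 models from the actual sweep, though the stacking meta-model does use elastic net as described:

```python
import numpy as np
from sklearn.ensemble import StackingRegressor, VotingRegressor
from sklearn.linear_model import ElasticNet, Ridge
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))  # e.g. 30 isolates x 10 features (toy data)
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=30)

base = [("ridge", Ridge()), ("tree", DecisionTreeRegressor(random_state=0))]

# Voting ensemble: a weighted average of the base models' predictions.
voter = VotingRegressor(estimators=base, weights=[0.7, 0.3]).fit(X, y)

# Stacking ensemble: an elastic net meta-model trained on the
# out-of-fold predictions of the base models.
stacker = StackingRegressor(estimators=base,
                            final_estimator=ElasticNet()).fit(X, y)

print(voter.predict(X[:2]), stacker.predict(X[:2]))
```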

Results
The voting ensemble model (using soft voting) was selected as the best model, having the lowest normalized Root Mean Squared Error (RMSE), as shown in Table 5. The top 10 models trained are reported in Table 6.
Having a normalized RMSE of only 0.1228 and a Mean Absolute Percentage Error (MAPE) of 24.27%, this model is expected to accurately predict the IC50 of malaria isolates. See Figure 1 for a visualization of the experiment runs and Figure 2 for the distribution of residuals of the best model.
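For reference, the two reported metrics can be computed as in the sketch below. Normalizing the RMSE by the range of observed values is one common convention; the challenge's exact normalization is assumed here, and the data are illustrative:

```python
import numpy as np

def normalized_rmse(y_true, y_pred):
    # RMSE divided by the range of the observed values
    # (one common normalization convention).
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / (y_true.max() - y_true.min())

def mape(y_true, y_pred):
    # Mean Absolute Percentage Error, in percent.
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

y_true = np.array([1.0, 2.0, 4.0, 8.0])
y_pred = np.array([1.1, 1.8, 4.4, 7.2])
print(normalized_rmse(y_true, y_pred), mape(y_true, y_pred))  # MAPE is 10% here
```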

Prediction of resistance status
The second task of this work was to create a machine learning model that can predict the parasite clearance rate (fast versus slow) of malaria isolates. When resistance rates change in a pathogen, this can be indicative of regulatory changes in the pathogen's genome. These changes can be exploited for the prevention of further resistance spread. Thus, a goal of this work is to understand the genes important in the prediction of artemisinin resistance. The relationship of this use case to the first is that parasite clearance is a measure of the effectiveness of a treatment regimen. While the first use case looked at the drug concentration, this use case looks at the speed at which the parasites are cleared as a result of a standard treatment.

Methods
An in vivo transcription data set from Mok et al. (2015) 28 was used to predict the parasite clearance rate of malaria parasite isolates based on in vitro transcriptional profiles (see Table 8).
The training data consists of 1,043 isolates with 4,952 genes from the malaria parasite Plasmodium falciparum.
For each malaria parasite isolate, transcription data was collected for various PF3D7 genes. The form of the training dataset contains 1,043 rows and 4,957 columns, as shown in Table 7. The feature to predict is ClearanceRate.

Data preparation
The training data for this use case did not require the same pivoting transformations as in the previous use case, as each record describes a single isolate. Thus, only the vectorization of the data was necessary, which was performed using the Spark VectorAssembler and then converted into a NumPy-compatible array 22 . Note that this vectorization only kept the numerical columns, which excludes the Country, Kmeans_Grp, and Asexual_stage__hpi_ attributes, as they are either absent or contain non-matching factors (i.e., a different set of countries) in the testing data.
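The restriction to numerical columns can be illustrated with a small pandas sketch (the gene column names below are toy stand-ins; only the excluded attribute names come from the actual dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "Isolate":       ["iso1", "iso2"],   # identifier: non-numeric, excluded
    "Country":       ["KH", "TH"],       # categorical: excluded
    "Kmeans_Grp":    ["A", "B"],         # categorical: excluded
    "PF3D7_0100100": [0.5, 0.7],         # numeric gene expression: kept
    "PF3D7_0100200": [1.2, 0.9],         # numeric gene expression: kept
})

# Keep only numeric columns, mirroring the behavior described above.
numeric = df.select_dtypes(include="number")
print(list(numeric.columns))
```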

Machine learning
Once the 98 individual models were trained, two ensemble models (voting ensemble and stack ensemble) were then created and tested as before. Model search parameters are shown in Table 9.

Results
The voting ensemble model (using soft voting) was selected as the best model, having the highest area under the receiver operating characteristic curve (AUC), as shown in Table 11. The top 10 of the 100 models trained are reported in Table 10. Having a weighted AUC of 0.87 and a weighted F1 score of 0.80, this model is expected to accurately predict isolate clearance rates. A confusion matrix of the predicted results versus actuals is shown in Table 12. See Figure 3 for a visualization of the experiment runs and see Figure 4 and Figure 5 for the ROC and Precision-Recall curves of the best model. Note that these models were evaluated using random sampling of the input training dataset provided by the DREAM Challenge, though the evaluation within the challenge was performed on an unlabelled testing dataset. The metrics reported in this section reflect the evaluation on the sampled training data.
Note that the averages reported in Figure 4 and Figure 5 are computed as follows:
• 'macro': The arithmetic mean of the score for each class. This does not take class imbalance into account.
• 'weighted': The arithmetic mean of the score for each class, weighted by the number of true instances in each class (support).
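The difference between the two averaging schemes can be seen in a small scikit-learn example with imbalanced classes, mirroring the "Fast"/"Slow" imbalance noted in the Discussion (the labels and predictions below are illustrative):

```python
from sklearn.metrics import f1_score

# Imbalanced toy labels: 6 "Fast" vs 2 "Slow" isolates.
y_true = ["Fast"] * 6 + ["Slow"] * 2
y_pred = ["Fast"] * 6 + ["Slow", "Fast"]  # one "Slow" isolate misclassified

macro = f1_score(y_true, y_pred, average="macro")
weighted = f1_score(y_true, y_pred, average="weighted")
print(macro, weighted)
```

Because the majority "Fast" class is predicted well, the weighted average exceeds the macro average here; the macro average is pulled down by the poorly-predicted minority class.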

Feature importance
Feature importances were calculated using a mimic-based model explanation of the ensemble model 29 . The mimic explainer works by training global surrogate models to mimic blackbox models (i.e., complex models that are difficult to explain). The surrogate model is an interpretable model, trained to approximate the predictions of the black box model as accurately as possible 30 . In Figure 6 and Table 13, the feature importance values for each class ("Slow", "Fast", and NULL) are shown. This shows which genes are important in the prediction of clearance rate. The mimic explainer was chosen over other traditional methods such as principal component analysis (PCA) because of its ability to provide clearer interpretations of the features' importance. PCA occludes the true values of individual features by summarizing multiple features together. Given that insights into particular genes' importance for resistance were desired here, the mimic explainer provides this output in a more straightforward manner.
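The global-surrogate idea behind the mimic explainer can be sketched in scikit-learn: train an interpretable model on the black box's predictions (not the true labels) and check how faithfully it mimics them. The models and data below are illustrative stand-ins, not the actual explainer used in this work:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# "Black box": a complex model standing in for the voting ensemble.
blackbox = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Global surrogate: an interpretable model trained to approximate
# the black box's *predictions* rather than the true labels.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, blackbox.predict(X))

# Fidelity: how often the surrogate agrees with the black box.
fidelity = (surrogate.predict(X) == blackbox.predict(X)).mean()
print(fidelity, surrogate.feature_importances_)
```

The surrogate's feature importances are then read off as a proxy for which inputs drive the black box's behavior, which is how per-gene importance values are obtained here.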

Discussion
By using distributed processing of the data preparation, we can successfully shape and manage large malaria datasets. We efficiently transformed a matrix of over 40,000 genetic attributes for the IC50 use case and over 4,000 genetic attributes for the resistance rate use case. This was completed with scalable vectorization of the training data, which allowed for many machine learning models to be generated. By tracking the individual performance results of each machine learning model, we can determine which model is most useful. In addition, ensemble modeling of the various singular models proved effective for both tasks in this work. While the number of training observations for each use case stands to be improved, the usage of adequate cross-validation can help to reduce the risk of overfitting models to such a small dataset. Also note that there is an imbalance in the number of samples in each class in the clearance rate experiment, which stands to be remedied in future work. There are more than double the number of "Fast" clearance rate isolates compared to "Slow". This can be seen in the variation in model performance as indicated by the macro average Precision-Recall curve (Figure 5).
The resulting model performance of both the IC50 model and the clearance rate model shows relatively adequate fitting of the data for their respective predictions. While additional model tuning may provide a lift in model performance, we have demonstrated the utility of ensemble modeling in these predictive use cases in malaria. In both models, we show that IC50 and clearance rate can be effectively predicted using transcriptomic analysis data with machine learning. By extension, this also predicts the phenotypic result of the genetic variations among the samples as it relates to resistance.

In a broader sense, for the field of parasitology, this exercise helps to quantify the importance of genetic features, spotlighting potential genes that are significant in artemisinin resistance. The merit of this work is that it showcases the utility of machine learning in understanding the underlying genetic/transcriptomic mechanisms that affect drug performance.
Specific examples include PF3D7_1245300, the most important feature in predicting slow parasite clearance. PF3D7_1245300 is the gene that codes for the NEDD8-conjugating enzyme UBC12 (UniProt ID: Q8I4X8), a ligase used in the ubiquitin conjugation pathway. Another example, PF3D7_1107700 (UniProt ID: Q8IIS5), is the most important gene for the fast clearance rate class; it is important in the regulation of the cell cycle, specifically in the maturation of ribosomal RNAs and in the formation of the large ribosomal subunit. Future in vitro experiments should be performed to validate these in silico findings. While biological confirmation of these genetic factors is needed, this analysis helps to rank the most probable factors by importance, thereby reducing the in vitro work to be performed.
These two examples of important genes, along with the others identified here, may one day be targets for future drugs or may prove integral to the overall understanding of how resistance works in P. falciparum. These models will help direct the development of alternative treatments or the coordination of combination therapies in resistant infections, and they provide an example of the usage of machine learning in the identification of important genetic features in infectious disease research.

Preprint
An earlier version of this article can be found on bioRxiv (doi: 10.1101/856922).

Underlying data
The challenge datasets are available from Synapse (https://www.synapse.org/; Synapse ID: syn18089524). Access to the data requires registration and agreement to the conditions for use at: https://www.synapse.org/#!Synapse:syn18089524.
Challenge documentation, including the detailed description of the Challenge design, data description, and overall results can be found at: https://www.synapse.org/#!Synapse:syn16924919/wiki/583955.

Alyssa E Barry
IMPACT, School of Medicine, Deakin University and Burnet Institute, Melbourne, Victoria, Australia

Myo Naung
Walter and Eliza Hall Institute, University of Melbourne, Melbourne, Australia

This is commendable work by the authors, making use of two publicly available datasets - the 2019 DREAM Malaria Challenge and an in vivo transcription data set from Mok et al. (2015) - to create a confident machine learning model predicting IC50 (the rate at which parasites respond to artemisinin) and the transcriptional features (gene expression) involved in fast vs slow parasite clearance rates. Source code was also made available to the public for reproducibility. The manuscript would benefit from more structure in the "methods" and "results" sections to present a clearer analysis workflow and results. Adding more explanation and discussion of the biological significance of the generated models should improve the quality of the manuscript. There are a few specific suggestions for the authors for a revision of their manuscript below.

Major
The strongest suggestion is to re-structure the methods section. Rather than having separate sections ("method", "data preparation", "results") for each machine learning exercise, I suggest merging some of these paragraphs. For instance, the entire method section describing model generation (pages 3-5) to predict IC50 could be renamed "machine learning method to predict IC50". Along this line, the paragraph titled "Prediction of artemisinin IC50" should merge with this section. A similar arrangement is also suggested for the paragraphs on pages 5-7. Adding a generic workflow figure should clarify some of these issues. Both of the results sections (pages 5 & 8) should come after the methods.
The authors used datasets specific to Plasmodium falciparum malaria, and thus the title should reflect P. falciparum malaria. The results section paragraph (page 8) should be expanded to include more explanation of the results from Figure 6, which can be connected to the previously known attributes of these top-30 hits as discussed in the 4th paragraph of the discussion (page 13).

Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Partly

Reviewer Report 10 July 2020
© 2020 Burrows J. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Jeremy Burrows
Medicines for Malaria Venture (MMV), Geneva, Switzerland

Page 3: Artemisinin-based therapies are described as being among the best treatment options for falciparum malaria. ACTs are the mainstay therapy and are, definitively, the best treatment options. This should be altered.
Page 3: The underlying biology of artemisinin partial resistance is becoming clearer -the authors should cite Kelch13-defined endocytosis pathway mediates artemisinin resistance in malaria parasites 1 .
Page 3: In terms of predicting the IC50 of DHA on parasites, the authors really need to comment on the importance of the time point and whether the parasites are synchronous or asynchronous. Usually, growth inhibition assays are 48-72 h with asynchronous parasites, and these result in virtually no differences in the IC50s between WT and highly resistant K13 mutant strains; indeed, this is why artemisinin partial resistance took so long to be identified. Table 1 does show the timepoint (which is good), but the synchronicity of the isolate is not mentioned. Also, did the group include well-characterized control lab-adapted strains (both resistant and WT)? What is the range of IC50s in the data set?

Table 1: Abbreviations need to be described. I know what DHA is, but some readers may not.
What is UT?
The computational discussion is beyond me, but I was trying to work out exactly what the authors were claiming. Is the conclusion of the first step that the IC50 can be predicted based on the transcriptomic analysis, given full genomic information of an isolate? If so, that could be interesting in predicting phenotype from genotype (in the absence of phenotypic data), but if, on the other hand, it simply confirms resistance will be evident when certain mutations are involved, then that is not so helpful as we know that already. Can the authors very clearly, in layman's terms, explain what value their model offers to the parasitology community? The same points relate to parasite clearance rate? If the model is simply telling us what we know already then that is significantly less useful or interesting than if it predicts things that we do not yet know. Some very clear explanations of the hypotheses and conclusions are needed for non-computational parasitologists to understand the merit of this work. This is not approvable without such clarity, on the assumption that other reviewers with computational expertise have approved the underlying methods.
The identification of PF3D7_1245300 (a NEDD8-conjugating enzyme, UBC12) and PF3D7_1107700, which seem to me to be the predicted genes for slow and fast clearance, are the main conclusions from this work, and there should be a stronger statement with respect to the need for follow-up biology to confirm these.
I would like to have seen a plot of predicted vs actual IC50 and predicted vs actual clearance rate in a form that is easily interpretable (perhaps it's there for those in the 'know'). I was still left unclear as to how good the models were; the authors described them as 'adequate' which sounds rather underwhelming.
In short -the work may have merit, but it is not communicated in a form that makes it clear what the added value is to use the model and what the actual quality and impact is of the model.
As for the specific questions about the data used in this study, we are still waiting on the overall DREAM Challenge write-up and release to occur, which should contain much more in-depth information about the lab procedures (timepoints, tests, etc.) and data collection. We are open to adding this information to our paper once we can get it from the DREAM Challenge.

Stefan Jaeger
National Library of Medicine, National Institutes of Health, Bethesda, USA

Sameer K. Antani
Communications Engineering Branch, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA

The authors have addressed several, but not all, of the reviewers' comments. The description of the state-of-the-art could be stronger. For example, the authors should discuss the status quo in machine learning for malaria drug-resistance detection, and the status/results of the DREAM Competition in particular, including the context of the data used. Some questions remain regarding the computation of averages in the ROC and precision-recall curves in Figures 4 and 5; see, for example, the bump in the latter (blue curve). The authors have not explained Figure 6, as reviewers asked them to do (labeling and legend fonts are too small). The authors also don't explain how they do the testing (size of test set, evaluation scheme, etc.). I am not sure about the usefulness of Figure 2; sensitivity and specificity are more intuitive measures than squared errors.

Is the description of the method technically sound? Partly
Are sufficient details provided to allow replication of the method development and its use by others? Partly
If any results are presented, are all the source data underlying the results available to ensure full reproducibility? Partly
Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Partly

Colby Ford, University of North Carolina at Charlotte, USA
We sincerely appreciate the reviewers' feedback on this work and have improved the article based on your recommendations.
We have addressed each comment as follows in the article:
• Added additional examples of ML-based work in genomics, other tropical diseases, and in malaria.