Machine learning model to predict endophytic colonisation of rice cultivar plant tissues by <i>Beauveria bassiana</i> isolates and their potential as bio-control agents against rice stem borer using existing knowledge

Mireille Merlise Megnidio-Tchoukouegno; Evariste Bosco Gueguim Kana; Wonroo B.A. Bancole

doi:10.12688/f1000research.126479.1

Research Article

[version 1; peer review: 2 approved with reservations]

Mireille Merlise Megnidio-Tchoukouegno ¹, Evariste Bosco Gueguim Kana², Wonroo B.A. Bancole ³

PUBLISHED 03 Nov 2022

Author details

¹ Department of Civil Engineering, Durban University of Technology, Private Bag X01, Scottsville, Pietermaritzburg, 3209, South Africa
² School of Life Sciences, University of KwaZulu-Natal, Private Bag X01, Scottsville, Pietermaritzburg, 3209, South Africa
³ School of Agricultural, Earth and Environmental Sciences, College of Agriculture, Engineering and Science, University of KwaZulu-Natal, Private Bag X01, Scottsville, Pietermaritzburg, 3209, South Africa

Mireille Merlise Megnidio-Tchoukouegno
Roles: Conceptualization, Investigation, Methodology, Project Administration, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

Evariste Bosco Gueguim Kana
Roles: Supervision, Writing – Review & Editing

Wonroo B.A. Bancole
Roles: Conceptualization, Resources, Writing – Review & Editing

This article is included in the Plant Science gateway.

This article is included in the Artificial Intelligence and Machine Learning gateway.

Abstract

Background: Finding well-known Beauveria bassiana isolates that can protect rice crops from Sesamia calamistis (stem borer) is problematic. Another difficult task is the development of precise inoculation methods for establishing these isolates as endophytes in cereal crops. This study proposes machine learning models to predict the best isolates of the entomopathogenic fungus Beauveria bassiana for directly protecting rice crops against Sesamia calamistis.
Methods: Data-driven machine learning models were implemented and assessed on 60 experimental runs with nine feature/input variables and three target/output variables, following the foliar spray and seed treatment inoculation methods. The feature variables consisted of the rice cultivars (Nerica-L19, Nerica1, Nerica8), the time, and the five promising Beauveria bassiana isolates (Bb3, Bb4, Bb10, Bb21, Bb35). The target variables consisted of the number of colonised roots, stems and leaves, expressed as a percentage depending on the degree of protection after each inoculation. The extreme gradient boosting regression algorithm was used to model the situation where there is no direct relationship between the feature and target variables.
Results: The foliar spray inoculation method exhibited high coefficients of determination (R²) of 0.99, 0.98 and 0.94 for the number of colonised stems, roots and leaves, respectively, while the seed treatment approach exhibited coefficients of determination (R²) of 0.91, 0.87 and 0.75, respectively.
Conclusions: These results demonstrate that the Extreme Gradient Boosting algorithm effectively captured the nonlinear relationship between the attribute variables that were taken into consideration and predicted Beauveria bassiana as a bio-pesticide for rice and perhaps other cereal stem borers. Thus, this XGBoost regression model could be used to navigate the optimisation domain and reduce the development time of the biocontrol process.

Keywords

Endophytic Colonisation, Beauveria bassiana, Sesamia calamistis, Entomopathogenic Fungi, Machine Learning, XGBoost, Bio-pesticide

Corresponding authors: Mireille Merlise Megnidio-Tchoukouegno, Evariste Bosco Gueguim Kana, Wonroo B.A. Bancole

Competing interests: No competing interests were disclosed.

Grant information: The author(s) declared that no grants were involved in supporting this work.

Copyright: © 2022 Megnidio-Tchoukouegno MM et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Megnidio-Tchoukouegno MM, Gueguim Kana EB and Bancole WBA. Machine learning model to predict endophytic colonisation of rice cultivar plant tissues by Beauveria bassiana isolates and their potential as bio-control agents against rice stem borer using existing knowledge [version 1; peer review: 2 approved with reservations]. F1000Research 2022, 11:1249 (https://doi.org/10.12688/f1000research.126479.1) First published: 03 Nov 2022, 11:1249 (https://doi.org/10.12688/f1000research.126479.1) Latest published: 03 Nov 2022, 11:1249 (https://doi.org/10.12688/f1000research.126479.1)

1. Introduction

The continued success of the agricultural industry is a concern for our future because of global climate change. In addition, increased pest attacks on crops have reduced agricultural productivity, which is already severely affected by global climate change. Numerous cereal crops, particularly rice, are crucial for human nourishment and are farmed all over the world, especially in West Africa, where rice has become the primary source of employment and subsistence for destitute households (Bancole et al., 2020; Nguyen and Ferrero, 2006). Its consumption in Africa has significantly increased, making it the continent’s second-largest source of carbohydrates. However, about 140 insect pests attack rice, maize, wheat and sorghum. Lepidopteran stem borers in particular are the most commercially significant insect pests affecting production, and their management depends on chemical pesticides that have harmful consequences for people and biodiversity. Other issues brought on by the use of chemical pesticides include environmental damage, residues in food, water and soil, the possibility of harmful effects on humans and non-target organisms, as well as their prohibitive cost for small-scale farmers (Goulson, 2013; Togola et al., 2018). The use of chemical pesticides has been a major component of the pest management measures against this borer pest. However, because of the elusive feeding habits of borer pests, and the negative environmental impact and danger to human health of insecticides, borer pests are very difficult to manage with chemical insecticides. It has also been demonstrated that many stem borers, such as S. calamistis, have developed resistance to chemicals. Hence, there is a need to develop an alternative, safe control measure using entomopathogenic fungi such as Beauveria bassiana (Balsamo) Vuillemin.

Studies have shown that B. bassiana is endophytic in a number of crops and have established its function in defending plants against diseases and pest arthropods (Ownley et al., 2008; Ownley et al., 2010; Gurulingappa et al., 2011; Dara, 2013; Hollingsworth et al., 2020; Rai and Ingle, 2012; Silva et al., 2020; Wagner and Lewis, 2000), making it a potential mycopesticide (Wei et al., 2020; Zhang et al., 2012; Barra-Bucarei et al., 2020). Endophytic colonisation of entomopathogenic fungi like B. bassiana within the plant system provides more benefits than external application due to its various traits, including parasitism of a wide range of pests, different mechanisms of pathogenicity, environmental safety, endophytic colonisation, and ease of production (Azevedo et al., 2000; Vega et al., 2009; Cherry et al., 1999; Kikuchi et al., 2015). In order to manage different insects, some strains have been introduced into different plant species using a variety of inoculation techniques, including seed treatments, soil drenches, foliar and flower sprays, and stem injections. Numerous investigations have been conducted to identify the Beauveria bassiana endophytic strains that are most effective in cereal crops. Numerous experimental studies have also been conducted to determine the most effective inoculation technique and to identify a safe protection technology using endophytic entomofungal infections.

The development of machine learning algorithms, a group of analytical techniques that automate model building and iteratively learn from data to gain insights without being explicitly programmed, has provided more effective and powerful tools. These tools make it possible not only to determine the best inoculation technique for protecting cereal crops from insects and other crop infections but also to assess the potential of various promising indigenous isolates of Beauveria bassiana.

Advances in technology have driven many researchers to apply machine learning approaches to various agricultural sectors. Examples include finding the crop succession and stamp behavior (Hazard et al., 2018; Johnson and Zhang, 2014) and using environmental data as training data to determine the ideal future weather conditions for growing good crops (Kamilaris and Prenafeta-Boldú, 2018). Various algorithms, including swarm intelligence optimisation, artificial neural networks, k-nearest neighbour, and genetic algorithms, which were also extended to pesticide control in the field of plant pathology, have been used in other studies to evaluate crop yield time prediction and crop pest prediction (Teeda et al., 2018; Cai and Sharma, 2021). Furthermore, 26 diseases and 14 crop species have been categorised using deep convolutional neural networks (Mohanty et al., 2016).

Machine learning algorithms have also been used to correctly identify 13 different plant diseases as well as to identify bacteria with high prediction accuracy (Schikora et al., 2010; Sladojevic et al., 2016). However, because of its implications for precision agriculture, predicting and quantifying the best biological control agent and the best inoculation method may be more crucial in the future than disease categorisation and identification. Such studies might result in early insect prevention for cereal crops and lower pesticide costs.

The objective of this research is to use machine learning algorithms to investigate and analyse the entomopathogenic fungus Beauveria bassiana, one of the biological control agents that directly protect rice crops against S. calamistis, the most common rice arthropod in West Africa. Additionally, the inoculation technique strongly affects how well rice crops are protected from pests and other crop infections. As a result, the proposed algorithm will also aid in predicting the most effective inoculation method.

2. Methods

2.1 Research methodology

This section follows a machine learning workflow to explore and prepare the dataset for modelling purposes. The process consists of learning about the data, cleaning it by removing outliers, converting categorical variables to numerical variables, training the model using various machine learning algorithms, and evaluating the model’s performance using existing regression metrics. This procedure has several phases, which are as follows:

Data collection and description

The experimental data used for this research were obtained from a previous study (Bancole et al., 2020). The original tabulated dataset contains 63 data points for each targeted rice plant tissue. Each variable in the dataset was classified as categorical or numerical based on its nature. The selected feature variables in the original dataset consisted of the African rice cultivars (NERICA-L19, NERICA1, and NERICA8), the five Beauveria bassiana isolates (Bb3, Bb4, Bb10, Bb21 and Bb35), and the time. The target variables were the percentages of root, stem, and leaf sections colonised, based on their degree of protection after inoculation.

Data preparation

Before developing each model, a pre-processing phase was carried out to enhance the model’s predictive power in assessing the potential of five promising indigenous isolates of Beauveria bassiana as endophytes in rice sections. The least significant data points (controls 1, 2, and 3) were manually removed from the original dataset because they had no bearing on the colonisation of rice tissue following each inoculation method. Furthermore, the time was normalised in accordance with Equation 1 (Sewsynker and Kana, 2016) by translating the data into the range [0,1].

(1)

e_{i} = \frac{e_{i} - E_{\min}}{E_{\max} - E_{\min}}

where E_min and E_max stand for the minimum and maximum values of the data, and e_i on the left-hand side represents the normalised data. Rice plant tissues and Beauveria bassiana strain features were classified in the original dataset as categorical values. Because machine learning algorithms do not work directly with categorical values, the scikit-learn (Sklearn) library (RRID:SCR_019053) (Pedregosa et al., 2011) was used to automatically encode these features into numerical values. OneHotEncoder, a method for converting categorical values to numerical values, was used for this purpose. Each categorical value was mapped to an integer index and represented as a binary vector, with all entries zero except the one at that index, which was marked as one (Seger, 2018). In addition, the target variable was taken as the percentage of roots, stems, or leaves colonised depending on the inoculation method, with the highest number considered as 100% protection. Finally, seaborn (Waskom, 2021), a Python data visualisation library, provided a high-level interface for creating appealing and instructive statistical visualisations.
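The two pre-processing steps described above can be illustrated with a minimal, dependency-free sketch. It is not the scikit-learn code used in the study: the function names `min_max_normalise` and `one_hot_encode` are hypothetical, the intermediate sampling day (45) is an illustrative value, and the column order of the encoding (sorted category names) is an assumption made for this example.

```python
def min_max_normalise(values):
    """Rescale values linearly into [0, 1], following Equation 1."""
    e_min, e_max = min(values), max(values)
    if e_max == e_min:
        raise ValueError("cannot normalise a constant column")
    return [(v - e_min) / (e_max - e_min) for v in values]


def one_hot_encode(labels):
    """Return (categories, rows) with one binary vector per label."""
    categories = sorted(set(labels))          # assumed column order
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for label in labels:
        row = [0] * len(categories)           # all zeros ...
        row[index[label]] = 1                 # ... except the label's index
        rows.append(row)
    return categories, rows


# Illustrative sampling times in days (45 is a made-up intermediate value).
days = [30, 45, 60]
scaled_days = min_max_normalise(days)

# The five isolates named in the text, with one repeated label.
isolates = ["Bb3", "Bb4", "Bb10", "Bb21", "Bb35", "Bb10"]
categories, encoded = one_hot_encode(isolates)
```

Each encoded row contains exactly one 1, so a feature with five isolates expands into five binary columns.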

Feature selection

The most important characteristics influencing the colonisation of rice plant tissue were identified using a process called feature selection. This was achieved by measuring the linear relationship between pairs of variables. The rationale for using correlation to choose features is that useful attributes have a strong correlation with the target variable. A further requirement is that attributes should be uncorrelated among themselves while being correlated with the target variable. Because strongly correlated feature variables may affect an algorithm’s performance, this procedure is crucial to machine learning.
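Correlation-based feature ranking of this kind can be sketched as follows. This is a minimal illustration, assuming the Pearson coefficient as the measure of linear relationship. The helper `rank_features` and all data values are made up for the example and are not taken from the study's dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, target):
    """Sort features by the absolute value of their correlation with the target."""
    scores = {name: pearson(col, target) for name, col in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Made-up toy data: normalised time and two one-hot columns.
features = {
    "time": [0.0, 0.0, 0.5, 0.5, 1.0, 1.0],
    "Bb10": [1, 0, 1, 0, 1, 0],
    "Nerica8": [0, 1, 0, 1, 0, 1],
}
colonised_pct = [20, 10, 50, 35, 90, 70]

ranking = rank_features(features, colonised_pct)
for name, r in ranking:
    print(f"{name}: {r:+.2f}")
```

The same `pearson` function applied to two feature columns gives the feature-to-feature correlation needed to check that attributes are not strongly correlated among themselves.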

2.2 Model development

To predict the performance of the entomopathogenic fungi, computational intelligence techniques including linear regression (LR) (Pedregosa et al., 2011), least absolute shrinkage and selection operator (LASSO) (RRID:SCR_003418), support vector regression (SVR), k-nearest neighbour (KNN), ensemble learning (EN), and Extreme Gradient Boosting (XGBoost) (RRID:SCR_021361) were evaluated, with an emphasis on accuracy and efficiency as well as their ability to handle experimental data. Following a thorough analysis and application of these prediction techniques, the XGBoost regression approach was found to be scalable, adaptable, precise and reasonably quick, and to offer a more regularised model formalisation and better control of over-fitting.

XGBoost is an ensemble machine learning algorithm that can be used to solve both classification and regression predictive modelling problems. In this algorithm, decision tree models are used to build the ensemble: trees are added one at a time, and each new tree is fitted to correct the prediction errors made by the previous trees.

Suppose we have K trees, as described in Wang et al. (2019). The prediction output of XGBoost can then be written as

(2)

{\hat{y}}_{i} = \sum_{k = 1}^{K} f_{k} (x_{i}), \quad f_{k} \in F .

where F is the space of regression trees, each f_k corresponds to the prediction from one decision tree, and f_k (x_i) is the output of tree k for the i-th instance x_i (Wang et al., 2019). The objective function for the above model is given by

(3)

Obj (θ) = L (θ) + Ω (θ)

where

L (θ) = \sum_{i = 1}^{n} l (y_{i}, {\hat{y}}_{i})

is the loss function, which measures how predictive our model is with respect to the training data, y_i is the target variable, and

Ω (θ) = \sum_{k = 1}^{K} Ω (f_{k})

is the regularisation term that controls the model’s complexity and prevents over-fitting. The model is trained additively: let

{\hat{y}}_{i}^{(t)}

be the prediction of the i-th instance at the t-th iteration. Then

{\hat{y}}_{i}^{(t)}

can be expressed as:

(4)

{\hat{y}}_{i}^{(t)} = {\hat{y}}_{i}^{(t - 1)} + f_{t} (x_{i}) .

At iteration t, the tree f_t is chosen to minimise the following objective:

(5)

{Obj}^{(t)} = \sum_{i = 1}^{n} l (y_{i}, {\hat{y}}_{i}^{(t - 1)} + f_{t} (x_{i})) + Ω (f_{t}) .

In the general case, a second-order Taylor approximation can be used to optimise the objective:

(6)

{Obj}^{(t)} \approx \sum_{i = 1}^{n} (l (y_{i}, {\hat{y}}_{i}^{(t - 1)}) + g_{i} f_{t} (x_{i}) + \frac{1}{2} h_{i} f_{t}^{2} (x_{i})) + Ω (f_{t}),

where g_i and h_i are the first- and second-order gradient statistics of the loss function, respectively. The model therefore combines three elements: weak learners, typically decision trees, that produce the predictions; an additive model that adds trees so as to minimise the loss; and a loss function to be optimised. It is designed to be highly effective and computationally efficient, possibly even more so than other open-source implementations (Tianqi, 2016; Friedman, 2001). The primary issue with complex non-linear algorithms such as gradient boosting is their propensity to over-fit the training data, and a frequent strategy to limit this issue is early stopping (Raskutti et al., 2011).
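The additive training scheme of Equations 4 to 6 can be made concrete with a small sketch. It is an illustration under stated assumptions, not the XGBoost library: the loss is taken to be the squared error l = ½(y − ŷ)², so that g_i = ŷ_i − y_i and h_i = 1. Each tree is a single split on one feature. The leaf weight w = −G/(H + λ) assumes the regularisation Ω(f) = γT + ½λ‖w‖² from the original XGBoost paper, with γ omitted. A learning rate (shrinkage), which does not appear in Equation 4, is added. The names `fit_stump` and `boost` and the data are hypothetical.

```python
def fit_stump(x, g, h, lam):
    """Fit a one-split tree on one feature from the gradient statistics g and h."""
    def leaf_weight(G, H):
        return -G / (H + lam)          # minimiser of G*w + 0.5*(H + lam)*w**2

    def leaf_score(G, H):
        return G * G / (H + lam)       # loss reduction achieved by that leaf

    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [i for i, v in enumerate(x) if v <= threshold]
        right = [i for i, v in enumerate(x) if v > threshold]
        GL, HL = sum(g[i] for i in left), sum(h[i] for i in left)
        GR, HR = sum(g[i] for i in right), sum(h[i] for i in right)
        score = leaf_score(GL, HL) + leaf_score(GR, HR)
        if best is None or score > best[0]:
            best = (score, threshold, leaf_weight(GL, HL), leaf_weight(GR, HR))
    if best is None:                   # constant feature: single leaf
        w = leaf_weight(sum(g), sum(h))
        return lambda v: w
    _, threshold, w_left, w_right = best
    return lambda v: w_left if v <= threshold else w_right


def boost(x, y, n_trees=100, learning_rate=0.3, lam=1.0):
    """Add trees one at a time, each fitted to the current gradients."""
    preds = [0.0] * len(y)
    trees = []
    for _ in range(n_trees):
        g = [p - t for p, t in zip(preds, y)]   # first-order gradients
        h = [1.0] * len(y)                      # second-order gradients
        tree = fit_stump(x, g, h, lam)
        trees.append(tree)
        preds = [p + learning_rate * tree(v) for p, v in zip(preds, x)]
    return lambda v: sum(learning_rate * tree(v) for tree in trees)


# Made-up data: normalised time versus percentage of colonised sections.
x = [0.0, 0.0, 0.5, 0.5, 1.0, 1.0]
y = [10.0, 12.0, 40.0, 44.0, 80.0, 84.0]
model = boost(x, y)
print(model(0.0), model(0.5), model(1.0))
```

Because the two samples at each time point share the same feature value, the fitted predictions approach the mean of each pair, which is the best any tree on this single feature can achieve.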

2.3 Model validation

The goal of creating a predictive model is to create a model that is accurate on previously unseen data. This can be accomplished using statistical techniques in which the training dataset is carefully used to estimate the model’s performance on new and unknown data. The most basic technique of model validation is to perform a train/test split on the dataset. A typical ratio for this varies depending on the amount of data, but it is critical to have enough training data. After training the model with the hyper-parameters provided by the algorithms, predictions on test data must be made and compared to the expected results.

In the current study, the datasets were split into training sets, which comprised 80% of the datasets, and testing sets, which comprised 20% of the datasets. The hyper-parameters were supplied as arguments while creating an instance of an XGBoost regressor from the XGBoost library. This stage is crucial for controlling how effectively the machine learning techniques are being used. The training section of the datasets was used to train the adopted algorithms with tuning parameters, and the testing portion of the datasets was used to demonstrate the developed model’s response to new data being processed for the first time.

A random state (seed) was also fixed to maintain reproducibility of the results. The root mean squared error and the coefficient of determination R² were computed on validation data and used to assess the accuracy of the model. Parameter tuning was conducted to avoid over-fitting and under-fitting. This was achieved by varying some XGBoost parameters between their minimum and maximum values while all other parameters were maintained at their default values.
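The 80/20 split with a fixed random state and the two evaluation metrics can be sketched as below. This is a dependency-free illustration, not the code used in the study. The function names, the seed value 42, and the use of integers as stand-ins for the 60 experimental runs are assumptions made for the example.

```python
import math
import random


def train_test_split(rows, test_fraction=0.2, random_state=42):
    """Shuffle with a fixed seed, then hold out a fraction for testing."""
    rng = random.Random(random_state)      # fixed seed gives a repeatable split
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


rows = list(range(60))                     # stand-in for 60 experimental runs
train, test = train_test_split(rows)
print(len(train), len(test))
```

Calling `train_test_split` again with the same `random_state` returns the same partition, which is what makes reported metrics repeatable.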

Our main focus was on four main parameters, namely the number of estimators, the learning rate, colsample_bytree and max_depth, together with the regularisation parameters. For the other parameters, the default values set in the XGBoost package were used. When training a deep tree, XGBoost rapidly consumes memory, so caution is needed when choosing large values of max_depth (Tianqi, 2016). Suitable hyper-parameter values can be determined through systematic testing, such as grid searching across a range of values, or by trial and error for a given dataset. It is important to emphasise that cross-validation was used to choose the optimum parameters and decrease the weight of each step in order to strengthen the model.
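Grid searching with k-fold cross-validation can be sketched generically as follows. This is an illustration only. To stay self-contained it tunes a small k-nearest-neighbour regressor as a stand-in for XGBoost, uses contiguous unshuffled folds, and scores by mean squared error. The names `k_fold_indices`, `grid_search_cv` and `fit_knn`, the grid values and the data are all hypothetical.

```python
import itertools


def k_fold_indices(n, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        val = list(range(start, stop))
        train = [i for i in range(n) if i < start or i >= stop]
        yield train, val
        start = stop


def grid_search_cv(fit, grid, x, y, k=5):
    """Try every parameter combination; keep the lowest mean validation MSE.

    fit(x_train, y_train, **params) must return a predict(value) callable.
    """
    names = sorted(grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_errors = []
        for train, val in k_fold_indices(len(x), k):
            model = fit([x[i] for i in train], [y[i] for i in train], **params)
            mse = sum((y[i] - model(x[i])) ** 2 for i in val) / len(val)
            fold_errors.append(mse)
        score = sum(fold_errors) / len(fold_errors)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def fit_knn(x_train, y_train, n_neighbors):
    """Stand-in model: average the targets of the nearest training points."""
    def predict(value):
        pairs = sorted(zip(x_train, y_train), key=lambda p: abs(p[0] - value))
        nearest = pairs[:n_neighbors]
        return sum(t for _, t in nearest) / len(nearest)
    return predict


# Made-up one-dimensional data.
x = [i / 19 for i in range(20)]
y = [100 * v * v for v in x]
best_params, best_score = grid_search_cv(fit_knn, {"n_neighbors": [1, 3, 7]}, x, y)
print(best_params, best_score)
```

With several tuned parameters, the grid dictionary simply gains more keys, and the number of model fits grows as the product of the value counts times k.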

3. Results

It is critical to identify the most influential features via correlation in order to speed up the prediction process and avoid potential over-fitting by reducing the number of attributes considered. This was accomplished by examining the data to determine how the feature variables affect the colonisation of roots, stems, and leaves following the two inoculation methods. Furthermore, a model performance evaluation is presented, in which the accuracy of the prediction results was verified and the competency of the four algorithms was compared for different k-fold cross-validations.

For the correlation analysis of the feature engineering results, the most predominant attributes are taken into account. The values can be interpreted as Pearson correlation coefficients: values closer to 1 signify a strong and direct correlation between the two features, whereas values closer to -1 indicate a strong but inverse correlation. The seaborn library was used to plot the correlation between variables.

Figure 1 (Kana et al., 2022) presents the pairwise relationships between variables. Every variable shows a relationship with another variable regardless of the inoculation method. There is a strong positive correlation between the number of colonised stem sections and the passage of time (30-60 days). This shows that the degree of stem protection increases with the number of days following foliar spray treatment. Also, Beauveria bassiana isolates Bb10 and Bb3 have correlation values of 0.26 and 0.19, respectively, with the number of stem sections colonised, indicating that Bb10 and Bb3 are the most effective strains for rice stem tissue. Nerica1 has a correlation value of 0.2 with rice stem colonisation, indicating that colonisation of the rice stems is favoured in the cultivar Nerica1. On the other hand, Nerica8 has a correlation value of -0.26 with rice stem colonisation, indicating that colonisation of rice stems is not favoured in the cultivar Nerica8. There is also a moderate negative correlation between the number of stem sections colonised and both Bb21 and Bb35, indicating similar levels of pathogenicity of these Beauveria bassiana isolates.