Machine learning algorithms: their applications in plant omics and agronomic traits’ improvement [version 1; peer review: awaiting peer review]

Agronomic traits of plants, especially those of economic or aesthetic importance, are threatened by climatic and environmental factors such as climate change and biotic and abiotic stresses. These threats are now being mitigated through the analysis of omics data from genomics, transcriptomics, proteomics, metabolomics, and phenomics. The emergence of high-throughput omics technology has led to an avalanche of plant omics data. Plant research demands novel analytical paradigms to extract and harness large plant omics data for plant improvement effectively and efficiently. Machine learning algorithms are well-suited analytical and computational approaches for the integrative analysis of large, unstructured, heterogeneous datasets. This study presents an overview of omics approaches to improve plant agronomic traits and of crucial curated plant genomic data sources. Furthermore, we summarize machine learning algorithms and software tools/programming packages used in plant omics research. Lastly, we discuss advancements in the application of machine learning algorithms to improving agronomic traits of economically important plants. Extensive application of machine learning would advance plant omics studies. These advancements would consequently help agricultural scientists improve economically important plants' quality, yield, and tolerance against abiotic and biotic stresses and other plant health-threatening issues.


Introduction
The global agricultural system is threatened by ecological events such as climate change and other environmental stresses. 1,2 These events affect the yield, stability, and quality of production of economically important plants such as medicinal plants, fruit crops, food crops, cereal or grain crops, and legume seed crops or pulses. 1,3 These challenges are addressed by omics approaches via numerous unconventional improvement methodologies. 4,5 Omics approaches involve analysis of the constituents of the genome sequence and of other macromolecules generated from the encoded genomic information. However, the utilization of omics-derived knowledge and technologies in plant improvement strategies is limited and difficult. Other drawbacks of omics technologies include a lack of data integration and of effective phenotype-genotype correlation strategies. As a result, promoting the integration of computational biology and plant genomics to assist plant development is critical. 6,7 This paper gives an overview of omics methodologies for improving plant agronomic traits as well as essential curated plant genomic data sources. We discuss the bioinformatics software, tools, and packages that are utilized in omics-based plant improvement research. We also examine how machine learning algorithms have been used to improve agronomic features of commercially significant plants, their major contributions, and their future outlook in plant omics and agronomics.

Plant genome sequences and bioinformatics resources
Genome sequencing was made possible with the advent of sequencing technologies. 8,9 Complete sequencing of a plant genome was first demonstrated for the model plant Arabidopsis (Arabidopsis thaliana) 10,11 and afterward for rice (Oryza sativa). Subsequently, the whole genomes of over 250 species in the plant kingdom have been sequenced, spanning bryophytes, pteridophytes, gymnosperms, and angiosperms 12,13 (Figure 1). Angiosperms account for 95% of the sequenced species, most of which are economically important plants or their wild relatives (Figure 2). Food crops like rice, wheat, beans, oat, maize, and soybean are among the sequenced plants, as are ornamental plants like orchid and hibiscus, industrial plants like oilseed and hemp, and spices/herbs like garlic, ginger, turmeric, moringa, artemisia, and neem, which are known for their high therapeutic value.
A variety of databases have been created to provide access to plant genome datasets. 11,14 The genome database of the model plant A. thaliana, launched in 2001, was the premier plant genome database. 10,15 Subsequently, many databases and resources have been developed for plant genomes. The earliest genome databases were essentially archives of genome sequence data. These databases have expanded into genome portals/hubs that combine different genomic data and web servers that offer online genomics analysis. The availability of annotated plant genome data has led to many discoveries, including genome organization and gene function. 16 These discoveries elucidate the complexity, evolution, and dynamics of plant genomes, contributing to a deeper understanding of plant biology. 6,17 Available genomic information includes cis-elements, gene expression data, protein interactomes, and transcriptional and post-transcriptional data. These genome databases exist as single-species and comprehensive databases, as shown in Table 1.
Genome structure and organization are studied in structural genomics, 13,19 whereas functional genomics investigates the functions, interactions, and regulation of genes and gene products. 5,20 Plant functional genomics is a goldmine for agronomic trait improvement. It incorporates other omics approaches like transcriptomics, proteomics, metabolomics, and phenomics 16,21 (Figure 3). Other aspects of genomics are epigenomics, mutagenomics, and pangenomics. 22 Epigenomics dissects epigenetic changes occurring at the genomic level, such as histone modifications, small RNAs, and DNA methylation. Mutagenomics is used to explore mutation events mediating modified genotype and phenotype in mutant species. Pangenomics studies the whole set of genomic sequences present in the entire population of a species. It also explores dispensable genomes that are individual specific or partially shared. Mutagenomics and pangenomics are new omics techniques in crop sciences. 22,23
Plant transcriptomics involves investigating the control of plant metabolite production processes at the RNA level. Transcript-level gene expression control regulates the whole plant's development and growth. 24 Plant proteomics explores the structural and functional features of proteins in a living organism. It encompasses studies on plants' typical morphological and physiological properties. 16,22 The role of proteins in controlling plant metabolic processes is also studied in plant proteomics, especially in medicinal plants. 25 Plant metabolomics involves profiling primary and secondary metabolites in plants. 26 Metabolic data are useful when developing metabolic correlation networks. These networks can aid comparative analysis across cellular compartments, for example of carbon and nitrogen transport and partitioning in plants. 27 In addition, the molecular and cellular regulation of different enzymatic processes can also be investigated. 25 Plant phenomics involves the systematic study of phenotypes, such as plant composition, growth, and production analysis. This study can be conducted both in controlled environments and in the field. Field phenomics involves the measurement of phenotypes that exist under both cultivated and natural conditions. Studies in controlled environments involve glasshouses, growth chambers, and other systems where growth conditions can be manipulated. 28 These multi-omics approaches have proven successful in plant research, including agronomic trait improvement, over the last few decades. Agronomic traits are desirable genetic or phenotypic features of plants, such as quality traits, disease resistance, pest resistance, insecticide tolerance, and tolerance to temperature, drought, and other adverse environmental factors.
Quality traits encompass morphological features like plant height and seed weight; physiological features like chlorophyll content and photosynthetic rate; 29 economic features like improved crop yield, processing, and storage; and pharmaceutical and industrial features like the elimination of toxins and allergens, increased nutritional or dietary value, and increased medicinal value. 30 Recent research suggests that multi-omics technologies, when integrated, can be better harnessed to improve genetic development, crop breeding science, plant stress resistance, and other agronomic traits. 22,23

2. Major areas of application of omics technologies for agronomic traits improvement

2.1 Genomics-assisted pre-breeding
Genomics-assisted pre-breeding is a genetic manipulation strategy to improve agronomic traits of interest in plants at the DNA level. 31 Genomics-assisted pre-breeding approaches contribute to the efficient development of disease- and climate-resilient crops. 32-34 Before the genomic era, crop breeding across the globe relied on repeated cycles of phenotypic selection and crossing to generate superior crop genotypes. 35 Genome sequence availability has paved the way for identifying all genes and genetic variants associated with agronomic traits. 36,37 It has also made it possible to assess genotype-level changes incurred during breeding processes. 38 Plant breeders have utilized genomics and bioinformatics in gene-level resolution of agronomic variation using quantitative trait loci (QTL) mapping 39-41 and genome-wide association studies (GWAS). 42,43 For instance, studies have recently been conducted to develop multiple-stress-adaptable rice varieties that are disease and climate resilient using genomics-assisted breeding techniques such as QTL mapping, gene/marker-phenotype association, and phenotype selection. 44-46 Pea breeding projects used genetic marker-trait associations to boost yield and market-preferred agronomic traits. 47,48 Miedaner et al. 43 used high-density genotype arrays and comprehensive phenotyping of the same species population across diverse conditions, locations, and seasons in genomic selection and population mapping to speed the breeding of disease resistance traits in maize, small-grain cereals, and wheat. Hu et al. 39 harnessed genomic selection (GS) and genome-wide genetic variants to avoid repeated phenotyping in breeding cycles. These studies point to new breeding techniques, such as speed breeding and genomic selection, as ways to boost genetic and trait improvement. However, the lack of robust phenotypic data limits the efficient utilization of available genomic information and technologies in genomics-assisted breeding.
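The marker-trait association idea behind QTL mapping and GWAS can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions: each marker is scored only by its squared Pearson correlation with the trait, whereas real analyses typically use models that also correct for population structure and relatedness. The marker names, allele dosages, and plant heights are hypothetical toy values, not data from the cited studies.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic marker or constant trait carries no signal
    return sxy / sqrt(sxx * syy)

def rank_markers(genotypes, phenotype):
    """Rank markers by squared correlation with the phenotype (highest first).

    genotypes maps a marker name to the allele dosage (0, 1, or 2) of each plant.
    """
    scores = {m: pearson_r(g, phenotype) ** 2 for m, g in genotypes.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy data: 8 plants, 3 markers, plant height in cm
genotypes = {
    "snp_A": [0, 0, 1, 1, 2, 2, 1, 0],
    "snp_B": [2, 0, 1, 2, 0, 1, 0, 2],
    "snp_C": [1, 1, 1, 1, 1, 1, 1, 1],  # monomorphic, cannot explain variation
}
height_cm = [80, 82, 90, 91, 101, 99, 89, 81]

ranking = rank_markers(genotypes, height_cm)
for marker, r2 in ranking:
    print(f"{marker}: r^2 = {r2:.3f}")
```

In this toy set, the marker whose dosage tracks plant height ranks first, and the monomorphic marker scores zero because it cannot explain any trait variation.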

Evolution and crop diversity
Variation in gene content among individuals within the same species is caused by genetic variation ranging from single-nucleotide polymorphisms to substantial structural variants (SVs). Acted on by human and natural selection, this variation provides the raw material for evolution. 49,50 Variation in the phenotypic and genetic characteristics of agricultural plants is referred to as crop diversity. 36 Plant genomics enhances the understanding of crop diversity at both the species and gene levels. 51 According to recent research, a single reference genome is insufficient to capture a species' entire genetic diversity landscape. Pan-genome analysis provides a platform for evaluating a species' genetic diversity by looking at its whole genomic repertoire. Pan-genomic studies have shed new light on the landscape of diversity and improvement of major crops such as Brachypodium distachyon, 52 Brassica spp., 53

Abiotic and biotic stresses
Biotic stresses in production agriculture are caused by living organisms such as insects, plant-parasitic nematodes, pathogens, or weeds. 65 Genomics approaches to biotic stress include ribonucleic acid interference (RNAi) silencing and transgenesis. RNAi silencing is employed against viruses and some fungi, while transgenesis has been exploited to develop resistance against some fungal diseases, for example, Fusarium head blight. 66 Genome-wide identification and expression analysis in legume crops also revealed the role of small RNA biogenesis mediators in the regulation of biotic stress responses. 42 Abiotic stresses, such as low or high temperatures, heavy metals, insufficient or excessive water, high salinity, and ultraviolet radiation, are hostile to developing plants, resulting in a significant decline in crop yield worldwide. 67 According to Kumar et al., 28 knowledge of plants' response to abiotic stresses can be enhanced by integrating information generated from metabolomics and proteomics with genomics data. Sustaining yield in crops threatened by abiotic stresses is a significant challenge in breeding resilient crop varieties. 68 A study on Brassica oleracea shows that heat stress transcription factors are integral to signal transduction pathways functioning in response to environmental stresses and are suggested to contribute significantly to various stress responses. 62 Heat stress transcription factor genes were identified in the in silico analysis of B. oleracea. The identified genes may be exploited in developing crop varieties resilient to global climate change. 1,62

Population studies
Population genomics is employed to study adaptation and speciation. Population genomics datasets are used in GWAS to detect the genes responsible for adaptive phenotypic variation in large plant population samples. 35,69 For instance, Bamba et al. identified specific adaptation loci in a GWAS and unveiled the molecular basis of genetic trade-offs. The study also showed that ecological fitness could be predicted by the polygenic effects of several loci associated with local climate. 35 The diversity of medicinal plants is of exceptional interest because of their role in ethnomedicine. GWAS on adaptive genotypic and phenotypic variation provide a framework to assess the diversity of medicinal plant application across different cultures and to infer modifications in plant use over time. 70,71 Other genomic approaches such as genomic selection, nested association mapping, genetic diversity analysis, and allele mining have been integrated into crop improvement programs to address the genetic issues associated with maize productivity and nutritional content. 72
Plant omics studies have greatly helped our understanding and interpretation of plant responses to ecological influences and their contribution to key developmental processes important for crop yield and food quality. However, there are still some problems, such as a lack of data integration and of robust techniques for phenotype-genotype correlation. Also, the use of omics-derived knowledge and tools in plant improvement strategies is limited and difficult. As a result, there is a pressing need to promote the integration of computational biology and plant genomics to benefit plant improvement. 6,7

3. Applications of machine learning in plant omics and agronomics

3.1 Machine learning algorithms and resources
Machine learning (ML) is a computer science field that utilizes algorithms to learn and capture the characteristics of target patterns in complex datasets. 73 Machine learning algorithms are generally classified into the following categories: supervised, semi-supervised, unsupervised, reinforcement, and deep learning. 74 A supervised ML algorithm is trained using a labeled dataset. It learns to respond more accurately by comparing its output with the known labels of the training set. 67,73 Semi-supervised algorithms provide a tool that harnesses the potential of both supervised and unsupervised learning. These algorithms are ideally adapted for model building and can be used for classification, regression, and prediction. 75 Unsupervised learning identifies previously unexplained patterns in the data in order to derive rules. It is a statistics-based approach applied to the problem of discovering hidden structure in unlabeled data. 7,74,76 Reinforcement learning is considered an intermediate form of learning, as the algorithm is only told whether its output is correct or not. 75 Deep learning is built on artificial neural networks (ANN). The algorithms extract higher-level features from the raw input using multiple layers of neural networks. Learning can be unsupervised, semi-supervised, or supervised. 73 Machine learning approaches provide unique techniques for integrating and analyzing omics data, allowing for the improvement of crops and other economically important plants. Some machine learning algorithms have been used to develop tools specifically for plant omics analysis. Table 2 highlights the existing machine learning tools for plant omics analysis. Machine learning algorithms have a broad range of applications in plant genomics. These algorithms play vital roles in genome assembly, iterative gene regulatory network inference, and the identification of true SNPs in polyploid plants. 77
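The distinction between supervised and unsupervised learning described above can be made concrete with a small Python sketch. It is an illustration only, assuming a nearest-centroid classifier as the supervised learner and a basic k-means loop as the unsupervised one; neither is claimed to be the method of any tool in Table 2. The two-gene expression values and the "control"/"stressed" labels are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit_nearest_centroid(samples, labels):
    """Supervised: the labels are used to compute one centroid per class."""
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    return {label: centroid(pts) for label, pts in by_label.items()}

def predict(model, sample):
    """Assign the class whose centroid is closest to the sample."""
    return min(model, key=lambda label: sq_dist(model[label], sample))

def kmeans(samples, k, iters=10):
    """Unsupervised: group samples without labels.

    The first k samples seed the cluster centers (a simplification).
    """
    centers = list(samples[:k])
    assign = []
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: sq_dist(centers[j], s))
                  for s in samples]
        for j in range(k):
            members = [s for s, a in zip(samples, assign) if a == j]
            if members:
                centers[j] = centroid(members)
    return assign

# Hypothetical expression levels of two genes in six plants
samples = [(1.0, 1.2), (5.0, 5.1), (0.8, 1.0), (5.2, 4.9), (1.1, 0.9), (4.8, 5.3)]
labels = ["control", "stressed", "control", "stressed", "control", "stressed"]

model = fit_nearest_centroid(samples, labels)   # needs labels
predicted = predict(model, (4.9, 5.0))
clusters = kmeans(samples, k=2)                 # never sees labels
print(predicted, clusters)
```

The supervised model returns a named class because it was given labels, whereas k-means returns only anonymous cluster indices that a researcher must interpret afterwards.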

Precision plant breeding
Precision breeding is a genetic engineering technique that involves crossing organisms of the same species to preserve desirable characteristics and create a stronger hybrid. 84 The traditional statistical methods mainly used in plant breeding strategies are ineffective for plant data analysis because plant features, shaped by environment, genotype, and their interaction, are non-deterministic and nonlinear. 85 Machine learning has enabled effective plant phenotyping and data mining for patterns such as genotype and trait correlation. 39 It has also been successful in genomic selection (GS), a critical method used in plant breeding for selecting plants with genetic gains of interest. Applying different ML algorithms in building GS models has produced robust and accurate predictions. 86,87 Multilayer neural networks (NNs) have been used in genetic value prediction in plant breeding. NN models are efficient in predicting genetic value, regardless of the population size, heritability, or coefficient of variation. Thus, the ANN is promising for genetic value prediction in unbiased experiments. 88 Multilayer NNs have been utilized to select bean genotypes with highly stable phenotypes using data from 13 common bean genotypes from 2002 to 2006. The integration of this model in plant breeding has enabled precise genetic value prediction and selection. 89 Also, deep learning achieved robust prediction accuracy for grain yield compared to the conventional linear statistical methods used in traditional plant breeding when analyzing multiple traits with mixed ordinal, continuous, and binary phenotype data. The predictive performance of univariate and multivariate deep learning models was assessed using a durum wheat (Triticum turgidum var. durum Desf.) dataset. The performance of the deep learning models shows promising potential for accurate genomic prediction in plant breeding. 90
The flexibility of machine learning algorithms makes them a viable alternative to traditional parametric methods for predicting categorical and continuous responses in genomic selection. 91 An ensemble of random forest (RF) and support vector machine (SVM) models was implemented to improve genotype-phenotype classification using manually derived root trait datasets. The combined model accurately identified the most distinguishing root traits and the corresponding cultivar differentiation. The model's performance demonstrates the potential of ML approaches in unbiased cultivar classification and trait selection. 92 Additionally, predictive models have aided the integration of additive and dominance effects in GWAS and have enhanced the prediction of complex agronomic traits in polyploid plant species. For instance, a study revealed the feasibility of genome-wide prediction of potato agronomic traits even though potato is an autotetraploid food crop. It also shows that genome-wide prediction is viable for selecting breeding values in elite germplasm with substantial non-additive genetic variance. 93
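To make the idea of a genomic selection model concrete, the following Python sketch estimates marker effects from a training population and predicts a breeding value for an unphenotyped candidate. It is a hedged illustration: it uses ridge regression, a linear baseline chosen here for brevity, rather than the neural network, deep learning, or ensemble models discussed above. The genotype matrix, yields, penalty value `lam`, and function names are all hypothetical.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ridge(X, y, lam):
    """Return (intercept, marker effects) from ridge regression on centered data."""
    n, p = len(X), len(X[0])
    col_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Normal equations with the penalty on the diagonal: (X'X + lam*I) b = X'y
    xtx = [[sum(r[i] * r[j] for r in Xc) + (lam if i == j else 0.0)
            for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * v for r, v in zip(Xc, yc)) for i in range(p)]
    effects = solve(xtx, xty)
    intercept = y_mean - sum(e * c for e, c in zip(effects, col_mean))
    return intercept, effects

def predict_breeding_value(intercept, effects, genotype):
    """Predicted phenotype for one genotype (allele dosages per marker)."""
    return intercept + sum(e * g for e, g in zip(effects, genotype))

# Hypothetical training population: 6 lines x 4 markers (dosage 0/1/2), yield in t/ha
train_genotypes = [
    [0, 1, 2, 0], [1, 0, 1, 2], [2, 2, 0, 1],
    [0, 2, 1, 1], [1, 1, 0, 0], [2, 0, 2, 2],
]
train_yield = [8.0, 11.0, 14.0, 9.0, 12.0, 12.0]

intercept, effects = fit_ridge(train_genotypes, train_yield, lam=1.0)
candidate = [2, 1, 0, 1]  # genotyped but not yet phenotyped
print(predict_breeding_value(intercept, effects, candidate))
```

The penalty `lam` shrinks marker effects toward zero, which is what keeps the estimates stable when markers outnumber phenotyped lines; in practice its value would be tuned on held-out data.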

Phenomics
Plant phenomics is a systematic study of plant phenotypes. 28 In recent years, plant field phenotyping has received considerable attention because of the possibility of high-throughput analysis of crop fields. 94 The application of machine learning methods and the various technological developments for image analysis have improved the quantitative assessment of crop traits. 92,94,95 For instance, convolutional neural network (CNN)-based detection and analysis of wheat spikes using images of wheat field trials captured over one planting season achieved an average accuracy of 88 to 94% across diverse groups of test images. The CNN's high accuracy shows that it is a robust model for genome-based selection and prediction in plant breeding. 96 Also, the RF algorithm was used in plant image segmentation involving the acquisition and processing of several plant image samples. 97 The predictions made by the model enabled the discovery of various parameters relevant to plant growth. 94

3.4 Stress resilience phenotyping
ML has been exploited in identifying favorable agronomic traits, including abiotic and biotic stress resistance. ML algorithms are integrated into conventional statistical methodologies to optimize the accuracy of plant stress prediction and detection. 96 For instance, 25,000 images of soybean leaflets exposed to various diseases and nutritional perturbations were used to develop a CNN that can infer the image features of the disease types and nutrient deficiencies at high resolution. The prediction accuracy of the ML framework was very close to that of human expert diagnosis. The model can also be adapted to identify, classify, and quantify induced stresses in other plants. 95 Random forest has been used to predict metabolite and transcript markers in drought tolerance prediction using experimental drought-stressed plant field trial datasets. 
The low error rate recorded for the model shows that it could be considered an alternative model for accurate prediction and identification of molecular markers. 98 RF was used to identify suitable feature combinations for phenotypic trait prediction using data derived from various agromanagement treatment experiments. This approach achieved optimal prediction accuracy and improved plant breeding strategies by enabling maximal allocation of stress management resources. 99 Sanz-Carbonell and colleagues used deep sequencing and computational approaches such as PCA and clustering analysis to infer the miRNA-mediated regulatory network of biotic and abiotic stress responses. The study used 24 miRNAs, all of which are known to alter expression significantly under stressful conditions. The analysis led to the inference that target genes of miRNAs down-regulated under stress conditions contribute to the plant response to stress, whereas miRNAs that are up-regulated control genes associated with growth and development. 67 Soybean fields were screened for tolerance to iron deficiency chlorosis (an abiotic stress in soybean) using linear discriminant analysis (LDA) and SVM. The phenotypic data obtained from soybean fields were used to train the models and predict iron deficiency chlorosis stress in soybeans. The ML application has helped evaluate the severity of stress in real time in the soybean sector. 95

Plant-pathogen interaction and diseases prediction
Plant diseases and pests pose a significant threat to agriculture. Early identification of plant diseases and pests would aid in developing effective treatment strategies and mitigate economic losses. 100 Diverse ML approaches for precise disease recognition and prediction have been implemented in plant populations. 101,102 Neural networks (NNs) have achieved impressive results in plant disease prediction using image classification. A deep convolutional network was implemented in a leaf image classification model for disease recognition. The developed model showed high predictive performance in distinguishing plant leaves from their surroundings and in distinguishing 13 plant disease types from healthy leaves. 103 In another approach, a heterogeneous ensemble of deep-learning-based neural network models was used to detect tomato plant diseases and pests using images collected on-site by imaging devices of varying resolutions.
The ensemble model successfully handled image complexity in the plant's surrounding area and recognized nine different diseases and pests. 100,104 Therefore, deep CNNs are promising for the automatic classification and detection of disease traits from leaf images. In addition, CNNs have shown optimal performance when implemented in plant-pathogen interaction, pest, and disease recognition in some studies. These studies include prediction of pest and disease occurrence in cotton, 105 rice plant disease and pest recognition, 106 rice blast disease prediction, 107,108 and image-based potato tuber disease detection. 109 The CNN model is a high-performing method for detecting plant diseases, and it can be implemented and optimized for practical applications.
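What a CNN computes on a leaf image can be sketched with a single convolution step in Python. This is a minimal illustration, not a reconstruction of any cited model: it applies one hand-written 2x2 kernel to a tiny grayscale patch, whereas a trained CNN learns many kernels and stacks several such layers. The patch values and the kernel are hypothetical.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2-D cross-correlation, the core operation of a CNN layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def relu(feature_map):
    """Rectified linear activation: keep positive responses, zero out the rest."""
    return [[max(0.0, v) for v in row] for row in feature_map]

# Hypothetical 5x5 grayscale leaf patch: healthy tissue (1.0) with a dark
# lesion (0.0) occupying the two right-hand columns
patch = [[1.0, 1.0, 1.0, 0.0, 0.0] for _ in range(5)]

# Hand-written kernel that responds where brightness drops from left to right
kernel = [[1.0, -1.0],
          [1.0, -1.0]]

feature_map = relu(conv2d(patch, kernel))
for row in feature_map:
    print(row)
```

The feature map is non-zero only along the boundary between healthy and lesioned tissue, which is the kind of local cue that deeper layers combine into disease-specific patterns.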
SVM has also been used for weather-based rice blast prediction and has proven suitable for plant disease forecasting with high predictive accuracy. A world-first SVM-based web server for rice blast prediction was developed. Plant scientists and farmers have benefited from this tool, especially in their decision-making. 110 In the pixel-wise quantification and identification of powdery mildew-diseased barley tissue, SVM classification was used to identify hypersensitive response spots using multispectral imaging of diseased barley plants. The SVM application enabled precise automatic identification of barley interaction with powdery mildew. 111 Recently, a data-driven ML approach named ApoplastP was proposed. An RF classifier is the base algorithm of ApoplastP and has shown high performance in predicting protein localization to the plant apoplast. At first, differences in the constituents of apoplastic and intracellular plant proteins were unknown. However, the advent of ApoplastP enabled the exploration of differences in the composition of plant proteins. The plant apoplast is integral to plant-pathogen interactions, transport, and intercellular signaling. Hence, integrating and optimizing machine learning algorithms in apoplastic localization prediction will aid functional studies and help predict whether an effector will localize to the apoplast or enter the plant cells. 112 Also, RF has been implemented to build an inter-species protein-protein interaction (PPI) prediction model using Arabidopsis-pathogen PPI data acquired both experimentally and from public PPI databases (UniProt). A critical assessment of the model's performance showed that integrating random forest with linear statistical methods, using sequence information and network attributes as model features, resulted in a substantial and robust improvement in performance. 24
In addition, RF classification has been used to exploit the potential of protein biomarkers in precision breeding using biomarker assays of 104 potato (Solanum tuberosum) peptides. These peptides were selected using diagonal linear discriminant analysis, bagging, principal component analysis (PCA), and SVM and then classified with RF classifiers. The application of these ML algorithms helped identify Phytophthora infestans resistance in leaves and tubers, and its effect on plant yield, using potato leaf secretome data. 109 Early disease detection enables farmers to use timely and targeted crop protection strategies. With the use of ML, researchers have improved the accuracy of object detection and recognition systems dramatically.

3.6 Challenges and future outlook
ML applications in plant genomics and agronomics have contributed substantially to the efficient breeding of crops with desirable agronomic traits, plant phenotyping, genetic trait prediction, and precise disease prediction in crops such as rice, soybeans, maize, and beans. 72,95,106,110 However, several limitations still exist. Firstly, the black-box nature of some sophisticated ML algorithms inhibits interpretation. The plant research community is more interested in the biological implications of a prediction than in the accuracy of the predictive model. Hence, there is a need for further processing and careful interpretation of the predictive model output using conforming biological knowledge. Additionally, the dimensionality of omics datasets poses challenges such as multicollinearity, overfitting, and sparsity, which are difficult to avoid. Although contemporary machine learning methods and huge sample sizes can partially alleviate these problems, model accuracy can be significantly enhanced by using different fine-tuning, augmentation, and optimization techniques. 107 Data integration from various sources is necessary for GS-assisted breeding and other trait improvement approaches. 113 Simultaneous analysis of multiple omics datasets can advance our understanding of complex biological phenomena. 78,79 Another challenge is the limited and inconsistent phenotypic information on plant-pathogen interactions. The ML models used in plant disease recognition can be extended by enriching the plant disease database with plant-pathogen interaction phenotype data. Developing more robust classification algorithms with an expanded number of disease classes will improve plant disease recognition and forecasting. 103,105,106,108 Finally, a comprehensive plant database must be constructed to facilitate comparative studies and promote research collaborations on critical plant science problems.
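The overfitting risk noted above for high-dimensional omics data is usually assessed by evaluating a model on samples it was not trained on. The Python sketch below shows one standard way to organize this, k-fold cross-validation; the technique is not named in the text and is offered here only as an illustrative assumption, with a hypothetical helper `k_fold_indices` and an arbitrary choice of 10 samples and 3 folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs that partition range(n_samples) into k folds."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    # A model would be fitted on train_idx and scored on test_idx here
    print("train:", train_idx, "test:", test_idx)
```

Each plant appears in exactly one test fold, so the averaged test score reflects performance on unseen samples; in practice the sample order would be shuffled first, and related lines kept in the same fold to avoid leakage.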

Conclusion
Machine learning has shown tremendous promise in studying enormous high-dimensional datasets, although its application in plant molecular studies is still limited. An in-depth understanding of ML models will stimulate ML implementation for plant biological data analysis. As sequenced plant genome data continues to accumulate, ML will accelerate all plant genomic research fields, including identifying genes associated with biotic and abiotic stress resistance and other genes with significant functions, understanding gene regulation mechanisms, exploring the genetic framework of plant genomes, and estimating breeding values. These advancements would help agricultural researchers improve the quality and yield of crops with stronger tolerance to abiotic and biotic stress and other plant health-threatening issues.

Data availability statement
Extended data
OSF: Extended data for "Machine learning algorithms: their applications in plant omics and agronomic traits' improvement", https://doi.org/10.17605/OSF.IO/TE6GC. 14
Files included: Supplementary