
Machine learning algorithms: their applications in plant omics and agronomic traits’ improvement

[version 1; peer review: 3 not approved]
PUBLISHED 04 Nov 2022

This article is included in the Artificial Intelligence and Machine Learning gateway.

This article is included in the Plant Science gateway.

This article is included in the Plant Computational and Quantitative Genomics collection.

Abstract

Agronomic traits of plants, especially those of economic or aesthetic importance, are threatened by climatic and environmental factors such as climate change and biotic and abiotic stresses. These threats are now being mitigated through the analysis of omics data from genomics, transcriptomics, proteomics, metabolomics, and phenomics. The emergence of high-throughput omics technology has led to an avalanche of plant omics data. Plant research demands novel analytical paradigms to extract and harness large plant omics datasets effectively and efficiently for plant improvement. Machine learning algorithms are well-suited analytical and computational approaches for the integrative analysis of large, unstructured, heterogeneous datasets. This study presents an overview of omics approaches to improve plant agronomic traits and of crucial curated plant genomic data sources. Furthermore, we summarize machine learning algorithms and software tools/programming packages used in plant omics research. Lastly, we discuss advancements in the application of machine learning algorithms to improving agronomic traits of economically important plants. Extensive application of machine learning would advance plant omics studies. These advancements would consequently help agricultural scientists improve the quality, yield, and tolerance of economically important plants against abiotic and biotic stresses and other plant health-threatening issues.

Keywords

Agronomic traits, machine learning, multi-omics, plant improvement

1. Introduction

The global agricultural system is threatened by ecological events such as climate change and other environmental stresses.1,2 These events affect the yield, stability, and quality of production of economically important plants, e.g., medicinal plants, fruit crops, food crops, cereal or grain crops, and legume seed crops or pulses.1,3 These challenges are addressed by omics approaches via numerous unconventional improvement methodologies.4,5 Omics approaches involve analysis of the genome sequence and of the other macromolecules generated from the encoded genomic information. However, the utilization of omics-derived knowledge and technologies in plant improvement strategies remains limited and difficult. Other drawbacks of omics technologies include a lack of data integration and of effective phenotype-genotype correlation strategies. As a result, promoting the integration of computational biology and plant genomics to assist plant development is critical.6,7 This paper gives an overview of omics methodologies for improving plant agronomic traits as well as essential curated plant genomic data sources. We discuss the bioinformatics software, tools, and packages that are utilized in omics-based plant improvement research. We also examine how machine learning algorithms have been used to improve agronomic features of commercially significant plants, their major contributions, and their future outlook in plant omics and agronomics.

1.1 Plant genome sequences and bioinformatics resources

Genome sequencing was made possible by the advent of sequencing technologies.8,9 Complete sequencing of a plant genome was first demonstrated for the model plant Arabidopsis (Arabidopsis thaliana)10,11 and afterward for rice (Oryza sativa). Subsequently, the whole genomes of over 250 species in the plant kingdom have been sequenced, spanning bryophytes, pteridophytes, gymnosperms, and angiosperms12,13 (Figure 1). Angiosperms account for about 94% of the sequenced species, most of which are economically important plants or their wild relatives (Figure 2). Food crops like rice, wheat, beans, oat, maize, and soybean are among the sequenced plants, as are ornamental plants like orchid and hibiscus, industrial plants like oilseed and hemp, and spices/herbs like garlic, ginger, turmeric, moringa, artemisia, and neem, which are known for their high therapeutic value.


Figure 1. Published plant genome sequences from 2000 to date.

Most sequenced plants are angiosperms, which are subdivided here into three groups. Most of the sequenced angiosperms fall under the rosid and asterid clades. Other sequenced angiosperm clades are grouped here as other dicots.


Figure 2. Percentage of sequenced plants with their common names.

94% of sequenced plants are angiosperms, comprising both monocots and dicots. The remaining shares are 1%, 2%, and 3% for pteridophytes, bryophytes, and gymnosperms, respectively.

A variety of databases have been created to provide access to plant genome datasets.11,14 The genome database of the model plant A. thaliana, launched in 2001, was the first plant genome database.10,15 Subsequently, many databases and resources have been developed for plant genomes. The earliest genome databases were essentially archives of genome sequence data. These databases have since expanded into genome portals/hubs that combine different genomic data and web servers that offer online genomics analysis. The availability of annotated plant genome data has led to many discoveries, including insights into genome organization and gene function.16 These discoveries elucidate the complexity, evolution, and dynamics of plant genomes, contributing to a deeper understanding of plant biology.6,17 Available genomic information includes cis-elements, gene expression data, protein interactomes, and transcriptional and post-transcriptional data. These genome databases exist as single-species and comprehensive databases, as shown in Table 1.

Table 1. General plant genomics databases and tools.

Database | Description | Website
AgBase | A unified resource for functional analysis in agriculture | http://www.agbase.msstate.edu/
Ensembl Plants | A genome-centric portal for plant species | http://plants.ensembl.org
AutoSNPdb | An annotated single nucleotide polymorphism database for crop plants | http://autosnpdb.qfab.org.au/
BarleyBase | An expression profiling database for plant genomics | http://www.barleybase.org/
CR-EST | A resource for crop ESTs: search for sequence, classification, clustering, and annotation data of crop EST projects | http://pgrc.ipk-gatersleben.de/cr-est/
CSRDB | A small RNA integrated database and browser resource for cereals | http://sundarlab.ucdavis.edu/smrnas/
ChromDB | The Chromatin Database | http://www.chromdb.org/
DRASTIC-INSIGHTS | Querying information in a plant gene expression database | http://www.drastic.org.uk/
FLAGdb++ | A Database for the Functional Analysis of the Arabidopsis Genome | http://urgv.evry.inra.fr/projects/FLAGdb++/HTML/index.shtml
GCP | The Generation Challenge Programme | http://www.generationcp.org/
GGT | Graphical GenoTypes - Software for visualization and analysis of genetic data | http://www.plantbreeding.wur.nl/
GabiPD | Integrative Plant Omics Database | http://www.gabipd.org/
GeneSeqer@PlantGDB | Gene structure prediction in plant genomes | http://www.plantgdb.org/PlantGDB-cgi/GeneSeqer/PlantGDBgs.cgi
GrainGenes | The genome database for small-grain crops | http://wheat.pw.usda.gov/index.shtml
Gramene | A resource for comparative grass genomics | http://www.gramene.org/
MIPS | Analysis and annotation of genome information | http://mips.gsf.de/
MetaCrop | A detailed database of crop plant metabolism | http://metacrop.ipk-gatersleben.de/
NIASGBdb | National Institute of Agrobiological Sciences Gene Bank DataBase | http://www.gene.affrc.go.jp/databases_en.php
P3DB | Plant Protein Phosphorylation Database | http://digbio.missouri.edu/p3db/
PHYTOPROT | A Database of Clusters of Plant Proteins | https://urgi.versailles.inra.fr/phytoprot/
PIP | A database of potential intron polymorphism markers | http://ibi.zju.edu.cn/pgl/pip/
PLACE | Plant cis-acting regulatory DNA elements | http://www.dna.affrc.go.jp/PLACE/
PLANT-PIs | A database for protease inhibitors and their genes in higher plants | http://plantpis.ba.itb.cnr.it/
PLecDom | Plant Lectin Domains server | http://www.nipgr.res.in/plecdom.html
PMRD | Plant MicroRNA Database | http://bioinformatics.cau.edu.cn/PMRD/
PODB | The Plant Organelles Database | http://podb.nibb.ac.jp/Organellome
POGs/PlantRBP | A resource for comparative genomics in plants | http://cas-pogs.uoregon.edu/#/
PREP Suite | Predictive RNA Editor for Plants | http://prep.unl.edu/
PRGDB | Plant Resistance Genes DataBase | http://prgdb.cbm.fvg.it/
PathoPlant® | A platform for microarray expression data to analyze co-regulated genes involved in plant defense responses | http://www.pathoplant.de/
Phytome | A platform for plant comparative genomics | http://www.phytome.org/
Plant MPSS databases | Signature-based transcriptional resources for analyses of mRNA and small RNA | http://mpss.udel.edu/
Plant snoRNA database | Search for comprehensive information on small nucleolar RNAs in plants | http://bioinf.scri.sari.ac.uk/cgi-bin/plant_snorna/home
PlantCARE | A database of plant cis-acting elements | http://bioinformatics.psb.ugent.be/webtools/plantcare/html/
PlantGDB | Plant Genome Database and Analysis tools | http://www.plantgdb.org/
PlnTFDB | Database of Plant Transcription Factors | http://www.softberry.com/berry.phtml?topic=plantprom&group=data&subgroup=plantprom
PlantTribes | A gene and gene family resource for comparative genomics in plants | http://fgp.huck.psu.edu/tribe.html
PlantTFDB | Plant Transcription Factor Databases | http://planttfdb.cbi.pku.edu.cn/
PmiRKB | Plant MicroRNA Knowledge Base - Find information about plant microRNAs | http://bis.zju.edu.cn/pmirkb/
SALAD | Surveyed conserved motif ALignment diagram and the Associating Dendrogram | http://salad.dna.affrc.go.jp/salad/en/
The Adaptive Evolution Database (TAED) | A phylogeny-based comparative genomics tool | http://www.bioinfo.no/tools/TAED
The Plant DNA C-values Database | Search for information on plant DNA C-values and genome sizes | http://data.kew.org/cvalues/homepage.html
The PlantsP Functional Genomics Database | Search for information on plant kinases and phosphatases | http://plantsp.sdsc.edu/
The PlantsT Functional Genomics Databases | Search for genes and proteins involved in plant membrane transportation | http://plantst.sdsc.edu/
The TIGR Plant Repeat Databases | A Collective Resource for the Identification of Repetitive Sequences in Plants | http://www.tigr.org/tdb/e2k1/plant.repeats/index.shtml
The TIGR Plant Transcript Assemblies database | Search for plant EST and cDNA sequences from this comprehensive collection | http://plantta.tigr.org/
TropGENE-DB | A Multi-Tropical Crop Information System | http://tropgenedb.cirad.fr/
UK CropNet | A collection of databases and bioinformatics resources for crop plant genomics | http://ukcrop.net/
openSputnik | A database for annotated expressed sequence tag information and comparative plant genomics analysis | http://sputnik.btk.fi/

1.2 Plant omics technologies

Plant studies involving the analysis of biological macromolecules are collectively termed plant omics. Omics is a broad field of study encompassing subfields like genomics, transcriptomics, proteomics, metabolomics, phenomics, glycomics, and lipidomics. Plant genomics involves studying the composition, organization, function, and structure of genetic material (DNA/RNA) and the molecular genetic networks of interactions in the plant genome.17,18 While genome structure and organization are studied in structural genomics,13,19 functional genomics investigates the functions, interactions, and regulation of genes and gene products.5,20

Plant functional genomics is a goldmine for agronomic trait improvement. It incorporates other omics approaches like transcriptomics, proteomics, metabolomics, and phenomics.16,21 (Figure 3). Other aspects of genomics are epigenomics, mutagenomics, and pangenomics.22 Epigenomics examines epigenetic changes occurring at the genome level, such as histone modifications, small RNAs, and DNA methylation. Mutagenomics is used to explore the mutation events that give rise to modified genotypes and phenotypes in mutant species. Pangenomics studies the whole set of genomic sequences present in the entire population of a species. It also explores dispensable genomes that are individual-specific or partially shared. Mutagenomics and pangenomics are new omics techniques in crop sciences.22,23


Figure 3. Important plant omics branches and their major techniques.

Representation of major omics approaches in plant molecular studies and the methods utilized in conventional analysis of plant omics datasets.

Plant transcriptomics involves investigating the control of plant metabolite production processes at the RNA level. Transcript-level gene expression control regulates the whole plant's development and growth.24 Plant proteomics explores the structural and functional features of proteins in a living organism. It encompasses studies on plants’ typical morphological and physiological properties.16,22 The role of proteins in controlling the plant metabolic processes is also studied in plant proteomics, especially in medicinal plants.25

Plant metabolomics involves profiling primary and secondary metabolites in plants.26 Metabolic data are useful when developing metabolic correlation networks. These networks can aid comparative analysis of cellular compartments, for example of carbon and nitrogen transport and partitioning in plants.27 In addition, the molecular and cellular regulation of different enzymatic processes can also be investigated.25 Plant phenomics involves the systematic study of phenotypes, such as plant composition, growth, and production. This study can be conducted both in controlled environments and in the field. Field phenomics involves the measurement of phenotypes that exist under both cultivated and natural conditions. Studies in controlled environments involve glasshouses, growth chambers, and other systems where growth conditions can be manipulated.28 These multi-omics approaches have proven successful in plant research, including agronomic trait improvement, over the last few decades. Agronomic traits are desirable genetic or phenotypic features of plants, e.g., quality traits, disease resistance, pest resistance, insecticide tolerance, and tolerance to temperature, drought, and other adverse environmental factors. Quality traits encompass morphological features like plant height and seed weight; physiological features like chlorophyll content and photosynthetic rate29; economic features like improved crop yield, processing, and storage; and pharmaceutical and industrial features like the elimination of toxins and allergens, increased nutritional or dietary value, and increased medicinal value.30 Recent research suggests that multi-omics technologies, when integrated, can be better harnessed to improve genetic development, crop breeding science, plant stress resistance, and other agronomic traits.22,23

2. Major areas of application of omics technologies for agronomic traits improvement

2.1 Genomics-assisted pre-breeding

Genomics-assisted pre-breeding is a genetic manipulation strategy to improve agronomic traits of interest in plants at the DNA level.31 Genomics-assisted pre-breeding approaches contribute positively to the efficient development of disease- and climate-resilient crops.32-34 Before the genomic era, crop breeding across the globe relied on a series of phenotypic selections and crosses to generate superior crop genotypes.35 Genome sequence availability has paved the way for identifying all genes and genetic variants associated with agronomic traits.36,37 It has also made it possible to assess genotype-level changes incurred during breeding processes.38 Plant breeders have utilized genomics and bioinformatics for gene-level resolution of agronomic variation using quantitative trait loci (QTL) mapping39-41 and genome-wide association studies (GWAS).42,43 For instance, studies have recently been conducted to develop multiple-stress-adaptable rice varieties that are disease and climate resilient using genomics-assisted breeding techniques such as QTL mapping, gene/marker-phenotype association, and phenotypic selection.44-46 Pea breeding projects used genetic marker-trait associations to boost yield and market-preferred agronomic traits.47,48 Miedaner et al.43 used high-density genotyping arrays and comprehensive phenotyping of the same populations across diverse conditions, locations, and seasons in genomic selection and population mapping to speed the breeding of disease resistance traits in maize, small-grain cereals, and wheat. Hu et al.39 harnessed genomic selection (GS) and genome-wide genetic variants to avoid repeated phenotyping in breeding cycles. These studies point to new breeding techniques, such as speed breeding and genomic selection, as means to boost genetic and trait improvement. However, the lack of robust phenotypic data limits the efficient utilization of available genomic information and technologies in genomics-assisted breeding.

2.2 Evolution and crop diversity

Variation in gene content among individuals within the same species is caused by genetic variation ranging from single-nucleotide polymorphisms to substantial structural variants (SVs). This variation provides the raw material on which human and natural selection act, and hence on which evolution occurs.49,50 Variation in the phenotypic and genetic characteristics of agricultural plants is referred to as crop diversity.36 Plant genomics enhances the understanding of crop diversity at both the species and gene levels.51 According to recent research, a single reference genome is insufficient to capture a species' entire genetic diversity landscape. Pan-genome analysis provides a platform for evaluating a species' genetic diversity by looking at its whole genomic repertoire. Pan-genomic studies have shed new light on the landscape of diversity and improvement of major crops such as Brachypodium distachyon,52 Brassica spp.,53-55 maize,56 rice,57,58 soybean,59 and wheat.60 Evolution in plant diversity is correlated with, and relatively predictable from, heterogeneous biotic and abiotic environmental stresses induced by global climate change. These stresses, in turn, affect crop yield and crop-growing seasons.61 A study on natural plant populations shows that the organization and evolution of plant populations' diversity across genomic regions is nonrandom at the molecular and organismal levels.62,63 Therefore, plants can evolve along climatic gradients, resulting in clinal adaptation. Hence, the breeding of climate-resilient crops can be facilitated by understanding the genomic basis of clinal adaptation in crop species.64
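
As a concrete illustration of the pan-genome idea described above, the following minimal Python sketch partitions genes into a core set (present in every accession) and a dispensable set (absent from at least one accession). The accession names and gene identifiers are invented for illustration and do not come from any of the cited studies; real pan-genome pipelines work from assemblies and gene presence/absence calls rather than hand-written sets.

```python
# Hypothetical gene presence data: accession -> set of gene IDs (illustrative only).
accessions = {
    "accession_A": {"g1", "g2", "g3", "g4"},
    "accession_B": {"g1", "g2", "g4", "g5"},
    "accession_C": {"g1", "g2", "g6"},
}


def partition_pangenome(gene_sets):
    """Split the pan-genome into core and dispensable genes."""
    sets = list(gene_sets.values())
    pan = set().union(*sets)          # every gene seen in any accession
    core = set.intersection(*sets)    # genes shared by all accessions
    dispensable = pan - core          # genes missing from at least one accession
    return pan, core, dispensable


pan, core, dispensable = partition_pangenome(accessions)
print("pan-genome size:", len(pan))
print("core genes:", sorted(core))
print("dispensable genes:", sorted(dispensable))
```

In this toy case a single reference (say, accession_A) would miss g5 and g6 entirely, which is the limitation of single-reference analysis noted in the text.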

2.3 Abiotic and biotic stresses

Biotic stresses are caused by living organisms such as insects, plant-parasitic nematodes, pathogens, or weeds in production agriculture.65 Genomics approaches to biotic stress include RNA interference (RNAi) silencing and transgenesis. RNAi silencing is employed against viruses and some fungi, while transgenesis has been exploited to develop resistance against some fungal diseases, for example, Fusarium head blight.66 Genome-wide identification and expression analysis in legume crops also revealed the role of small RNA biogenesis mediators in regulating biotic stress responses.42

Abiotic stresses, such as low or high temperatures, heavy metals, insufficient or excessive water, high salinity, and ultraviolet radiation, are hostile to developing plants and result in significant declines in crop yield worldwide.67 According to Kumar et al.,28 knowledge of plants' responses to abiotic stresses can be enhanced by integrating information generated from metabolomics and proteomics with genomics data. Sustaining yield in crops threatened by abiotic stresses is a significant challenge in breeding resilient crop varieties.68 A study on Brassica oleracea shows that heat stress transcription factors are integral to the signal transduction pathways that function in response to environmental stresses and are suggested to contribute significantly to various stress responses.62 Heat stress transcription factor genes were identified in an in silico analysis of B. oleracea. The identified genes may be exploited in developing crop varieties resilient to global climate change.1,62

2.4 Population studies

Population genomics is employed to study adaptation and speciation. Population genomics datasets are used in GWAS to detect the genes responsible for adaptive phenotypic variation in large plant population samples.35,69 For instance, Bamba et al. identified specific adaptation loci in a GWAS and unveiled the molecular basis of genetic trade-offs. The study also showed that ecological fitness could be predicted by the polygenic effects of several loci associated with local climate.35 The diversity of medicinal plants is of exceptional interest because of their role in ethnomedicine. GWAS on adaptive genotypic and phenotypic variation provide a framework to assess the diversity of medicinal plant applications across different cultures and to infer changes in plant use over time.70,71 Other genomic approaches, such as genomic selection, nested association mapping, genetic diversity analysis, and allele mining, have been integrated into crop improvement programs to address the genetic issues associated with maize productivity and nutritional content.72
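
The core idea of a GWAS scan, testing each marker in turn for association with a trait, can be sketched as follows. This is a deliberately simplified illustration: it ranks markers by the squared Pearson correlation between allele dosage and phenotype, whereas real GWAS typically use mixed models that correct for population structure and relatedness and apply multiple-testing control. The genotype matrix and phenotype values are invented, and the function names are hypothetical.

```python
# Toy genotype matrix: rows = plants, columns = SNP markers coded 0/1/2
# (count of the alternate allele). All values are invented for illustration.
genotypes = [
    [0, 1, 2],
    [1, 0, 2],
    [2, 1, 0],
    [0, 2, 1],
    [1, 1, 1],
    [2, 0, 0],
]
# Hypothetical phenotype (e.g., plant height), constructed so marker 0 drives it.
phenotype = [10.1, 12.0, 13.9, 10.0, 12.1, 14.0]


def mean(values):
    return sum(values) / len(values)


def marker_r2(dosage, trait):
    """Squared Pearson correlation between one marker and the trait."""
    mx, my = mean(dosage), mean(trait)
    sxy = sum((a - mx) * (b - my) for a, b in zip(dosage, trait))
    sxx = sum((a - mx) ** 2 for a in dosage)
    syy = sum((b - my) ** 2 for b in trait)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic marker or constant trait carries no signal
    return (sxy * sxy) / (sxx * syy)


def scan(genos, trait):
    """Score every marker and return (marker index, r2) sorted by r2, best first."""
    scores = []
    for j in range(len(genos[0])):
        column = [row[j] for row in genos]
        scores.append((j, marker_r2(column, trait)))
    return sorted(scores, key=lambda item: item[1], reverse=True)


ranking = scan(genotypes, phenotype)
for marker, r2 in ranking:
    print(f"marker {marker}: r2 = {r2:.3f}")
```

Because the toy phenotype was built from marker 0, that marker ranks first; in real data the top-ranked markers are candidates that still require statistical and biological validation.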

Plant omics studies have greatly helped our understanding and interpretation of plant responses to ecological influences and their contribution to key developmental processes important for crop yield and food quality. However, there are still some problems, such as a lack of data integration and robust techniques for phenotype-genotype correlation. Also, the use of omics-derived knowledge and tools in plant improvement strategies is limited and difficult. As a result, there is a pressing need to promote the integration of computational biology and plant genomics to benefit plant improvement.6,7

3. Applications of machine learning in plant omics and agronomics

3.1 Machine learning algorithms and resources

Machine learning (ML) is a field of computer science that uses algorithms to learn and capture the characteristics of target patterns in complex datasets.73 Machine learning algorithms are generally classified into the following categories: supervised, semi-supervised, unsupervised, reinforcement, and deep learning.74 A supervised ML algorithm is trained using a labeled dataset. It learns to respond more accurately by comparing its output for each training input with the corresponding known label.67,73 Semi-supervised algorithms harness the potential of both supervised and unsupervised learning by combining labeled and unlabeled data. These algorithms are well adapted for model building and can be used for classification, regression, and prediction.75 Unsupervised learning identifies previously unknown patterns in data without labels; it is a statistics-focused approach applied to the problem of discovering hidden structure in unlabeled data.7,74,76
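
The difference between supervised and unsupervised learning described above can be made concrete with a small sketch. The example below uses the same toy measurements twice: once with labels (a nearest-centroid classifier, a very simple supervised method) and once without (two-cluster k-means, a standard unsupervised method). The data points, class names, and function names are all invented for illustration and are not drawn from the cited works.

```python
import math

# Toy 2-D measurements (e.g., two scaled trait values per plant); invented data.
points = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
labels = ["susceptible"] * 3 + ["tolerant"] * 3  # hypothetical class labels


def centroid(pts):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coord) / len(pts) for coord in zip(*pts))


def nearest(point, centers):
    """Index of the closest center by Euclidean distance."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


# Supervised: the labels guide the model (nearest-centroid classifier).
def fit_supervised(pts, labs):
    classes = sorted(set(labs))
    centers = [centroid([p for p, l in zip(pts, labs) if l == c]) for c in classes]
    return classes, centers


def predict(point, model):
    classes, centers = model
    return classes[nearest(point, centers)]


# Unsupervised: no labels; k-means (k = 2) discovers groups from the data alone.
def two_means(pts, iters=10):
    centers = [pts[0], pts[-1]]  # simple deterministic initialisation
    assignment = []
    for _ in range(iters):
        assignment = [nearest(p, centers) for p in pts]
        for i in range(2):
            members = [p for p, a in zip(pts, assignment) if a == i]
            if members:  # keep the old center if a cluster becomes empty
                centers[i] = centroid(members)
    return assignment, centers


model = fit_supervised(points, labels)
predicted = predict((3.9, 4.1), model)
clusters, _ = two_means(points)
print("supervised prediction:", predicted)
print("unsupervised cluster assignment:", clusters)
```

Note that the clustering recovers the same two groups but cannot name them; attaching meaning such as "tolerant" to a cluster still requires labels or expert interpretation.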

Reinforcement learning is considered an intermediate form of learning because the algorithm is provided only with feedback that indicates whether its output is correct.75 Deep learning is built on artificial neural networks (ANNs). These algorithms extract higher-level features from the raw input using multiple layers of neural networks, and learning can be unsupervised, semi-supervised, or supervised.73 Machine learning approaches provide unique techniques for integrating and analyzing omics data, allowing for the improvement of crops and other economically important plants. Some machine learning algorithms have been used to develop tools specifically for plant omics analysis; Table 2 highlights the existing tools. Machine learning algorithms have a broad range of applications in plant genomics, where they play vital roles in genome assembly, iterative gene regulatory network inference, and the identification of true SNPs in polyploid plants.77

Table 2. Existing machine learning tools for plant omics data analysis.

Application area | Developed tools | URL | Algorithms | Selected features
Plants Mitochondrially Localized Proteins Prediction | MU-LOC78 | http://mu-loc.org/ | SVM and DNN | gene co-expression information, protein position weight matrix, amino acid compositions, and N-terminal sequence information
Plant resistance protein NBS-LRR prediction | NBSPred79 | http://soilecology.biol.lu.se/nbs/ | SVM | R-protein and non-R-protein sequences attribute like sequence domain and compositional frequencies
Ribosomal proteins (RPs) prediction | RAMA80 | http://inctipp.bioagro.ufv.br:8080/Rama. | MLP, RF, and NB | Amino acid side chains attributes
Plant disease resistance proteins prediction | DRPPP81 | http://14.139.240.55/NGS/download.php | SVM, MLP, and RF | genomic sequence (satellite DNAs)
Geminiviruses Gene and genera classification | Fangorn Forest (F2)82 | www.geminivirus.org:8080/geminivirusdw/discoveryGeminivirus.jsp. | SVM, MLP, and RF | genomic sequence (satellite DNAs)
Transcriptomes for stress responses in Arabidopsis | mIDNA83 | www.plantcell.org/content/26/2/520.short#def-8 | RF with PSOL algorithm, SVM, and NN | patterns of 32 known stress-related gene expression traits and the complementary expression characteristic

3.2 Precision plant breeding

Precision breeding is a genetic engineering technique that involves reproducing organisms of the same species together to preserve desirable characteristics and create a stronger hybrid.84 Traditional statistical methods commonly used in plant breeding are often ineffective for plant data analysis because plant features are non-deterministic and nonlinear, being shaped by environment, genotype, and their interaction.85 Machine learning has enabled effective plant phenotyping and data mining for patterns such as genotype-trait correlations.39 It has also been successful in genomic selection (GS), a critical method for selecting plants with genetic gains of interest in plant breeding. Applying different ML algorithms to build GS models has produced robust and accurate predictions.86,87 Multilayer neural networks (NNs) have been used for genetic value prediction in plant breeding. NN models are efficient in predicting genetic value, regardless of the population size, heritability, or coefficient of variation. Thus, ANNs are promising for genetic value prediction in unbiased experiments.88 Multilayer NNs have been utilized to select bean genotypes with highly stable phenotypes using data on 13 common bean genotypes from 2002 to 2006. The integration of this model in plant breeding has enabled precise genetic value prediction and selection.89

Deep learning has also achieved robust prediction accuracy for grain yield relative to the conventional linear statistical methods used in traditional plant breeding when analyzing multiple traits with mixed ordinal, continuous, and binary phenotype data. The predictive performance of univariate and multivariate deep learning models was assessed using a durum wheat (Triticum turgidum var. durum Desf.) dataset. The results indicate that deep learning has promising potential for accurate genomic prediction in plant breeding.90 The flexibility of machine learning algorithms makes them a viable alternative to traditional parametric methods for predicting categorical and continuous responses in genomic selection.91
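
To give a sense of the kind of linear baseline against which machine learning models are usually compared in genomic selection, the sketch below fits a ridge regression of a trait on marker genotypes and then predicts the trait of an unphenotyped line. It is a minimal illustration only and is not the model used in any of the cited studies: the marker matrix, trait values, penalty (lam), learning rate, and function names are all assumptions made for this example.

```python
# Toy training data: marker genotypes coded 0/1/2 and observed trait values.
# All numbers are invented for illustration.
train_X = [
    [0, 2, 1, 0],
    [1, 1, 0, 2],
    [2, 0, 1, 1],
    [0, 1, 2, 2],
    [2, 2, 0, 0],
    [1, 0, 2, 1],
]
train_y = [4.1, 4.4, 5.0, 3.6, 6.1, 3.9]


def column_means(X):
    return [sum(col) / len(X) for col in zip(*X)]


def fit_ridge(X, y, lam=0.1, lr=0.05, epochs=5000):
    """Ridge regression on centred data, fitted by batch gradient descent."""
    n, p = len(X), len(X[0])
    x_bar, y_bar = column_means(X), sum(y) / n
    Xc = [[v - m for v, m in zip(row, x_bar)] for row in X]  # centred markers
    yc = [t - y_bar for t in y]                              # centred trait
    beta = [0.0] * p
    for _ in range(epochs):
        resid = [sum(b * v for b, v in zip(beta, row)) - t for row, t in zip(Xc, yc)]
        # Gradient of (1/2n)*sum(resid^2) + (lam/2)*sum(beta^2)
        grad = [sum(r * row[j] for r, row in zip(resid, Xc)) / n + lam * beta[j]
                for j in range(p)]
        beta = [b - lr * g for b, g in zip(beta, grad)]
    intercept = y_bar - sum(b * m for b, m in zip(beta, x_bar))
    return intercept, beta


def predict(row, intercept, beta):
    return intercept + sum(b * v for b, v in zip(beta, row))


intercept, beta = fit_ridge(train_X, train_y)
train_mse = sum((predict(r, intercept, beta) - t) ** 2
                for r, t in zip(train_X, train_y)) / len(train_y)

candidate = [2, 1, 0, 1]  # hypothetical unphenotyped line
print("estimated marker effects:", [round(b, 3) for b in beta])
print("training MSE:", round(train_mse, 4))
print("predicted trait value:", round(predict(candidate, intercept, beta), 3))
```

The penalty shrinks all marker effects toward zero, which is what allows such models to be fitted when markers greatly outnumber phenotyped plants. Training error is reported here only for brevity; a real comparison with ML models would use cross-validated prediction accuracy.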

An ensemble of RF and SVM was implemented to improve genotype-phenotype classification using manually derived root trait datasets. The combined model accurately identified the most distinguishing root traits and the corresponding cultivar differentiation. The model's performance demonstrates the potential of ML approaches for unbiased cultivar classification and trait selection.92

Additionally, predictive models have aided the integration of additive and dominance effects in GWAS and have enhanced the prediction of complex agronomic traits in polyploid plant species. For instance, one study demonstrated the feasibility of genome-wide prediction of agronomic traits in potato, even though it is an autotetraploid food crop. It also showed that genome-wide prediction is viable for selecting breeding values in elite germplasm with substantial non-additive genetic variance.93
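
The distinction between additive and dominance effects mentioned above can be illustrated with a simple genotype coding scheme. The sketch below assumes a diploid, biallelic marker for simplicity; autotetraploids such as potato have five dosage classes and admit several dominance parameterizations, which are not reproduced here. The effect sizes (mu, a, d) and function names are hypothetical.

```python
def encode(dosage):
    """Return (additive, dominance) covariates for a diploid biallelic SNP."""
    if dosage not in (0, 1, 2):
        raise ValueError("diploid dosage must be 0, 1, or 2")
    additive = dosage                     # count of the alternate allele
    dominance = 1 if dosage == 1 else 0   # flags heterozygotes
    return additive, dominance


def genotypic_value(dosage, mu, a, d):
    """Genotypic value under a simple model: mu + a*additive + d*dominance."""
    add, dom = encode(dosage)
    return mu + a * add + d * dom


# Hypothetical effects: baseline 10, additive effect 2 per allele, dominance 1.5.
values = [genotypic_value(g, mu=10.0, a=2.0, d=1.5) for g in (0, 1, 2)]
print(values)
```

With a purely additive model the heterozygote would sit exactly midway between the two homozygotes (12.0 here); the dominance covariate captures its departure from that midpoint, which is the non-additive variance a prediction model must account for.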

3.3 Phenomics

Plant phenomics is the systematic study of plant phenotypes.28 In recent years, plant field phenotyping has attracted considerable attention because of the possibility of high-throughput analysis of crop fields.94 The application of machine learning methods, together with technological developments in image analysis, has improved the quantitative assessment of crop traits.92,94,95 For instance, CNN-based detection and analysis of wheat spikes using wheat field trial images captured over one planting season achieved an average accuracy of 88 to 94% across diverse groups of test images. This high accuracy suggests that CNNs are a robust model for genome-based selection and prediction in plant breeding.96 Also, the RF algorithm was used in plant image segmentation involving the acquisition and processing of several plant image samples.97 The predictions made by the model enabled the discovery of various parameters relevant to plant growth.94

3.4 Stress resilience phenotyping

ML has been exploited in identifying favorable agronomic traits, including abiotic and biotic stress resistance. ML algorithms are integrated with conventional statistical methodologies to optimize the accuracy of plant stress prediction and detection.96 For instance, 25,000 images of soybean leaflets exposed to various diseases and nutritional perturbations were used to develop a convolutional neural network (CNN) that can infer the image features of the disease types and nutrient deficiencies at high resolution. The prediction accuracy of the ML framework was very close to that of human expert diagnosis. The model can also be adapted to identify, classify, and quantify induced stresses in other plants.95

Random forest (RF) has been used to predict metabolite and transcript markers for drought tolerance using datasets from field trials of drought-stressed plants. The low error rate of the model suggests that it could serve as an alternative for accurate prediction and identification of molecular markers.98 RF was also used to identify suitable feature combinations for phenotypic trait prediction using data derived from various agro-management treatment experiments. This approach achieved optimal prediction accuracy and improved plant breeding strategies by enabling optimal allocation of stress management resources.99 Sanz-Carbonell and colleagues used deep sequencing and computational approaches such as PCA and clustering analysis to infer the miRNA-mediated regulatory network of biotic and abiotic stress responses. The study used 24 miRNAs, all of which are known to alter their expression significantly under stressful conditions. The analysis suggested that the target genes of miRNAs down-regulated under stress conditions contribute to the plant stress response, whereas miRNAs that are up-regulated control genes associated with growth and development.67 Soybean fields were screened for tolerance to iron deficiency chlorosis (an abiotic stress in soybean) using linear discriminant analysis (LDA) and SVM. The phenotypic data obtained from soybean fields were used to train the models and to predict iron deficiency chlorosis stress in soybean. This ML application has helped evaluate the severity of stress in soybean production in real time.95

3.5 Plant–pathogen interaction and disease prediction

Plant diseases and pests pose a significant threat to agriculture. Early identification of plant diseases and pests aids the development of effective treatment strategies and mitigates economic losses.100 Diverse ML approaches for precise disease recognition and prediction have been implemented in plant populations.101,102 Neural networks (NNs) have achieved impressive results in image-based plant disease prediction. In one study, a deep convolutional network was used to build a leaf image classification model for disease recognition. The model showed high predictive performance in distinguishing plant leaves from their surroundings and in recognizing 13 types of plant disease as distinct from healthy leaves.103 In another approach, a heterogeneous ensemble of deep neural network models was used to detect tomato diseases and pests in images collected on-site by imaging devices of varying resolutions. The ensemble handled the complexity of the plant's surroundings and recognized nine different diseases and pests.100,104 Deep CNNs are therefore promising for the automatic classification and detection of disease traits from leaf images. CNNs have also performed well in studies of plant–pathogen interaction and of pest and disease recognition, including prediction of pest and disease occurrence in cotton,105 recognition of rice diseases and pests,106 rice blast prediction,107,108 and image-based detection of potato tuber disease.109 The CNN is thus a high-performing method for detecting plant diseases that can be implemented and optimized for practical applications.
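
To make the convolutional step concrete, the following minimal sketch runs one convolution, ReLU, and max-pooling pass over a toy grayscale leaf patch in plain Python. The pixel values, the hand-picked 3×3 kernel, and the function names are illustrative assumptions rather than details of the cited studies; a trained CNN learns many such kernels from labelled images and stacks several layers of them.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, the core operation of a CNN layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def relu(fmap):
    """Keep positive responses, zero out the rest."""
    return [[max(0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    return [
        [
            max(fmap[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]

# Toy 6x6 leaf patch (hypothetical): 0 = healthy tissue, 1 = lesion pixel.
leaf_patch = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# Hand-picked centre-surround kernel that responds to small spots.
spot_kernel = [
    [-1, -1, -1],
    [-1,  8, -1],
    [-1, -1, -1],
]

feature_map = relu(conv2d(leaf_patch, spot_kernel))
pooled = max_pool(feature_map)
```

The feature map is non-zero only where the kernel is centred on the lesion, which is the kind of localized evidence later layers combine into a disease class. A uniformly coloured patch produces no response because the kernel weights sum to zero.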

SVM has also been used for weather-based rice blast prediction and has proven suitable for plant disease forecasting, with high predictive accuracy. An SVM-based web server for rice blast prediction, described as the first of its kind, was also developed; plant scientists and farmers have used it to support their decision-making.110 In the pixel-wise identification and quantification of powdery mildew-diseased barley tissue, SVM classification of multispectral images of diseased plants was used to detect hypersensitive response spots. The SVM enabled precise, automatic characterization of the interaction between barley and powdery mildew.111
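
A weather-based SVM classifier can be sketched as follows. This is an illustration under stated assumptions, not the cited model: it trains a linear SVM by full-batch subgradient descent on the regularised hinge loss, whereas published forecasting models may use kernels and more weather variables. The two features, their assumed ranges, the labelling rule ("humid and cool favours blast"), and every function name are hypothetical and chosen only so that the synthetic classes are separable.

```python
import random

def scale(temp_c, humidity_pct):
    """Map raw readings to roughly [-1, 1] (assumed ranges: 15-35 °C, 50-100 % RH)."""
    return [(temp_c - 25.0) / 10.0, (humidity_pct - 75.0) / 25.0]

def make_synthetic_data(n=200, seed=1):
    """Synthetic days labelled +1 (outbreak) or -1 (no outbreak) by a made-up rule."""
    rng = random.Random(seed)
    X, y = [], []
    while len(X) < n:
        x = scale(rng.uniform(15, 35), rng.uniform(50, 100))
        score = x[1] - x[0]           # hypothetical rule, not an agronomic model
        if abs(score) < 0.2:          # leave a margin between the classes
            continue
        X.append(x)
        y.append(1 if score > 0 else -1)
    return X, y

def train_linear_svm(X, y, lam=0.001, lr=0.1, epochs=500):
    """Full-batch subgradient descent on the L2-regularised hinge loss."""
    w, b = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:            # sample violates the margin
                for j, xj in enumerate(xi):
                    grad_w[j] -= yi * xj / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

X, y = make_synthetic_data()
w, b = train_linear_svm(X, y)
train_accuracy = sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(y)
humid_cool_day = predict(w, b, scale(18, 95))   # expected to fall on the outbreak side
dry_hot_day = predict(w, b, scale(33, 55))      # expected to fall on the no-outbreak side
```

The learned weights can be read directly: under this synthetic rule the humidity weight comes out positive and the temperature weight negative. Real forecasting would need field observations and evaluation on held-out seasons.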

Recently, a data-driven ML approach named ApoplastP was proposed. ApoplastP is built on an RF classifier and has shown high performance in predicting protein localization to the plant apoplast. Differences in composition between apoplastic and intracellular plant proteins were previously poorly characterized; ApoplastP made it possible to explore them. The plant apoplast is integral to plant–pathogen interactions, transport, and intercellular signaling. Hence, integrating and optimizing ML algorithms for apoplastic localization prediction will aid functional studies and help predict whether an effector will localize to the apoplast or enter plant cells.112
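
One simple way to turn a protein sequence into classifier input is its amino acid composition. The sketch below computes that 20-dimensional vector and compares two sequences; it is offered as an assumption about what sequence-derived features can look like, not as the feature set ApoplastP actually uses. Both toy sequences are invented for illustration and the function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(sequence):
    """Fraction of each of the 20 standard amino acids in a protein sequence."""
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}

def composition_difference(seq_a, seq_b):
    """Per-residue composition gap (seq_a minus seq_b), largest absolute gap first."""
    comp_a, comp_b = aa_composition(seq_a), aa_composition(seq_b)
    diffs = {aa: comp_a[aa] - comp_b[aa] for aa in AMINO_ACIDS}
    return sorted(diffs.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Hypothetical toy sequences, not real proteins.
apoplastic_like = "MKCSCGTCNSSCGCKT"
intracellular_like = "MEELKKLEEALKKLEE"

# 20-dimensional feature vector that a classifier such as RF could consume.
features = list(aa_composition(apoplastic_like).values())

# Residue whose frequency differs most between the two toy sequences.
top_residue, top_gap = composition_difference(apoplastic_like, intracellular_like)[0]
```

In practice such vectors would be computed for many labelled proteins and passed to the classifier, often alongside physicochemical descriptors.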

RF has also been used to build an inter-species protein–protein interaction (PPI) prediction model from Arabidopsis–pathogen PPI data obtained both experimentally and from public databases (UniProt). A critical assessment of model performance showed that combining random forest with linear statistical methods, using sequence information and network attributes as model features, gave a substantial and robust improvement in performance.24

In addition, RF classification has been used to explore the potential of protein biomarkers in precision breeding, using assays of 104 potato (Solanum tuberosum) peptide biomarkers. The peptides were selected using diagonal linear discriminant analysis, bagging, principal component analysis (PCA), and SVM, and were then classified with RF classifiers. Applying these ML algorithms to potato leaf secretome data helped identify Phytophthora infestans resistance in leaves and tubers and its effect on plant yield.109 Early disease detection enables farmers to apply timely and targeted crop protection strategies. With ML, researchers have substantially improved the accuracy of object detection and recognition systems.

3.6 Challenges and future outlook

ML applications in plant genomics and agronomics have contributed substantially to the efficient breeding of crops with desirable agronomic traits, plant phenotyping, genetic trait prediction, and precise disease prediction in crops such as rice, soybean, maize, and beans.72,95,106,110 However, several limitations remain. First, the black-box nature of some sophisticated ML algorithms hinders interpretation. The plant research community is often more interested in the biological implications of a prediction than in the accuracy of the predictive model. Hence, model output needs further processing and careful interpretation in light of established biological knowledge. Second, the high dimensionality of omics datasets poses challenges such as multicollinearity, overfitting, and sparsity, which are difficult to avoid. Although contemporary ML methods and large sample sizes can partially alleviate these problems, model accuracy can be further improved with fine-tuning, data augmentation, and optimization techniques.107 Third, data integration from various sources is necessary for GS-assisted breeding and other trait improvement approaches,113 and simultaneous analysis of multiple omics datasets can advance our understanding of complex biological phenomena.78,79 Another challenge is the limited and inconsistent phenotypic information on plant–pathogen interactions. The ML models used in plant disease recognition can be extended by enriching plant disease databases with plant–pathogen interaction phenotype data. Developing more robust classification algorithms that cover more disease classes will improve plant disease recognition and forecasting.103,105,106,108 Finally, a comprehensive plant database must be constructed to facilitate comparative studies and promote research collaboration on critical plant science problems.
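
The overfitting risk in high-dimensional omics data can be illustrated with a small experiment in plain Python. The sketch below is an assumption-laden toy, not an analysis from the cited work: it builds a pure-noise matrix with far more features than samples, fits a nearest-centroid classifier, and compares accuracy on the fitting data with k-fold cross-validated accuracy. The matrix size, the classifier, and the function names are all hypothetical choices.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k near-equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def fit_centroids(X, y):
    """Nearest-centroid 'model': one mean vector per class."""
    return {label: centroid([x for x, lab in zip(X, y) if lab == label])
            for label in set(y)}

def predict(centroids, x):
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def accuracy(centroids, X, y):
    return sum(predict(centroids, x) == lab for x, lab in zip(X, y)) / len(y)

def cross_validated_accuracy(X, y, k=5, seed=0):
    """Each sample is predicted by a model that never saw it during fitting."""
    correct = 0
    for fold in k_fold_indices(len(y), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, X[i]) == y[i] for i in fold)
    return correct / len(y)

# Pure-noise "omics" matrix: 20 samples x 500 features, labels unrelated to features.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(500)] for _ in range(20)]
y = [i % 2 for i in range(20)]

train_acc = accuracy(fit_centroids(X, y), X, y)   # scored on the data used for fitting
cv_acc = cross_validated_accuracy(X, y)           # scored on held-out folds
```

Because the labels carry no signal, a high `train_acc` here reflects memorised noise, while `cv_acc` is the more honest estimate. The same held-out evaluation applies when any feature selection step is involved: selection must be repeated inside each training fold rather than done once on the full dataset.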

4. Conclusion

Machine learning has shown considerable promise for analyzing very large, high-dimensional datasets, although its application in plant molecular studies is still limited. An in-depth understanding of ML models will encourage their use in the analysis of plant biological data. As sequenced plant genome data continue to accumulate, ML will accelerate all fields of plant genomic research, including identifying genes associated with biotic and abiotic stress resistance and other genes with significant functions, understanding gene regulation mechanisms, exploring the genetic architecture of plant genomes, and estimating breeding values. These advances would help agricultural researchers improve the quality and yield of crops and strengthen their tolerance to abiotic and biotic stress and other threats to plant health.

Data availability statement

Extended data

OSF: Extended data for “Machine learning algorithms: their applications in plant omics and agronomic traits’ improvement”, https://doi.org/10.17605/OSF.IO/TE6GC.14

Files included:

Supplementary Table 1. Published sequenced plant genomes. Hundreds of plant genomes have been sequenced and published since 2000. The statistics for each genome are taken from the publication, despite several model plants having significant updates to genome assemblies and gene counts. NA, data not available in publication; Mb, megabases; kb, kilobases.

Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).

How to cite this article
Isewon I, Apata O, Oluwamuyiwa F et al. Machine learning algorithms: their applications in plant omics and agronomic traits’ improvement [version 1; peer review: 3 not approved]. F1000Research 2022, 11:1256 (https://doi.org/10.12688/f1000research.125425.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Reviewer Report 15 Dec 2023
Haixiao Hu, Plant Sciences, University of California Davis, Davis, California, USA
Not Approved
The authors endeavored to review the applications of machine learning algorithms in plant omics and agronomic trait improvement. In Section 1, the authors enumerated various genomics databases and identified five subfields within plant omics technologies. In Section 2, the utilization ...
How to cite this report
Hu H. Reviewer Report For: Machine learning algorithms: their applications in plant omics and agronomic traits’ improvement [version 1; peer review: 3 not approved]. F1000Research 2022, 11:1256 (https://doi.org/10.5256/f1000research.137728.r220199)

Reviewer Report 24 Nov 2023
Hao Tong, Max Planck Institute of Molecular Plant Physiology, Potsdam, Germany
Not Approved
The review paper prepared by Isewon et al. tried to summarize how machine learning models applied to agriculture. This manuscript is not well-structured, many sentences are unclear for me. There are many review papers on this topic in different particular ...
How to cite this report
Tong H. Reviewer Report For: Machine learning algorithms: their applications in plant omics and agronomic traits’ improvement [version 1; peer review: 3 not approved]. F1000Research 2022, 11:1256 (https://doi.org/10.5256/f1000research.137728.r220205)

Reviewer Report 29 Sep 2023
Aalt-Jan van Dijk, Bioinformatics Group, Department of Plant Sciences, Wageningen University and Research, Wageningen, The Netherlands
Not Approved
A variety of reviews on omics data analysis and machine learning (ML) in the fields of plant science and plant breeding have recently been published. It is unclear where exactly this manuscript positions itself with respect to these existing reviews. ...
How to cite this report
van Dijk AJ. Reviewer Report For: Machine learning algorithms: their applications in plant omics and agronomic traits’ improvement [version 1; peer review: 3 not approved]. F1000Research 2022, 11:1256 (https://doi.org/10.5256/f1000research.137728.r205805)