Research Article

Improving genomic prediction of rhizomania resistance in sugar beet (Beta vulgaris L.) by implementing epistatic effects and feature selection

[version 2; peer review: 1 approved with reservations, 3 not approved]
PUBLISHED 28 Aug 2024

This article is included in the Genomics and Genetics gateway.

This article is included in the Bioinformatics gateway.

This article is included in the Plant Computational and Quantitative Genomics collection.

Abstract

Background

Rhizomania is considered the most important disease of sugar beet (Beta vulgaris L.). No plant protection measure is available against it, leaving plant breeding as the only defence strategy at the moment. Five resistance genes have been detected on the same chromosome, and further studies suggested that these might be different alleles at two resistance clusters. Nevertheless, it has been postulated that rhizomania resistance might be a quantitative trait with multiple unknown minor resistance genes. Here, we present a first attempt at genomic prediction of rhizomania resistance in a population that carries resistances at the two known resistance clusters. The sugar beet population was genotyped using single nucleotide polymorphism (SNP) markers.

Methods

First, genomic prediction was performed using all SNPs. Next, we calculated the variable importance for each SNP using machine learning and performed genomic prediction by including the SNPs incrementally in the prediction model based on their variable importance. Using this method, we selected the optimal number of SNPs that maximised the prediction accuracy. Furthermore, we performed genomic prediction with SNP pairs. We also performed feature selection with SNP pairs using the information about the variable importance of the single SNPs.

Results

Of the four methods investigated, feature selection with SNP pairs led to the highest prediction accuracy. These results lead to the conclusion that more than the two known resistance clusters are involved in rhizomania resistance and that genetic interactions affect rhizomania resistance. Finally, we analysed which SNPs were repeatedly selected in the feature selection process and identified four SNPs, two of which are located on chromosomes that were previously not associated with rhizomania resistance.

Keywords

Epistasis, genomic prediction, machine learning, rhizomania, resistance breeding, Beet necrotic yellow vein virus, variable importance

Amendments from Version 1

One of the key improvements made is the enhancement of the language throughout the manuscript. This was done to improve clarity and readability, ensuring that our findings are communicated as effectively as possible.

Additionally, in response to the reviewers' suggestions, we conducted a further analysis to determine which SNPs were selected at least 50% of the time by feature selection. This analysis utilised the results from genomic prediction to identify SNPs associated with rhizomania resistance. With this approach, we identified four SNPs, two of which are located on chromosome 3. Interestingly, we also identified one SNP on chromosome 2 and another on chromosome 5 which had not previously been linked to rhizomania resistance. Our findings suggest that although the individual effects of these SNPs are modest, they significantly influence resistance when combined with the SNPs on chromosome 3. Thus, our research demonstrates that a non-additive interaction between SNPs on different chromosomes is affecting rhizomania resistance.

We believe these revisions have strengthened our manuscript and hope they provide greater insight into our research.

See the authors' detailed response to the review by Daniela Holtgräwe
See the authors' detailed response to the review by Muhammad Massub Tehseen
See the authors' detailed response to the review by J Mitchell McGrath

Introduction

Sugar beet (Beta vulgaris L.) is an important crop to secure production of white sugar, especially in industrialised countries.1 Globally, sugar beet accounts for approximately 20% of sugar production.2 In addition to achieving high sugar yields, resistance to diseases, of which rhizomania counts as the most important one, is the main goal of sugar beet breeding.3

Rhizomania is caused by the Beet necrotic yellow vein virus (BNYVV)4 and is transmitted via the fungus Polymyxa betae Keskin.5,6 Severe infection with rhizomania can reduce sugar yield by up to 90%.7 Moreover, Abe and Tamada (1986) have shown that the virus can persist in resting spores of P. betae for over fifteen years, making decontamination through an extended crop rotation nearly impossible.8 Furthermore, there is no pesticide available for plant protection,9 leaving resistance breeding as the only defence strategy at the moment.10

Since the first observation of rhizomania in 1951 in northern Italy, rhizomania has spread globally and is now present in all major sugar beet growing regions.11 While strain groups of BNYVV with four ribonucleic acid (RNA) strands are spread globally,12 BNYVV strains with five RNA strands have been found in certain regions in France,13 Japan,14 the UK,15 Kazakhstan,16 and Turkey.7 Comparative studies have demonstrated that pathotypes of BNYVV with five RNA strands showed significantly higher levels of infection in partially resistant sugar beet varieties than pathotypes with four RNA strands.17

The first breeding projects against rhizomania started in 1970 (Ref. 18) and resulted in the publication of three resistance genes in 1987 called Rizor,19 Holly,20 and WB42.20 Nevertheless, further analyses of the resistance genes Rizor and Holly indicated that these are probably the same gene, henceforth called Rz1.21 The resistance gene WB42, often referred to as Rz2, was however assumed to be a further resistance gene independent of Rz1, with an approximate distance of 20 cM between Rz1 and Rz2.22 Recent studies have confirmed the presence of the Rz2 resistance gene in wild sugar beet relatives and identified a stop codon in Rz2 in susceptible genotypes which is absent in resistant genotypes.23 While the resistance gene Rz1 is specific for BNYVV, recent studies show that Rz2 also provides resistance against the Beet soilborne mosaic virus and the Beet soilborne virus by recognising the triple gene block protein 1.24 In subsequent years, three further resistance genes were published called Rz3,25 Rz4,26 and Rz5.27

Although five resistance genes against rhizomania have been published, doubts have been raised as to whether all resistance genes are in fact separate genes or rather alleles of the same genes. All five resistance genes were located on chromosome 3 (Refs. 18, 26, 27), where mainly two clusters emerged.28 McGrann et al. (2009) assumed that the resistance against rhizomania may be mainly explained by two loci, with the first locus being represented by Rz1, Rz4, and Rz5, and the second locus being represented by Rz2 and Rz3.12 Although only two resistance clusters against rhizomania are known, it is assumed that rhizomania resistance is a quantitative trait caused by multiple loci with different effects, which have not yet been identified.18

It has been suggested that asymmetric variation in quantitative traits may be due to epistasis,29 defined as non-additive gene interaction.30 Analysing interactions involving more than two genes is challenging due to the computational complexity and the requirement for sufficiently large samples in each subgroup.31 Although calls have been made to analyse epistasis in complex trait studies32,33 and epistasis has been analysed for numerous traits in sugar beet,34 no such study has been conducted for rhizomania resistance to date.

It is generally assumed that a plant’s resistance towards diseases is quantitative and caused by a complex network of multiple loci.35,36 In such cases, genomic prediction is a useful tool to predict an individual’s resistance towards the disease. Such studies have been performed, for example, in soybean,37 barley,38 rapeseed,39 rice,40 wheat,41 and maize.42,43 Although rhizomania resistance is believed to be a complex trait caused by multiple loci, no genomic prediction of rhizomania resistance has been published so far. Here, we present the first study of genomic prediction of rhizomania resistance.

Methods

Experimental design and data preparation

The sugar beet population for this trial was developed by crossing two sugar beet lines and self-pollinating the resulting hybrids twice. This process resulted in a population of 155 S2 plants. These plants were genotyped using a customised SNP chip and subsequently self-pollinated. For each plant, 15 seeds were used as genotypes for this trial. Analysis of the SNP chip data revealed that each of the 155 genotypes was homozygous for resistance at both Rz1 and Rz2. Additionally, the population was expanded by including 15 seeds from a sugar beet line homozygous for resistance at Rz1 but not at Rz2.

Plants were grown for ten weeks in the greenhouse in soil infested with BNYVV, pathotype P. This variant of BNYVV contains five RNA strands (Ref. 45) and is more aggressive than the variants with four RNA strands.17 After ten weeks, plants were removed from the soil and plant sap was extracted from the lateral roots. Afterwards, the optical density (OD) value of each sample was measured using the double antibody sandwich enzyme-linked immunosorbent assay (DAS-ELISA). The OD values were measured after 60, 90, and 120 minutes using the Infinite F50® (Tecan Group AG, Männedorf, Switzerland) at a wavelength of 405 nm. Harvest, sample preparation, and the DAS-ELISA test followed the protocol described in Ref. 46.

Although DAS-ELISA is a commonly used tool to measure the concentration of BNYVV in samples from sugar beet,7,10,23 the OD values it yields do not directly measure the virus concentration.47 To estimate the virus concentrations from the raw OD values as well as to reduce measurement errors for each 96-well plate, we transformed the non-normally distributed raw data to normally distributed data using an inverse logistic regression model.48 The logistic regression model was derived using a serial dilution with 12 samples on each 96-well plate. The transformation followed the protocol in Ref. 48, with the adjustment that OD values were measured after 60, 90, and 120 minutes. At each of the three time points, the relationship between OD values and virus concentration was modelled using the logistic regression model in equation 1:

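(1)
\[
\widehat{OD}_i = \widetilde{bc} + \frac{tl - \widetilde{bc}}{\left(1 + 2^{\,S\,(\mathrm{ld}\, I - \mathrm{ld}\, C_i)}\right)^{A}},
\qquad
\mathrm{ld}\, \hat{C}_i = \mathrm{ld}\, I - \frac{1}{S}\, \mathrm{ld}\!\left[\left(\frac{tl - \widetilde{bc}}{OD_i - \widetilde{bc}}\right)^{1/A} - 1\right]
\]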

with OD_i being the OD value of sample i, C_i being the virus concentration of sample i, bc̃ being the median of the buffer controls on each 96-well plate, tl being the technical limit of the machine, A describing the asymmetry of the curve, I being the relative virus concentration (C_i) at the inflection point if A = 1, S being the slope at the inflection point, and ld denoting the binary logarithm. Both A and S can be set freely with a lower limit of zero. Since the OD values were measured at three time points, each sample provided three transformed values. Because the transformed data can be assumed to be normally distributed, the mean of these three transformed values was used as the response variable to reduce technical measurement error at the individual time points. Subsequently, the mean of these values was calculated over all plants descended from the same parent. Therefore, if no plants died during the trial, the phenotypic data point for each genotype was the mean of 15 plants.
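The transformation can be illustrated with a minimal Python sketch. It assumes that equation 1 has the form OD = bc + (tl − bc) / (1 + 2^(S (ld I − ld C)))^A, which is our reading of the model; the function and parameter names (od_from_conc, log2_conc_from_od, infl, slope, asym) are hypothetical and do not come from the published R script.

```python
import math

def od_from_conc(conc, bc, tl, infl, slope, asym):
    """Forward curve: expected OD for a relative virus concentration."""
    exponent = slope * (math.log2(infl) - math.log2(conc))
    return bc + (tl - bc) / (1.0 + 2.0 ** exponent) ** asym

def log2_conc_from_od(od, bc, tl, infl, slope, asym):
    """Inverse curve: binary logarithm of the estimated concentration."""
    if not bc < od < tl:
        raise ValueError("OD must lie strictly between bc and tl")
    ratio = (tl - bc) / (od - bc)
    return math.log2(infl) - math.log2(ratio ** (1.0 / asym) - 1.0) / slope

def transformed_response(ods, params_per_timepoint):
    """Mean of the transformed values over the measurement time points."""
    values = [log2_conc_from_od(od, **p)
              for od, p in zip(ods, params_per_timepoint)]
    return sum(values) / len(values)

# Illustrative parameters only; real values come from the serial dilution.
params = dict(bc=0.1, tl=4.0, infl=1.0, slope=1.5, asym=1.0)
od = od_from_conc(0.25, **params)
print(round(log2_conc_from_od(od, **params), 6))  # -2.0, i.e. ld(0.25)
```

Note that the inverse is undefined for OD values at or beyond the technical limit, so such readings would need separate handling in practice.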

After the transformed values had been averaged, the SNP data were prepared. SNPs with missing values were removed from the data set. Moreover, redundant SNPs were removed through linkage disequilibrium (LD) pruning: whenever two SNPs were correlated with an r2 above 0.95, one of them was removed to ensure that epistasis results were not confounded by LD.33,49 Furthermore, SNPs with a major allele frequency of 0.95 or higher were removed, as recommended in Refs. 50, 51 for genomic prediction studies, so that only SNPs with sufficient genetic variation in the population remained in the data set. After the final filtering step, 9,127 SNPs remained. Finally, these SNPs were recoded as 0 (homozygous major allele), 1 (heterozygous), and 2 (homozygous minor allele). This coding approach is recommended for analysing genotypic and additive genetic models.52 All filtering steps and the SNP recoding were performed using PLINK v1.90b6.10 (Ref. 53).
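The two filters can be sketched in Python as follows. This is a simplified illustration, not PLINK's algorithm: PLINK prunes within sliding windows, whereas the sketch greedily compares each SNP with all SNPs kept so far. The function names and the toy data are hypothetical.

```python
def major_allele_frequency(genotypes):
    """Frequency of the more common allele, given 0/1/2 minor-allele counts."""
    minor = sum(genotypes) / (2 * len(genotypes))
    return max(minor, 1 - minor)

def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no correlation information
    return sxy * sxy / (sxx * syy)

def filter_snps(snps, max_major_freq=0.95, max_r2=0.95):
    """snps: dict of SNP name -> list of 0/1/2 codes without missing values."""
    kept = {}
    for name, codes in snps.items():
        if major_allele_frequency(codes) >= max_major_freq:
            continue  # too little variation in the population
        if any(r_squared(codes, other) > max_r2 for other in kept.values()):
            continue  # redundant with a SNP that was already kept
        kept[name] = codes
    return kept

toy = {
    "s1": [0, 1, 2, 0, 1, 2],
    "s2": [0, 1, 2, 0, 1, 2],  # identical to s1, removed by LD pruning
    "s3": [0, 0, 0, 0, 0, 0],  # monomorphic, removed by frequency filter
    "s4": [2, 0, 1, 1, 0, 2],
}
print(list(filter_snps(toy)))  # ['s1', 's4']
```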

Genomic prediction and feature selection using single SNPs

After the data were prepared, genomic prediction was performed using single SNPs. A total of 125 genotypes (representing 80% of the population) were randomly chosen as the training population, and the remaining 31 genotypes formed the test population. This process, based on experimental designs used in previous studies involving feature selection,54 was repeated ten times. Subsequently, genomic prediction was performed using random forest, as recommended in the literature for genomic prediction in sugar beet.55 To do so, the R package ranger, version 0.14.1, was used with default settings.56
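The repeated random split can be sketched as below. The total of 156 genotypes is an assumption derived from the stated 125 training and 31 test genotypes; the function name and the seed are hypothetical.

```python
import random

def repeated_splits(genotype_ids, n_train=125, n_repeats=10, seed=1):
    """Return a list of (train_ids, test_ids) for independent random splits."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    ids = list(genotype_ids)
    splits = []
    for _ in range(n_repeats):
        train = set(rng.sample(ids, n_train))
        test = [g for g in ids if g not in train]
        splits.append((sorted(train), test))
    return splits

splits = repeated_splits(range(156))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 125 31
```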

Prediction accuracy was assessed by predicting the test data set using a model derived from the training data set. Subsequently, the coefficient of determination (R2) was used to compare the predicted and the observed values in the test data set. The coefficient of determination is defined as the proportion of the total variability that is explained by the model57 and has been used in previous studies as a measure of prediction accuracy.58,59
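As an illustration, one common way to compute the coefficient of determination is one minus the ratio of residual to total sum of squares. The sketch below assumes this definition; the source does not state whether this form or the squared correlation between predicted and observed values was used.

```python
def r2_score(observed, predicted):
    """Coefficient of determination as 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - ss_res / ss_tot

print(r2_score([1, 2, 3], [1, 2, 3]))  # 1.0, perfect prediction
print(r2_score([1, 2, 3], [2, 2, 2]))  # 0.0, no better than the mean
```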

To perform feature selection, the variable importance of each SNP was estimated using the function Boruta from the R package Boruta, version 7.0.0 (Ref. 60). Boruta runs multiple iterations of random forest to assess the importance of each variable (in this context, SNPs). During these iterations, the importance of a variable is determined by measuring the decrease in prediction accuracy or increase in prediction error when the values of that variable are permuted or randomly shuffled. The larger the decrease in accuracy, the higher the variable's importance. Boruta then compares the importance of each actual variable to the importance of randomly permuted versions (shadow variables).60 This method has been used in earlier studies to evaluate SNP variable importance.58,61,62 We chose the mean variable importance as the estimate of the variable importance per SNP. We used a confidence level of 10⁻¹⁰, 2,000 trees, and a maximum of 200 runs.

After the variable importance per SNP was estimated, feature selection was performed by carrying out genomic prediction with a random forest model that contained only the two SNPs with the highest variable importance. Subsequently, genomic prediction was performed using the three best SNPs, and so on. For each number of SNPs in the prediction model, the prediction accuracy was calculated. Since random forest can lead to different results in prediction accuracy even if the same data were used as training and test data set, prediction accuracy was estimated as the median prediction accuracy from ten repetitions. To prevent over-optimistic values for the prediction accuracy, variable importance was only estimated using the training data set.63 In this way, genomic prediction using feature selection can be compared to genomic prediction using all available SNPs.54,64

Performing the feature selection method as described above led to a variable importance per SNP as well as a prediction accuracy for each prediction model containing the i best SNPs for each of the ten random splits of the data set. In a next step, the prediction accuracy for each number of SNPs was defined as the median from these ten repetitions. Subsequently, the optimal number of SNPs was defined as the number of SNPs that maximised the median prediction accuracy from the ten repetitions.
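The selection of the optimal number of SNPs can be sketched as follows. The callback fit_and_score is a hypothetical placeholder for training a random forest on the given SNPs and returning the R2 on the test data; all names are illustrative and not taken from the published R script.

```python
from statistics import median

def incremental_accuracies(ranked_snps, fit_and_score, start=2):
    """Score models built from the top-k SNPs for k = start, ..., all SNPs."""
    return {k: fit_and_score(ranked_snps[:k])
            for k in range(start, len(ranked_snps) + 1)}

def optimal_subset_size(acc_per_split):
    """acc_per_split: one dict (k -> accuracy) per random split.

    Returns the k with the highest median accuracy and that median.
    """
    sizes = sorted(set.intersection(*(set(d) for d in acc_per_split)))
    medians = {k: median(d[k] for d in acc_per_split) for k in sizes}
    best_k = max(sizes, key=lambda k: medians[k])
    return best_k, medians[best_k]

# Toy accuracies for three splits and subset sizes 2 to 4.
acc = [
    {2: 0.10, 3: 0.30, 4: 0.20},
    {2: 0.20, 3: 0.25, 4: 0.40},
    {2: 0.15, 3: 0.28, 4: 0.10},
]
print(optimal_subset_size(acc))  # (3, 0.28)
```

Taking the median rather than the maximum keeps a single favourable split (such as the 0.40 for four SNPs above) from determining the chosen subset size.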

Genomic prediction and feature selection using SNP pairs

In addition to genomic prediction with single SNPs, genomic prediction was also performed using SNP pairs. Since the genomic data set contained 9,127 SNPs after the data preparation and filtering, theoretically more than 41 million SNP pairs could be created out of these single SNPs. To reduce the resulting data set to a manageable size, PLINK’s epistasis test was performed using default settings, testing the interaction term of each SNP pair for significance at a significance threshold of α = 0.0001 (Ref. 65). The selection of SNP pairs was performed with each training data set individually to prevent bias during the selection process.
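The number of theoretically possible SNP pairs follows from the binomial coefficient "n choose 2", as this short check shows:

```python
import math

n_snps = 9127
n_pairs = math.comb(n_snps, 2)  # unordered pairs of two distinct SNPs
print(n_pairs)  # 41646501
```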

After the SNP pairs were selected using PLINK’s epistasis test, the genotype of each SNP pair was defined using an additive-additive interaction model. In this way, the genotype of each SNP pair was defined as the product of both single SNPs.52 For instance, if any of the single SNPs was homozygous for the major allele (coded as 0 in the single SNPs), the genotype of the SNP pair was defined as 0. This was the case for five of the nine possible genotypes of a SNP pair. The combination of two heterozygous single SNPs would lead to a 1 for the SNP pair, the combination of a heterozygous SNP with a SNP that provides a homozygous minor allele would be 2, and the combination of two SNPs that provide homozygous minor alleles would be 4. The recoding of single SNPs as well as the resulting genotype of the SNP pair is summarised in Table 1.

Table 1. Recoding of two theoretical SNPs as well as the resulting SNP pair according to Ref. 52.

A and B represent major alleles and a and b represent minor alleles for SNP A and SNP B, respectively. The resulting SNP pair corresponds to the product of the single SNPs in an additive-additive SNP-interaction model.

Genotypes   Coding for SNP A   Coding for SNP B   Coding for SNP pair
AA/BB       0                  0                  0
AA/Bb       0                  1                  0
AA/bb       0                  2                  0
Aa/BB       1                  0                  0
Aa/Bb       1                  1                  1
Aa/bb       1                  2                  2
aa/BB       2                  0                  0
aa/Bb       2                  1                  2
aa/bb       2                  2                  4
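The coding in Table 1 can be reproduced with a few lines of Python, in which the code of a SNP pair is simply the product of the two single-SNP codes. The names pair_code and table are illustrative.

```python
def pair_code(code_a, code_b):
    """Additive-additive interaction coding: product of the 0/1/2 codes."""
    return code_a * code_b

# All nine genotype combinations of two SNPs.
table = {(a, b): pair_code(a, b) for a in (0, 1, 2) for b in (0, 1, 2)}

zero_coded = sum(1 for value in table.values() if value == 0)
print(zero_coded)                   # 5 of the 9 genotypes are coded as 0
print(sorted(set(table.values())))  # [0, 1, 2, 4]
```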

After the SNP pairs were recoded, genomic prediction was performed with all SNP pairs that were selected from each training data set. Therefore, a prediction model was derived using random forest with each training data set (as described above with the ranger function with default settings), the phenotypic values of the corresponding test data set were predicted with the prediction model, and prediction accuracy was estimated as R2 between the predicted and the observed values.

However, since PLINK’s epistasis test is based on linear regression and random forest is a method from machine learning, we developed an alternative for selecting SNP pairs from single SNPs based on machine learning methods. To this end, we used the variable importance of each single SNP as provided by the Boruta function and combined the single SNPs with the highest variable importance into all possible SNP pairs. Subsequently, genomic prediction was performed using all SNP pairs created with the best single SNPs and the resulting prediction accuracy was stored. This process was repeated for the three to 200 single SNPs with the highest variable importance. Finally, the number of single SNPs was determined for which the resulting SNP pairs maximised the prediction accuracy. As with the other methods, this method was repeated with each training data set individually to avoid bias, and afterwards the median from the ten repetitions was calculated for each number of analysed SNPs. An R script as well as the data from this trial are provided at https://github.com/tmlange/IFS_SNPpairs.git (Ref. 44) to give researchers the possibility to perform feature selection with SNP pairs based on the variable importance of single SNPs.
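The construction of SNP pairs from the best single SNPs can be sketched in Python as follows. This is an illustration under our own naming, not the published R script; the product coding follows Table 1.

```python
from itertools import combinations

def pair_features(genotypes, ranked_snps, k):
    """Build product-coded features for all pairs among the top-k SNPs.

    genotypes: dict of SNP name -> list of 0/1/2 codes, one per genotype.
    ranked_snps: SNP names sorted by decreasing variable importance.
    """
    features = {}
    for a, b in combinations(ranked_snps[:k], 2):
        features[f"{a}-{b}"] = [x * y
                                for x, y in zip(genotypes[a], genotypes[b])]
    return features

# Toy data: 16 SNPs, each observed in three genotypes.
toy = {f"snp{i}": [0, 1, 2] for i in range(16)}
feats = pair_features(toy, list(toy), 16)
print(len(feats))  # 120 pairs from 16 single SNPs
```

Because k single SNPs yield k(k − 1)/2 pairs, the number of features grows quadratically with k, reaching 19,900 pairs at the upper limit of 200 single SNPs.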

The feature selection with single SNPs and SNP pairs selected a certain number of best SNPs in each of the ten repetitions which were carried out independently from each other. The selected SNPs from each repetition were subsequently compared to assess their consistency across iterations. To evaluate the stability of the selected markers, a count was performed to determine the frequency of SNPs being chosen in these repetitions. SNPs that were selected at least 50% of the time were considered as robustly identified features. In this way, the results from the genomic prediction were used to detect SNPs that are associated with rhizomania resistance.
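Counting how often each SNP was selected can be sketched as below. The function name and the toy input are hypothetical; the 50% threshold is the one stated above.

```python
from collections import Counter

def stable_snps(selected_per_repetition, min_fraction=0.5):
    """Return SNPs selected in at least min_fraction of the repetitions."""
    counts = Counter(snp
                     for selection in selected_per_repetition
                     for snp in set(selection))  # count once per repetition
    threshold = min_fraction * len(selected_per_repetition)
    return sorted(snp for snp, n in counts.items() if n >= threshold)

reps = [["a", "b"], ["a", "c"], ["a", "b"], ["d"]]
print(stable_snps(reps))  # ['a', 'b']
```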

Results

The ELISA data were measured using the Infinite F50® which produces OD values between zero (theoretically minimal absorbance) and four (maximum absorbance).48 The sugar beet population (without the susceptible control) provided raw OD values measured after 60 minutes in the range from 0.1089 to 4, after 90 minutes in the range from 0.1107 to 4, and after 120 minutes in the range from 0.1131 to 4. The transformed data were in the range from -7.06 to 12.33. These results demonstrate the maximum possible variation in virus concentrations that the machine can measure, indicating that the resulting data set provides sufficient variance in resistance levels for genomic prediction.

Genomic prediction and feature selection using single SNPs

The median prediction accuracies for all methods described are presented in Table 2. First, genomic prediction was conducted using all 9,127 single SNPs that remained after filtering. Genomic prediction with these single SNPs across ten random splits of the data set resulted in a median prediction accuracy of R2 = 0.146.

Table 2. Prediction accuracy as median of R2 from the ten repetitions with each of the four methods: Using all single SNPs that were left after filtering, using the 29 SNPs that were assumed to be the optimal subset after feature selection, using all SNP pairs that were left after selection via PLINK’s epistasis test, and using the SNP pairs that result from including the 16 single SNPs with the highest variable importance.

Method                           Single SNPs   SNP pairs
All variables after filtering    0.146         0.191
Subset after feature selection   0.267         0.306

In addition to genomic prediction with all SNPs, incremental feature selection was performed to optimise prediction accuracy by selecting a subset of the most informative SNPs. Figure 1 illustrates the number of SNPs in the prediction model on the X-axis and the corresponding median prediction accuracy from the ten repetitions on the Y-axis. It is evident that prediction accuracy increases steeply with the inclusion of the initial SNPs. However, just above R2=0.25, the prediction accuracy peaks and then gradually decreases, with the decline being less steep than the initial rise. Using this approach, the optimal set of SNPs for genomic prediction was identified. The prediction accuracy was maximised when 29 SNPs were included in the model, resulting in a median prediction accuracy of R2=0.267.

Figure 1. Median of the R2 values from the ten repetitions of genomic prediction using random forest with the 2, … , 9,127 SNPs.

Genomic prediction and feature selection using SNP pairs

Besides genomic prediction and feature selection with single SNPs, similar approaches were applied using SNP pairs. To perform genomic prediction with SNP pairs, PLINK’s epistasis test was performed with default settings for each training data set individually. After filtering via PLINK’s epistasis test, between 46,556 and 87,529 SNP pairs remained, depending on the training data set. Compared to the 41 million theoretically possible SNP pairs, this is a reduction to 0.1% to 0.2%. When genomic prediction was performed with all SNP pairs that were left after filtering, the median prediction accuracy was R2=0.191.

Furthermore, feature selection was performed to identify an optimal subset of SNP pairs for genomic prediction. The best single SNPs, judged by their variable importance, were used to produce SNP pairs, and incremental feature selection was conducted using random forest with the SNP pairs derived from the top 3 to 200 single SNPs. Figure 2 displays the median prediction accuracy from the ten repetitions on the Y axis and the corresponding number of single SNPs that make up the SNP pairs on the X axis.

Figure 2. Median of the R2 values from the ten repetitions of genomic prediction using random forest when the 3, … , 200 best single SNPs are combined to SNP pairs.

One can see that the prediction accuracy increases steeply when a small number of SNP pairs are included in the prediction model. Similar to Figure 1 that displays the prediction accuracy with the single SNPs, the prediction accuracy with the SNP pairs also forms a peak and decreases from there on. One can see that the peak height is slightly above R2=0.3 when SNP pairs are created using the 16 best SNPs.

Considering Figure 2, it appears that the number of SNP pairs has an effect on the resulting prediction accuracy. This suggests that the prediction accuracy might be affected by the number of SNP pairs that is selected via PLINK’s epistasis test. Therefore, we analysed how many SNP pairs were selected via PLINK when the significance threshold was modified. However, the resulting prediction accuracy remained unchanged with different thresholds for PLINK’s epistasis test (tested for thresholds 10⁻², …, 10⁻⁸, data not shown). Consequently, we conclude that although the number of selected SNP pairs can be easily adjusted in PLINK’s epistasis test, this selection does not affect the resulting prediction accuracy.

Finally, we counted how often SNPs were selected in the ten repetitions of the feature selection. The four SNPs “SNP0425”, “SNP2484”, “SNP6428”, and “SNP7343” were selected in at least 50% of the ten repetitions. The SNPs “SNP0425” and “SNP2484” are located on chromosome 3. However, “SNP6428” is located on chromosome 2 and “SNP7343” is located on chromosome 5.

After identifying these four SNPs, two SNP pairs were created as described in Table 1: “SNP2484” (chromosome 3) was combined with “SNP6428” (chromosome 2), and “SNP2484” was combined with “SNP7343” (chromosome 5). Figure 3 displays the virus concentration depending on the four different genotypes resulting from each of the two SNP pairs.

Figure 3. The virus concentration of the plants in the trial based on the different genotypes of the SNP pair “SNP2484-SNP6428” on the left and the SNP pair “SNP2484-SNP7343” on the right.

The genotype of the SNP pair was defined as the additive-additive SNP-interaction model in Ref. 52.

In both graphs in Figure 3, it is evident that genotype 0 (at least one of the two SNPs homozygous for the major allele, see Table 1) resulted in the lowest virus concentrations, while genotype 4 (both SNPs homozygous for the minor allele) led to the highest virus concentrations. Analysing the effect of the genotypes on the virus concentration with an ANOVA produced highly significant p values for both SNP pairs (SNP2484-SNP6428: p = 6.6 × 10⁻⁸; SNP2484-SNP7343: p = 1.1 × 10⁻⁷) and R2 values of R2 = 0.2115 for SNP2484-SNP6428 and R2 = 0.2066 for SNP2484-SNP7343. Thus, more than 20% of the total variability in the data can be explained with each SNP pair individually.
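The R2 of a one-way ANOVA is the between-group sum of squares divided by the total sum of squares. The sketch below illustrates this calculation under the assumption that the SNP-pair genotype was treated as a categorical factor, which the text does not state explicitly; the p value is not computed, since it requires the F distribution.

```python
def anova_r2(values, groups):
    """Share of total variability explained by group membership."""
    grand_mean = sum(values) / len(values)
    by_group = {}
    for value, group in zip(values, groups):
        by_group.setdefault(group, []).append(value)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
                     for vs in by_group.values())
    return ss_between / ss_total

# Toy data: groups are SNP-pair codes, values are virus concentrations.
print(anova_r2([1, 1, 3, 3], [0, 0, 4, 4]))  # 1.0, groups explain everything
print(anova_r2([1, 3, 1, 3], [0, 0, 4, 4]))  # 0.0, groups explain nothing
```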

Discussion

By reducing the number of SNPs to the 29 SNPs with the highest variable importance, we achieved a higher median prediction accuracy compared to the prediction model using all available SNPs. Previous studies on variable importance in genomic prediction have concluded that, although SNP interactions can be detected in random forest algorithms, these interactions can be masked by other variables when working with high-dimensional data.66,67 Therefore, it is possible that SNP interactions were masked when genomic prediction was performed with all single SNPs. Consequently, reducing the number of SNPs allowed these interactions to be more effectively included in the prediction model.

Besides genomic prediction using single SNPs, we also present results from genomic prediction using SNP pairs. Similar to the single SNPs, we performed genomic prediction with all available SNP pairs and conducted feature selection with these pairs. This method reduced the number of SNP pairs to those involving the 16 SNPs with the highest variable importance. Again, prediction accuracy improved when only a subset of all available SNP pairs was used. These results suggest that rhizomania resistance is influenced by interactions between SNP pairs which may be masked when all SNP pairs are included in the prediction model.

Feature selection was used in recent studies to increase the prediction accuracy of genomic prediction models in humans (Ref. 63) and crops.54,58 However, these studies led to heterogeneous results such that no general recommendation can be given regarding feature selection. Here, we show that in the case of rhizomania resistance, prediction accuracy could be increased using feature selection. We postulate that the success of implementing feature selection to improve prediction accuracy might be related to epistatic effects that are masked if a large number of SNPs are included in a prediction model.

Besides improving the prediction accuracy of genomic prediction models, other studies used results from feature selection via genomic prediction to determine the association between certain SNPs and the phenotype.58,68 However, it is more challenging to select certain SNPs using variable importance measures than using the p value from a hypothesis test, as is done in genome-wide association studies. Here, we describe a novel method to select SNPs based on feature selection in genomic prediction, identifying the optimal number of SNPs to maximise prediction accuracy. However, this approach can result in different SNPs being selected in each training data set. We addressed this problem by identifying SNPs selected in multiple training data sets, which underlines the importance of repeating such analyses when using machine learning methods to identify SNPs that are associated with the phenotype.

While it can be argued that it would have been sufficient to use only the SNPs on chromosome 3 when performing a genome-wide association study to analyse each SNP individually, we included all available SNPs to identify potential interactions. This comprehensive approach revealed two SNPs on chromosomes not previously linked to rhizomania resistance. Our findings suggest that although the individual effects of these SNPs are modest, they significantly influence resistance when combined with SNPs on chromosome 3, which is known to be associated with rhizomania resistance. This kind of SNP interaction indicates a non-additive interaction between genes on different chromosomes. These results underline the importance of considering SNPs from various genomic regions when analysing not only the effects of the individual SNPs but also the interactions between them. Furthermore, the results suggest that rhizomania resistance is caused by epistatic effects.

Conclusions

Although rhizomania resistance in sugar beet has been assumed to be a quantitative trait influenced by both major and minor resistance genes, there have been no prior attempts at genomic prediction for this trait. Our study provides the first attempt to predict rhizomania resistance in sugar beet genotypes using a population that carried resistances at both of the known resistance clusters. Our results suggest that genomic prediction of rhizomania resistance is feasible, providing evidence that the genomic architecture of this resistance is likely influenced by more than just the two known resistance clusters.

To perform genomic prediction, we used single SNPs as well as SNP pairs. In the data set analysed here, genomic prediction using SNP pairs led to higher prediction accuracy than genomic prediction using single SNPs. This suggests that epistatic effects might affect rhizomania resistance and that the use of SNP pairs can capture these effects more efficiently in the prediction model. We also used the variable importance of the SNPs for feature selection with both single SNPs and SNP pairs. In both cases, prediction accuracy improved compared to using all available SNPs or SNP pairs. While random forest can detect SNP interactions, such interactions can be masked by other variables in high-dimensional data. Therefore, rhizomania resistance might be best predicted by including SNP interactions in the prediction model and reducing the number of SNPs to prevent masking the interactions.
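As a rough illustration of combining feature selection with SNP pairs, the following Python sketch ranks SNPs by a variable importance score, keeps the top `k`, and builds one joint-genotype feature per pair. The 0/1/2 genotype coding, the joint code `3 * g1 + g2`, the importance values, and all SNP names are assumptions made for this example; the encoding used in the published R script may differ.

```python
from itertools import combinations


def top_snps(importance, k):
    """Return the k SNP names with the highest variable importance."""
    return sorted(importance, key=importance.get, reverse=True)[:k]


def pair_features(genotypes, snps):
    """Build one joint-genotype feature per SNP pair.

    `genotypes` maps a SNP name to a list of 0/1/2 codes, one per plant.
    The joint code 3 * g1 + g2 yields nine classes (0..8) per pair.
    """
    features = {}
    for a, b in combinations(snps, 2):
        features[(a, b)] = [
            3 * ga + gb for ga, gb in zip(genotypes[a], genotypes[b])
        ]
    return features


# Hypothetical importance scores and genotypes for four plants.
importance = {"snp_a": 0.40, "snp_b": 0.05, "snp_c": 0.30, "snp_d": 0.25}
genotypes = {
    "snp_a": [0, 1, 2, 2],
    "snp_b": [1, 1, 0, 2],
    "snp_c": [2, 0, 1, 2],
    "snp_d": [0, 0, 1, 1],
}

selected = top_snps(importance, k=3)
features = pair_features(genotypes, selected)
for pair, codes in features.items():
    print(pair, codes)
```

Because `k` SNPs produce `k * (k - 1) / 2` pairs, restricting the pairs to the highest-ranked single SNPs keeps the number of pair features manageable.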

By analysing which SNPs were consistently selected across different training data sets, we identified four SNPs frequently chosen during feature selection. Two of these SNPs were located on chromosomes not previously associated with rhizomania resistance. The two SNP pairs formed by combining one of the SNPs on chromosome 3, where all known resistance clusters are located, with each of the two SNPs on other chromosomes showed significant differences in virus concentration between genotypes. Each SNP pair alone explained more than 20% of the total variance.
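One common way to express how much of the total variance a SNP pair explains is the ratio of the between-genotype sum of squares to the total sum of squares (eta squared) from a one-way analysis of variance. The Python sketch below assumes this definition, which may differ from the calculation used in the study; the joint genotype codes and virus concentration values are made up for illustration.

```python
from collections import defaultdict


def variance_explained(groups, values):
    """Share of total variance explained by a grouping factor (eta squared)."""
    if len(groups) != len(values) or not values:
        raise ValueError("groups and values must be non-empty and equally long")
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    if ss_total == 0:
        raise ValueError("values have zero variance")
    by_group = defaultdict(list)
    for g, v in zip(groups, values):
        by_group[g].append(v)
    # Between-group sum of squares: group size times squared mean deviation.
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
        for vs in by_group.values()
    )
    return ss_between / ss_total


# Hypothetical joint genotype codes of one SNP pair and virus concentrations.
pair_genotype = [0, 0, 4, 4, 8, 8]
virus_concentration = [1.0, 3.0, 4.0, 6.0, 7.0, 9.0]

print(round(variance_explained(pair_genotype, virus_concentration), 3))  # 0.857
```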

Although the data were not sufficient to pinpoint specific genes for rhizomania resistance, we demonstrated that our method effectively detects interactions between SNPs that would not have been identified using a genome-wide association study analysing each SNP individually. To encourage researchers to perform feature selection with SNP pairs in their own studies, we have published an R script as well as the data from this trial at https://github.com/tmlange/IFS_SNPpairs.git [44].

Data availability

The phenotypic data of all plants used in this trial as well as the SNP data in a recoded form have been published at https://github.com/tmlange/IFS_SNPpairs.git together with the R script to repeat the described method. The genomic position of the SNPs as well as the names of the sugar beet lines used to produce the population of this trial are available upon reasonable request from KWS Saat SE & Co. KGaA.

how to cite this article
Lange TM, Heinrich F, Kopisch-Obuch F et al. Improving genomic prediction of rhizomania resistance in sugar beet (Beta vulgaris L.) by implementing epistatic effects and feature selection [version 2; peer review: 1 approved with reservations, 3 not approved]. F1000Research 2024, 12:280 (https://doi.org/10.12688/f1000research.131134.2)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Key to Reviewer Statuses
Approved - The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.
Approved with reservations - A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.
Not approved - Fundamental flaws in the paper seriously undermine the findings and conclusions.
Version 2
PUBLISHED 28 Aug 2024
Revised
Reviewer Report 27 Nov 2024
Daniela Holtgräwe, Genetics and Genomics of Plants, CeBiTec and Faculty of Biology, Bielefeld University, Bielefeld, Germany 
Approved with Reservations
There is a lot of progress on the manuscript in more or less all addressed points. There are still some problems with the Github entries. The provided link for the SNP calling data is wrong. The link needs to be ...
HOW TO CITE THIS REPORT
Holtgräwe D. Reviewer Report For: Improving genomic prediction of rhizomania resistance in sugar beet (Beta vulgaris L.) by implementing epistatic effects and feature selection [version 2; peer review: 1 approved with reservations, 3 not approved]. F1000Research 2024, 12:280 (https://doi.org/10.5256/f1000research.170607.r317943)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Reviewer Report 17 Sep 2024
Chenggen Chu, USDA-ARS, North Dakota, USA 
Not Approved
The manuscript entitled "Improving genomic prediction of rhizomania resistance in sugar beet (Beta vulgaris L.) by implementing epistatic effects and feature selection" conducted genomic prediction for rhizomania resistance in sugar beet. However, I'm a little confused by the reports due to ...
HOW TO CITE THIS REPORT
Chu C. Reviewer Report For: Improving genomic prediction of rhizomania resistance in sugar beet (Beta vulgaris L.) by implementing epistatic effects and feature selection [version 2; peer review: 1 approved with reservations, 3 not approved]. F1000Research 2024, 12:280 (https://doi.org/10.5256/f1000research.170607.r319009)
Version 1
PUBLISHED 14 Mar 2023
Reviewer Report 27 Sep 2023
Daniela Holtgräwe, Genetics and Genomics of Plants, CeBiTec and Faculty of Biology, Bielefeld University, Bielefeld, Germany 
Not Approved
The present manuscript deals with genomic prediction of the presumably quantitative trait 'Rhizomania resistance' in sugar beet using genome-wide SNP data. The paper presents bioinformatic and ML-based calculations using all SNPs individually, or in pairs and adding the SNP information ...
HOW TO CITE THIS REPORT
Holtgräwe D. Reviewer Report For: Improving genomic prediction of rhizomania resistance in sugar beet (Beta vulgaris L.) by implementing epistatic effects and feature selection [version 2; peer review: 1 approved with reservations, 3 not approved]. F1000Research 2024, 12:280 (https://doi.org/10.5256/f1000research.143945.r196899)
  • Author Response 28 Aug 2024
    Thomas Martin Lange, Breeding Informatics Group, University of Göttingen, Göttingen, 37075, Germany
    Reviewer Comment:
    (1) In the introduction, the individual genes identified to confer resistance to Rhizomania should be mentioned and the underlying resistance mechanism, if known, should be briefly described. At ...
Reviewer Report 12 Sep 2023
Muhammad Massub Tehseen, Department of Plant Sciences, North Dakota State University, Fargo, North Dakota, USA 
Not Approved
This paper aimed at comparing four methods to investigate genomic prediction models to predict rhizomania resistance in sugar beet. The topic is of general interest and the findings could be used in sugar beet breeding programs targeting rhizomania resistance. ...
HOW TO CITE THIS REPORT
Tehseen MM. Reviewer Report For: Improving genomic prediction of rhizomania resistance in sugar beet (Beta vulgaris L.) by implementing epistatic effects and feature selection [version 2; peer review: 1 approved with reservations, 3 not approved]. F1000Research 2024, 12:280 (https://doi.org/10.5256/f1000research.143945.r201121)
  • Author Response 28 Aug 2024
    Thomas Martin Lange, Breeding Informatics Group, University of Göttingen, Göttingen, 37075, Germany
    Reviewer Comments:
    (1) First of all there is no clear description of the population panel used in the current study, the authors reported a panel of 156 genotypes followed by ...
Reviewer Report 28 Jul 2023
J Mitchell McGrath, USDA-ARS Sugarbeet and Bean Research Unit, Michigan State University, East Lansing, Michigan, USA 
Not Approved
First review of draft of "Improving genomic prediction of rhizomania resistance in sugar beet (Beta vulgaris L.) by implementing epistatic effects and feature selection." by Lange et al. (doi.org/10.12688/f1000research.131134.1) for potential indexing.

This manuscript details computational investigations ...
HOW TO CITE THIS REPORT
McGrath JM. Reviewer Report For: Improving genomic prediction of rhizomania resistance in sugar beet (Beta vulgaris L.) by implementing epistatic effects and feature selection [version 2; peer review: 1 approved with reservations, 3 not approved]. F1000Research 2024, 12:280 (https://doi.org/10.5256/f1000research.143945.r181537)
  • Author Response 28 Aug 2024
    Thomas Martin Lange, Breeding Informatics Group, University of Göttingen, Göttingen, 37075, Germany
    Reviewer Comment:
    (1) It seems that a study to dissect additional components of rhizomania resistance would include analysis of the rhizomania resistance genes. The authors state that the germplasm used ...
