Mango quality prediction based on near-infrared spectroscopy using multi-predictor local polynomial regression modeling [version 1; peer review: awaiting peer review]

Background: pH and total soluble solids (TSS) are important quality parameters of mangoes; they represent the acidity and sweetness of the fruit, respectively. This study predicts the pH and TSS of intact mangoes based on near-infrared (NIR) spectroscopy using multi-predictor local polynomial regression (MLPR) modeling. The prediction performance of kernel partial least squares regression (KPLSR), support vector machine regression (SVMR), and MLPR is compared. Methods: A total of 186 intact mango samples at three different maturity stages were used. Prediction models were built using MLPR, KPLSR, and SVMR based on untreated and treated spectra. The best regression model for predicting pH was MLPR based on Gaussian filter smoothing spectra, whereas the TSS value was more accurately predicted using MLPR based on Savitzky–Golay smoothing. Results: The findings reveal that MLPR is highly accurate in estimating the pH and TSS of mangoes, with mean absolute percentage error (MAPE) values of less than 10%. In addition, the MLPR model has the best predictive performance, with the lowest mean squared error (MSE) and root mean squared error (RMSE) values.


Introduction
Indonesia's Gadung Klonal 21, commonly known as Avomango, is a popular mango cultivar owing to its thick flesh, low fiber content, and sweet flavor. 1 Avomango can be eaten in the same way as an avocado. It was developed in Pasuruan Regency, East Java. Even now, fruit pickers hand-pick mangoes and must determine whether the mango is sufficiently mature for picking. Maturity indices determine the quality and shelf life of harvested fruits, and provide the necessary flexibility for transport and marketing. 2 Harvest maturity is the stage of development in climacteric fruits, such as mangoes, when the fruit is harvest-ready and of an acceptable consumer grade. Consumer maturity is achieved when the fruit is ready for consumption or utilized in other ways. In climacteric fruits, consumer maturity is reached after harvest maturity. 3 Mature mangoes have a low pH value and high Brix percentage; the pH value increases throughout the maturation period.
Fruit maturity indices can be estimated accurately using destructive methods. However, these methods damage the fruit, are time-consuming and labor-intensive, and require manual loading. 4 Furthermore, they are costly, require a long time for sample preparation, and are wasteful. 5 Thus, rapid, non-destructive, and environment-friendly analytical methods are required. Non-destructive techniques, such as visible imaging, colorimetry, visible and near-infrared (NIR) spectroscopy, computed tomography, hyperspectral imaging, fluorescence imaging, and multispectral imaging, have therefore been developed to evaluate fruit maturity. 3 NIR spectroscopy is the most widely used non-destructive technique in post-harvest fruit and vegetable quality determination. 6 Moreover, NIR spectroscopy techniques have been used to overcome the limitations of destructive methods while maintaining the physicochemical attributes of food and agricultural products. In addition, the use of NIR spectroscopy supports sustainable production, which is one of the sustainable development goals (SDGs).
Physicochemical, physical, and biological changes occur in mangoes during ripening. These changes differ depending on the mango variety. In Gadung Klonal 21, the color does not reflect the stage of maturity because the green color of the mature mango is similar to that of the raw mango. The texture of mature mangoes differs significantly; the tip of the fruit has a soft texture. A significant change is observed in the level of sweetness and acidity of mangoes, which can be measured directly only by destructive analysis. However, numerous studies have predicted the sweetness and acidity of mangoes using regression modeling based on spectral data from NIR spectroscopy. [7][8][9][10][11] Studies have also been conducted to determine and forecast the maturity and quality of mangoes. [7][8][9][10][11][12] To estimate the internal quality of mangoes, spectral data must be modeled using regression methods, either as raw spectra or as pre-processed spectral data. Most previous studies have modeled spectral data using linear or nonlinear regression. Among the linear methods, partial least squares regression (PLSR) and principal component regression are the two most popular approaches for NIR calibration. 13 Linear regression is commonly used in parametric regression techniques to predict the internal quality of fruit. Nonlinear regression methods also have potential for this application. Forecasting changes in the quality of agricultural products requires predictive modeling that considers the effectiveness of a nonlinear regression model. 14 To predict fruit quality, Nicolaï, Theron, and Lammertyn 20 employed nonlinear regression analysis involving the kernel partial least squares regression (KPLSR) method. Anderson, Walsh, Flynn, and Walsh 15 reported using local PLSR to estimate the dry matter content of mango. Nevertheless, the application of nonlinear regression to fruit quality prediction remains relatively underexplored.
A nonparametric regression approach can be used to model unpatterned data, including the nonlinear cases. Nonparametric regression for predicting the acidity level of mangoes was studied by Ulya and Chamidah. 22,23 The prediction of the sweetness level of mangoes has been reported by Ulya, Chamidah, and Saifudin. 16 These studies report that nonparametric regression based on local polynomial estimators, particularly, multiple polynomial regression (MPR), results in a better predictive model than the parametric regression approach.
Local polynomial regression can capture nonlinear patterns between the response and the predictor variables. 17 This method looks at neighboring data within the specified bandwidth, fits separate piecewise regressions for each part, and combines them. 18 The local regression fit is complete when all the data points are identified using the regression function values. Function calculation in local polynomial and local linear regression is performed locally at the point to be estimated. 19 This is different from spline regressions. [20][21][22][23] This local estimation technique captures nonlinearities that may exist without the influence of dataset outliers at each estimation stage. 24 This method is data-based and easy to implement. Compared with multiple linear regression (MLR), it provides a flexible structure that can capture the nonlinear characteristics present in the data. 25 Mango pH value prediction using multi-predictor local polynomial regression (MLPR) and MPR was investigated by Ulya et al. 26 The results indicate that the MLPR method provides better predictive performance, with a lower mean absolute percentage error (MAPE) value, than the MPR method. Other studies have attempted to overcome the problem of nonlinearity when predicting internal fruit quality based on NIR spectroscopy using KPLSR 27 and support vector machine regression (SVMR). 28,29 These two approaches perform better than MLR and PLSR. However, to date, no study has compared the predictive performance of nonparametric regression approaches, such as MLPR, with that of KPLSR and SVMR in predicting the internal quality of mangoes.
This study aims to compare the performance of a mango pH and total soluble solids (TSS) prediction model based on MLPR with that of KPLSR- and SVMR-based models. MLPR was found to be the best regression model; it exhibited the lowest mean squared error (MSE), root mean squared error (RMSE), and MAPE values, and the highest R² value. The MLPR algorithm in this study is useful for designing instruments to detect the acidity and sweetness of intact mangoes.

Sample preparation
A total of 186 mangoes (Mangifera indica L., Gadung Klonal 21) were collected from a garden in Wonokerto Village, Sukorejo District, Pasuruan Regency, Indonesia. Mango samples weighing 250-300 g at varying stages of ripeness, ranging from unripe to ripe, were chosen. The mangoes were cleaned and air-dried before being wrapped in Styrofoam fruit netting, and were subsequently placed in boxes (approximately 12 mangoes per box). Fruit boxes were screened to avoid collisions.

NIR spectra data acquisition
The spectral data for intact mangoes were acquired using an NIR spectrometer (OtO Photonics, Inc.) in the range of 900-1650 nm at 7 nm intervals. The samples were scanned in reflectance mode to record the spectral data. During measurement, a halogen lamp illuminated the sample at an angle of 45°, and the detector was also positioned at 45° to the sample. Each sample was scanned at three separate locations (the base, center, and tip of the fruit), and the scans obtained for each sample were averaged. The spectral data were originally recorded as reflectance values (R) and were later converted to absorbance values, log(1/R).
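The averaging and reflectance-to-absorbance conversion described above can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: the function name `mean_absorbance`, the base-10 logarithm, the choice to average before converting, and the reflectance values are all assumptions made here.

```python
import math

def mean_absorbance(scans):
    """Average the reflectance scans of one fruit, then convert to absorbance log10(1/R).

    `scans` is a list of scans (e.g. base, center, tip), each a list of
    reflectance values in (0, 1], one per wavelength.
    """
    n_scans = len(scans)
    n_wavelengths = len(scans[0])
    mean_reflectance = [
        sum(scan[j] for scan in scans) / n_scans for j in range(n_wavelengths)
    ]
    # Absorbance is taken as log10(1/R); the base of the logarithm is assumed.
    return [math.log10(1.0 / r) for r in mean_reflectance]

# Hypothetical reflectance values at three wavelengths for three scan positions.
scans = [
    [0.50, 0.40, 0.10],  # base
    [0.50, 0.50, 0.10],  # center
    [0.50, 0.60, 0.10],  # tip
]
absorbance = mean_absorbance(scans)
print(absorbance)
```

Averaging after conversion would give slightly different values because the logarithm is nonlinear; the text does not state which order was used.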

Spectral data pre-treatment
Before the prediction models are developed, spectral pre-treatment can eliminate undesired effects, including random noise, high-frequency noise, light scattering, baseline shifts, and other external effects caused by environmental or instrumental factors. In particular, smoothing effectively reduces high-frequency noise. Among the numerous smoothing approaches in the field, Savitzky-Golay (SG) smoothing is one of the most widely used. 30 SG smoothing can retain signal properties, including the relative maxima and minima and the peak width, which other smoothing techniques tend to lose. The present work pre-treated the spectra using SG smoothing with a second-degree polynomial, Gaussian filter smoothing (GFS), and multiplicative scatter correction (MSC).
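The three pre-treatments named above can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the study used The Unscrambler X for pre-processing, and the window length (11 points), the Gaussian width (sigma = 2 points), the edge padding, and the synthetic spectra are hypothetical choices; only the second-degree polynomial comes from the text.

```python
import numpy as np

def savgol_smooth(spectra, window=11, polyorder=2):
    """Savitzky-Golay smoothing along the wavelength axis (rows are samples)."""
    half = window // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    # Least-squares polynomial fit in each window: row 0 of the pseudo-inverse
    # holds the weights that evaluate the fitted polynomial at the window centre.
    design = np.vander(offsets, polyorder + 1, increasing=True)
    weights = np.linalg.pinv(design)[0]
    padded = np.pad(spectra, ((0, 0), (half, half)), mode="edge")
    return np.array([np.convolve(row, weights[::-1], mode="valid") for row in padded])

def gaussian_smooth(spectra, sigma=2.0):
    """Gaussian filter smoothing along the wavelength axis."""
    radius = int(4 * sigma + 0.5)
    offsets = np.arange(-radius, radius + 1, dtype=float)
    weights = np.exp(-0.5 * (offsets / sigma) ** 2)
    weights /= weights.sum()  # normalise so a flat spectrum stays flat
    padded = np.pad(spectra, ((0, 0), (radius, radius)), mode="edge")
    return np.array([np.convolve(row, weights, mode="valid") for row in padded])

def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference (default: mean spectrum)."""
    reference = spectra.mean(axis=0) if reference is None else reference
    corrected = np.empty_like(spectra, dtype=float)
    for i, row in enumerate(spectra):
        # Fit row = intercept + slope * reference, then remove both effects.
        slope, intercept = np.polyfit(reference, row, 1)
        corrected[i] = (row - intercept) / slope
    return corrected

# Synthetic absorbance spectra (hypothetical): one band with additive and
# multiplicative offsets plus a little noise, on 112 wavelengths.
rng = np.random.default_rng(0)
wavelengths = np.linspace(900.0, 1650.0, 112)
base = 0.4 + 0.3 * np.exp(-(((wavelengths - 1450.0) / 80.0) ** 2))
raw = np.vstack([a + b * base for a, b in [(0.00, 1.0), (0.05, 1.1), (-0.03, 0.9)]])
raw = raw + rng.normal(scale=0.005, size=raw.shape)

sg_spectra = savgol_smooth(raw)
gfs_spectra = gaussian_smooth(raw)
msc_spectra = msc(raw)
```

The two smoothers act on each spectrum independently, whereas MSC depends on the reference spectrum, so the reference computed on calibration data would normally be reused for validation samples.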

Measurement of pH and TSS
Mango samples (10 g per sample) were blended with 40 mL of distilled water in a fruit blender. The pH of the mango juice was measured using a digital pH meter (Lutron pH-208). Triplicate measurements were performed to obtain average values. A small amount of mango juice was dropped onto a pocket digital refractometer (ATAGO PAL-1) to record the TSS, expressed in degrees Brix (°Bx). The measurements were conducted at room temperature after spectral acquisition.

Statistical analysis
The pH, TSS, and spectral data were organized into a matrix. The matrix rows represent the 186 samples, and the 114 columns represent the response (Y) and predictor (X) variables. The response variables are the measured pH and TSS values associated with each sample, stored in the first and second columns, respectively. The predictor variables are the absorbance values at the 112 NIR wavelengths recorded for each mango sample.
The subsequent steps were to reduce the dimension using principal component analysis (PCA), detect outliers using Hotelling's T² ellipse method, and then model the data using KPLSR, SVMR, and MLPR. The analysis was performed using calibration and validation models of the pH and TSS values. The Unscrambler X 10.4 software was used to perform spectral pre-processing and model development for the pH and TSS values. The open-source software R was used to perform the MLPR method. For the calibration and validation models, the reference versus predicted pH values were plotted to investigate the nature of the spectral absorbance distribution.
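The PCA and Hotelling's T² screening step can be sketched as below. This is a minimal illustration rather than the procedure implemented in The Unscrambler X: the function names, the 5% significance level, and the limit formula A(n − 1)/(n − A) · F(A, n − A) are assumptions (software packages use slightly different forms of the limit), and the closed-form F quantile used here holds only for two components.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project mean-centred data onto its leading principal components via SVD."""
    centred = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(centred, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)  # fraction of variance per component
    return scores, explained[:n_components]

def t2_limit_two_components(n_samples, alpha=0.05):
    """Hotelling's T2 limit for exactly two components (assumed form of the limit)."""
    m = n_samples - 2
    # Closed-form upper quantile of F(2, m): (m / 2) * (alpha ** (-2 / m) - 1).
    f_crit = 0.5 * m * (alpha ** (-2.0 / m) - 1.0)
    return 2.0 * (n_samples - 1) / m * f_crit

def hotelling_t2_outliers(scores, alpha=0.05):
    """Return T2 per sample, the limit, and a boolean outlier flag per sample."""
    n_samples, n_components = scores.shape
    if n_components != 2:
        raise ValueError("this sketch only handles two components")
    t2 = np.sum(scores**2 / scores.var(axis=0, ddof=1), axis=1)
    limit = t2_limit_two_components(n_samples, alpha)
    return t2, limit, t2 > limit

# Hypothetical data: 60 samples, 10 variables, one grossly shifted sample.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
X[0] += 25.0
scores, explained = pca_scores(X, n_components=2)
t2, limit, is_outlier = hotelling_t2_outliers(scores)
print("flagged samples:", np.flatnonzero(is_outlier))
```

Samples whose T² exceeds the limit fall outside the ellipse in the score plot of the first two components and are candidates for removal before calibration.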

Modeling using different calibration methods
The dataset was divided into two parts: 80% as calibration data and the remaining 20% as validation data. For multivariate calibration, the data were modeled using the parametric regression methods KPLSR and SVMR. Additionally, the data were modeled using the nonparametric regression method MLPR. 26 Subsequently, predictions were made on the validation dataset based on the model developed for the calibration dataset. The prediction performance of the three methods was compared in this study.

Multi-predictor local polynomial regression (MLPR)
MLPR for predicting the internal quality of fruits was proposed by Ulya et al. 16 The prediction was obtained using a nonparametric regression approach based on a local polynomial estimator with one response variable and multiple predictors. In the MLPR model, the response variable y depends on the sum of smooth functions of the predictor variables x, plus a random error term.
The parameters were estimated using the weighted least squares (WLS) method, that is, by minimizing a weighted sum of squared residuals in which the weight of each observation is the product of kernel functions K(·) evaluated for each predictor. This study used a Gaussian kernel.
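In standard notation, the model, the WLS criterion, and the Gaussian kernel described above can be sketched as follows. The symbols m_j, β, p, d, x_{j0}, and h_j are assumptions introduced here for illustration (p predictors, polynomial degree d, estimation point x_0, and one bandwidth per predictor), and the authors' exact formulation may differ.

```latex
% Additive model: one response, p predictors
y_i = \sum_{j=1}^{p} m_j(x_{ji}) + \varepsilon_i, \qquad i = 1, \dots, n

% Local polynomial WLS criterion at the estimation point x_0 = (x_{10}, \dots, x_{p0})
\min_{\beta} \sum_{i=1}^{n}
  \Bigl[ y_i - \beta_0 - \sum_{j=1}^{p} \sum_{k=1}^{d} \beta_{jk} (x_{ji} - x_{j0})^{k} \Bigr]^{2}
  \prod_{j=1}^{p} K\!\left( \frac{x_{ji} - x_{j0}}{h_j} \right)

% Gaussian kernel
K(u) = \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{u^{2}}{2} \right)
```

Under this formulation, the fitted value at x_0 is the estimated intercept β_0, because every polynomial term vanishes at the estimation point.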
In addition, the optimum bandwidth (h), which acts as the smoothing parameter in the estimation process, must be determined. If the bandwidth decreases, the regression estimate becomes rougher, and vice versa. The optimum bandwidth is the one that minimizes the generalized cross-validation (GCV) value. 31
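The estimation and bandwidth selection described above can be sketched as follows. This is not the authors' R implementation: it is a minimal local polynomial smoother with a product Gaussian kernel, and the function names, the single shared bandwidth, the candidate grid, the synthetic data, and the GCV form MSE(h) / (1 − tr(H)/n)² are assumptions. The product kernel is only practical for a few predictors, so the sketch uses two, in the spirit of modeling on a small number of components.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def smoother_row(X, x0, h, degree=1):
    """Weights l(x0) such that the local polynomial fit at x0 equals l(x0) @ y."""
    diff = X - x0                                  # (n, p) distances to x0
    weights = np.prod(gaussian_kernel(diff / h), axis=1)
    columns = [np.ones(len(X))]                    # intercept = fitted value at x0
    for k in range(1, degree + 1):
        columns.append(diff**k)                    # per-predictor polynomial terms
    Z = np.column_stack(columns)
    ZtW = Z.T * weights
    # First row of (Z'WZ)^-1 Z'W picks out the WLS intercept.
    return np.linalg.pinv(ZtW @ Z)[0] @ ZtW

def mlpr_predict(X_train, y_train, X_new, h, degree=1):
    """Predict the response at each row of X_new by local WLS fits."""
    return np.array([smoother_row(X_train, x0, h, degree) @ y_train for x0 in X_new])

def gcv(X, y, h, degree=1):
    """Generalized cross-validation score for bandwidth h (assumed standard form)."""
    H = np.array([smoother_row(X, x0, h, degree) for x0 in X])  # hat matrix
    residuals = y - H @ y
    n = len(y)
    return np.mean(residuals**2) / (1.0 - np.trace(H) / n) ** 2

def select_bandwidth(X, y, candidates, degree=1):
    """Return the candidate bandwidth with the smallest GCV score, and all scores."""
    scores = [gcv(X, y, h, degree) for h in candidates]
    return candidates[int(np.argmin(scores))], scores

# Hypothetical data: two predictors and a nonlinear additive response.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(60, 2))
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=60)
best_h, gcv_scores = select_bandwidth(X, y, [0.1, 0.2, 0.4, 0.8])
y_hat = mlpr_predict(X, y, X, best_h)
print("selected bandwidth:", best_h)
```

A small bandwidth makes the trace of the hat matrix approach n, which inflates the GCV denominator penalty; this is how GCV guards against the overly rough fits mentioned in the text.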

Model evaluation
Several methods are available for assessing the predictive performance of an algorithm. One such method is k-fold cross-validation, wherein the data are randomly divided into k parts; the model is trained on k − 1 parts and tested on the remaining part, and the procedure is repeated until each part has been used once for testing. 32 This method can reduce sampling bias because the data are randomly divided into several (k) parts. 33 The final accuracy is the average of the accuracies obtained over the k repetitions. 34 In this study, five-fold cross-validation was used (Figure 1). The 165 samples were split into calibration and validation data, with 80% used for calibration and the remaining 20% for validation.
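The five-fold split of the 165 samples can be sketched as follows. The function name `k_fold_indices`, the random seed, and the use of sample indices in place of the actual spectra are assumptions for illustration.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Return k (calibration, validation) index pairs from a random partition."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint parts
    splits = []
    for i in range(k):
        validation = folds[i]
        calibration = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((calibration, validation))
    return splits

splits = k_fold_indices(165, k=5)
for fold_number, (calibration, validation) in enumerate(splits, start=1):
    print(fold_number, len(calibration), len(validation))
```

With 165 samples and k = 5, every fold holds 132 calibration and 33 validation samples, and each sample is used for validation exactly once.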
Generally, in studies on predicting the internal quality of fruits using NIR spectroscopy, the predictive performance and accuracy of the models are evaluated on the validation dataset. Previous studies have used R² and RMSE to evaluate predictive models. This study additionally evaluated the predictive models using MSE and MAPE values. MAPE is the most frequently used measure of forecasting accuracy. 35 It has several desirable characteristics, including reliability, unit-free measurement, interpretability, clarity of presentation, statistical evaluation support, and utilization of all error information. 36 These four criteria (R², RMSE, MSE, and MAPE) are suitable for comparing the predictive performances of parametric and nonparametric regressions. Generally, a good model must have a high R² value and low RMSE, MSE, and MAPE values. The RMSE, R², 29 MAPE, 36 and MSE 37 formulas, together with the overall RMSE, MSE, R², and MAPE formulas, are defined in Equations (5)-(12), where ŷ_i is the estimated value of the i-th response variable; y_i is the measured value of the response variable; ȳ is the average measured value of the response variable; n is the number of observations; C denotes the calibration data; and V denotes the validation data.
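The four criteria can be computed as in the sketch below, which uses their standard definitions. Two points are assumptions: R² is taken as 1 − SSE/SST (some NIR studies report the squared correlation instead), and the "overall" versions that combine calibration and validation data are not reproduced. The sample values are hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAPE (in percent), and R2 for paired observations."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    sse = sum(e * e for e in errors)
    mse = sse / n
    rmse = math.sqrt(mse)
    # MAPE divides by the measured value, so measured values must be non-zero.
    mape = 100.0 / n * sum(abs(e / t) for e, t in zip(errors, y_true))
    mean_y = sum(y_true) / n
    sst = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1.0 - sse / sst
    return {"MSE": mse, "RMSE": rmse, "MAPE": mape, "R2": r2}

# Hypothetical measured and predicted pH values.
measured = [4.0, 5.0, 6.0]
predicted = [4.2, 5.0, 5.7]
metrics = regression_metrics(measured, predicted)
print(metrics)
```

Because MAPE is unit-free, it allows the pH and TSS models to be compared on the same scale, whereas MSE and RMSE remain in the units of each response.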
The stages of developing the prediction models and measuring the predictive performance in this study are summarized in Figure 2.

Pre-treatment
The raw absorbance spectra of the 186 mango samples at three different maturity stages, acquired with the spectrometer over the wavelength range of 900-1650 nm, are shown in Figure 3(a). The spectral data were pre-treated with SG smoothing, a Gaussian filter, and MSC to reduce high-frequency noise, as shown in Figures 3(b)-(d). The MSC technique was used to correct the data by approximating the additive and multiplicative effects of the spectra. 38 After pre-processing, the dimension of the absorbance spectral data was reduced by PCA using the singular value decomposition algorithm. Two latent variables representing 99.75% of the variance were selected. Spectral outliers were identified using PCA, subject to Hotelling's T² ellipse. In this study, 21 outliers were identified. These outliers were excluded because they could have a negative impact on the model. Sample outliers may provide helpful information, but they can also be non-representative samples that contribute to errors in a model. 39 The final sample consisted of 165 observations, divided into calibration and validation datasets with five-fold cross-validation (see Figure 1). Each fold consisted of 132 calibration data samples and 33 validation data samples.

Descriptive statistical values of the pH and TSS of the mangoes
The descriptive statistical values for the measured pH and TSS are presented in Table 1. The robustness of the calibration models was evaluated using five-fold cross-validation. Three different calibration models (KPLSR, SVMR, and MLPR) were developed using the calibration dataset for each pre-treatment method (raw, SG smoothing, Gaussian filter, and MSC) to predict the pH and TSS values of the mangoes.

Predictive performance comparison of pH value
Table 2 presents the calibration and validation results for pH prediction using NIR spectroscopy. The three regression methods provided more robust models with GFS-treated spectra than with untreated, SG-smoothed, or full MSC spectra. In both calibration and validation, the MSC spectral model yielded higher MAPE values than the other two spectral treatments (SG and Gaussian smoothing). The raw spectra model was better than the MSC model but not better than the Gaussian and SG models. Thus, apart from MSC, the raw spectra model had a worse predictive performance than the models based on pre-treated spectra.
Among the three regression methods, MLPR was the best for predicting the pH value of mangoes for all spectral treatments. The MLPR prediction model had the highest R² value and the lowest MSE, RMSE, and MAPE values compared with the KPLSR and SVMR methods.

Predictive performance comparison of TSS value
Predictive performance comparisons of the TSS values are listed in Table 3. The predictive models for TSS performed worse than those for pH. The pH prediction models combined a high R² value with a low MAPE value (3.4-5.8%), so the predicted pH values were closer to the observed values. Although the R² value was also high in the TSS prediction models, the MAPE value was relatively high (6.4-8.1%). However, the MAPE value of the TSS models is still classified as highly accurate. 36 Pre-processing the spectra using SG smoothing gave the best predictive model for TSS, with the lowest RMSE, MSE, and MAPE values and the highest R² value.

Discussion
pH value
MLPR, SVMR, and KPLSR all exhibited excellent predictions of the pH of mangoes. MLPR using GFS spectra provided the best overall model for pH prediction, with R² = 0.832, RMSE = 0.187, MSE = 0.036, and MAPE = 3.389% (Table 2). For every spectral treatment, the best predictive model was obtained with MLPR, which had the highest R² value and the lowest RMSE, MSE, and MAPE values compared with the other regression methods. This is consistent with the results of the previous studies by Ulya et al. 16,26 MLPR is a novel method for predicting the internal quality of fruits developed by Ulya et al. 16,26 This study confirms that the MLPR method can produce a robust predictive model for determining the internal quality of mangoes; MLPR (nonparametric regression) achieves a lower MAPE value than MPR (parametric regression).
Among the three regression methods, MLPR exhibited the lowest MAPE values for all spectral treatments. With the Gaussian filter spectra, MLPR provided the highest R² and the lowest RMSE, MSE, and MAPE for both the calibration and validation data compared with the other methods, as shown in Figure 4. Even with the overall data, the R², RMSE, MSE, and MAPE values were better for MLPR than those of the other methods, which indicates that the MLPR method provides an accurate prediction for all spectral data, with MAPE <10% 36 and low RMSE values. The MLPR model also had a high R² (0.82-0.9), thus indicating good predictive ability. 40 The KPLSR method performed the worst in predicting the pH of mangoes. Only a few studies have predicted fruit characteristics using KPLSR. Most studies use PLSR because of its simplicity and small calculation volume. Partial least squares (PLS) is a linear method of data analysis. 41 Based on untreated spectra, Nicolaï et al. 27 used KPLSR to predict apple sugar content.
The prediction of the sugar content of Gannan Navel oranges based on several treated spectra was reported by Liu, 39 where KPLSR, particularly the spline PLS model, was superior to the others, with an R² of 0.87, a validation RMSE of 0.47 °Brix, and a standard deviation ratio of 2.34. Kernel PLS is suitable for dealing with nonlinear phenomena, which may arise from changes in the chemical interactions of the fruit matrix because unripe and ripe fruits have different structures and varieties. 42

TSS value
The best regression method for predicting the TSS value of mangoes was MLPR based on SG smoothing. The calibration, validation, and overall models of the MLPR method based on all spectral treatments had higher R² values and lower RMSE, MSE, and MAPE values than the other methods. The MAPE value is higher than that of the pH prediction model but is still less than 10%, thus categorizing the method as highly accurate in forecasting. 36 Figure 5 shows the MLPR predictive performance based on SG smoothing. Previous research has shown that a small set of reference attributes and spectral behavior changes influenced by cultivar, fruit size, and fruit origin significantly impact model robustness. 6,43,44 Moreover, the prediction performance was affected by the lack of variability in the calibration model. When validated by samples outside the prediction model range, the prediction model performance in a study investigating the total acid content of Japanese plums decreased. 43 Subedi, Walsh, and Owens 45 reported that a TSS prediction model developed from fruits at late stages of ripening failed to predict the TSS of fruits at earlier stages of ripening.
The best spectral pre-treatment for predicting the TSS value was SG smoothing. Overall, MLPR with SG smoothing spectra was the best model, with an R² value of 0.805, RMSE value of 0.436, MSE value of 0.192, and MAPE value of 6.454%. This agrees with previous findings 36 that the SG smoothing spectral model for predicting the tannin content of persimmon fruit is better than the MSC model. The R² values for SG smoothing and MSC were 0.107 and 0.016, respectively. 29 In contrast, a previous study 28 reported that the prediction of mango TSS values using SVMR based on extended MSC gave a validation R² of 0.86 and an RMSE of 0.66.
Generally, for all spectral treatments and regression methods, the calibration model had a higher R² value than the validation model, with a small gap between them; this indicates that the k-fold cross-validation method can balance the prediction results of the calibration and validation datasets. The k-fold cross-validation method can reduce bias in sampling. 33 If the test matrix method is used, the R² of the validation models would be less than the R² of the calibration model. Louw and Theron 43 reported that the prediction model's performance for the total acid content of Japanese plums decreased when samples outside the prediction model range were validated.

Conclusions
Prediction of the internal quality of mangoes, including pH and TSS, can be performed rapidly and non-destructively using NIR spectroscopy. Spectral pre-treatment, such as SG smoothing, GFS, and MSC, affects the performance of prediction models built with KPLSR, SVMR, and MLPR. The best regression model for pH prediction is MLPR based on GFS spectra. In addition, KPLSR, SVMR, and MLPR based on raw spectra, SG smoothing, and MSC also provided highly accurate prediction performance, with MAPE values of less than 10%, low MSE and RMSE, and high R².
The best regression model for TSS prediction was MLPR based on SG smoothing. In addition, KPLSR, SVMR, and MLPR based on raw spectra, GFS, and MSC also provided highly accurate prediction performance, with MAPE values of less than 10%, low MSE and RMSE, and high R². We believe that NIR spectroscopy can be used to determine the internal quality of mangoes. However, further research is required to improve the prediction model performance for TSS values using MLPR based on a combination of several spectral pre-treatments. In conclusion, NIR spectroscopy combined with nonparametric regression MLPR could become a rapid and non-destructive alternative method for predicting the internal quality of mangoes.
