
F1000Research

2046-1402

F1000 Research Limited

London, UK

10.12688/f1000research.169436.2

Research Article

Articles

GBWOEM: A Gradient-Based Weight Optimization Model for Improved Predictive Accuracy in Healthcare

[version 2; peer review: 1 approved, 2 approved with reservations]

Surajit Das ¹: Data Curation, Formal Analysis, Software, Writing – Original Draft Preparation

Samaleswari P. Nayak ²: Resources, Supervision, Validation

Biswajit Sahoo ¹ (https://orcid.org/0000-0003-1355-3395): Conceptualization, Methodology, Project Administration

Satyananda Champati Rai ¹ ᵃ (https://orcid.org/0000-0002-4237-4591): Funding Acquisition, Investigation, Visualization, Writing – Review & Editing

¹ School of Computer Engineering, Kalinga Institute of Industrial Technology, Bhubaneswar, Odisha, 751024, India

² Department of Computer Science and Engineering, Silicon University, Bhubaneswar, Odisha, 751024, India

a satya.raifcs@kiit.ac.in

No competing interests were disclosed.

27 1 2026

2025

1161

22 1 2026

2026

This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

The use of ensemble learning has been crucial for improving predictive accuracy in healthcare, especially with regard to critical diagnostic and classification problems. Ensemble models combine the strengths of multiple ML models and reduce the risk of misclassification, which is important in healthcare, where accurate predictions impact patient outcomes.

Methods

This study introduces the Gradient-Based Weight Optimized Ensemble Model (GBWOEM), an ensemble technique that uses gradient-based optimization to learn the weights of five base models: Decision Tree Classifier (DTC), Random Forest Classifier (RFC), Logistic Regression (LR), Multi-Layer Perceptron (MLP), and K-Nearest Neighbours (KNN). Two variants, GBWOEM-R (random weight initialization) and GBWOEM-U (uniform weight initialization), were proposed and tested on five healthcare-related datasets: Breast Cancer, Pima Indians Diabetes Database, Diabetic Retinopathy Debrecen, Obesity Level Estimation Based on Physical Condition and Eating Habits, and Thyroid Disease.

Results

The test accuracy of the proposed models was 0.48% to 8.26% higher than that of traditional ensemble models such as Adaboost, Catboost, GradientBoost, LightGBM, and XGBoost. Performance metrics, including ROC-AUC analyses, confirmed the model's efficacy in handling imbalanced data, highlighting its potential for advancing predictive consistency in healthcare applications.

Conclusion

The GBWOEM model improves predictive accuracy and offers a reliable solution for healthcare applications, even when dealing with imbalanced data. This strategy has the potential to improve patient outcomes and diagnostic consistency in healthcare settings.

Ensemble Learning, Healthcare, Weight Optimization, GBWOEM, Classification, ROC Curve, Machine Learning, Predictive Accuracy, AUC

Kalinga Institute of Industrial Technology

The author(s) declared that no grants were involved in supporting this work.

Revised Amendments from Version 1

The manuscript has been revised in response to the reviewers' helpful comments on model interpretability, data accuracy, and generalizability. To ease interpretation, we expanded the Results section to show numerically how the learned weights control the contribution of each base model, with updated figures comparing our method against the random and uniform weight strategies. With respect to data processing, we have explained our rationale for preserving the natural data distributions and shown that the proposed model can learn minority-class boundaries without synthetic oversampling. We additionally introduced a full train-versus-test analysis to confirm the high AUC values and rule out overfitting; see the new statement in the Results section: 'the agreement between training-test metrics supports that the model is robust to new data'.

Introduction

In recent times, the use of ML in healthcare has gained enormous popularity owing to the increasing need for high-quality, timely, and efficient predictions, which are necessary for reliable diagnosis as well as for assisting patients and physicians in treatment planning. Because healthcare is intrinsically steeped in complex, high-stakes decision-making, prediction errors can be catastrophic for patient care. This has led to the need for machine learning models that are not only accurate, but also transparent, generalizable across broad patient populations and practice settings, and robust against small errors in input features. Ensemble learning is a powerful approach among ML methodologies. It improves the overall performance of the system by combining multiple base models into a single model in which the base models reinforce each other's strengths while covering their individual weaknesses. ¹ This model aggregation decreases the variance and reduces bias and overfitting. This is particularly important for healthcare datasets because they are often complex, that is, high-dimensional, noisy, imbalanced, or even insufficient. Consequently, ensemble learning has been applied successfully in many healthcare applications, such as disease prediction, medical image analysis, and patient outcome forecasting. ²,³

Recently, ensemble methods, such as Bagging, ⁴ Boosting ⁵ and Stacking, ⁶ have been the focus of many predictive systems because they promise state-of-the-art performance in many domains. The best-known bagging technique is Random Forest, which improves the stability and accuracy of models by training many weak learners independently and then combining their predictions. Boosting algorithms such as Adaboost, GradientBoost, XGBoost, and Catboost train learners sequentially, so that each new learner concentrates on correcting the errors of the previous ones, which improves the accuracy of the overall ensemble. Stacking, in contrast, trains a meta-model on the predictions of various base models, which usually results in more sophisticated predictions. When applied in healthcare, these methods have shown important results, allowing researchers and clinicians to develop models that can more accurately predict disease onset, severity, and treatment effectiveness.

However, there is scope for improvement in tuning the contribution of each base model to the ensemble. Traditional approaches tend to treat all base models equally or to correct model weights using ad-hoc rules. Because they overlook the particular complexities of healthcare data, most of these approaches understandably fall short. In healthcare applications, imbalanced datasets are a common issue: the data contain only a few positive samples, and if this is not handled properly, the resulting model is biased. Furthermore, the heterogeneity of healthcare datasets, where feature spaces and data distributions vary widely, presents a significant challenge for traditional ensemble methods. ⁷,⁸

To address these challenges, we propose a novel Gradient-Based Weight Optimized Ensemble Model (GBWOEM), which is a self-adjusting ensemble learning model in which the weights of individual base models are dynamically assigned and updated via gradient-based optimization. This method assigns more weight to models that make better predictions and reduces the impact of weaker models to improve the overall performance of the ensemble. The model attains high sensitivity, which enables the minority class to be predicted more accurately. In our empirical evaluation, two major variants of the model were considered, GBWOEM-R and GBWOEM-U, to study and present a performance comparison with existing models. The key contributions of this study are as follows:

• Developed the Gradient-Based Weight Optimized Ensemble Model (GBWOEM) with two different weight initialization strategies (GBWOEM-R and GBWOEM-U), a novel way to dynamically optimize the weights of the base models in an ensemble.

• Introduced a log-based loss function with a small constant ε to ensure numerical stability and refine the model's performance on imbalanced datasets.

• Conducted experiments on five diverse datasets from the healthcare field, showing that the proposed GBWOEM consistently outperforms popular ensemble models such as Adaboost, Catboost, XGBoost, LightGBM and Gradient Boosting across all cases.

• Showed that the model is robust and adaptable to varying data contexts, with improvements in test accuracy on each of the datasets ranging from 0.48% to 8.26%.

• Evaluated the relative weights assigned to individual base models in ensemble combinations, which gives useful information about how each base model contributes to the final prediction.

Related work

Several studies have demonstrated that ensemble models are effective in a range of medical classification tasks. Younas et al. ⁸ used a weighted average ensemble technique combining GoogleNet and ResNet-50 to classify colorectal polyps using an augmented dataset (Gastrointestinal Lesions in Regular Colonoscopy and PICCOLO). Their ensemble model outperformed the base models and some other CNN-based deep neural networks, such as Inception-v3, Xception, DenseNet-20, and SqueezeNet. Bhuiyan and Islam ⁹ used weighted average and maximum voting ensemble techniques to combine VGG16, VGG19, and DenseNet201 for malaria classification from red blood cell images. The authors achieved improved performance over weight-based ensemble models and different CNN-ML classifiers. Marques et al. ¹⁰ proposed a cross-validation-based ensemble model using EfficientNetB0, averaging predictions across folds, for malaria detection, and achieved superior results compared to other researchers. Ali et al. ¹¹ proposed a bagging-based ensemble model using a DNN to predict heart disease. The results of different networks were combined using LogitBoost, which achieved better performance than Support Vector Machine (SVM), Logistic Regression (LR), Multi-Layer Perceptron (MLP), Random Forest Classifier (RFC), etc. Dutta et al. ¹² introduced a weighted average-based ensemble with models such as Gaussian Naïve Bayes (GNB), Decision Tree (DT), XGBoost (XGB), Random Forest (RF), and LightGBM (LGB) for early diabetes prediction, although with limited accuracy.

A boosting-based ensemble model was proposed by Ihnaini et al. ¹³ to predict diabetes, where trees were used as weak learners, outperforming LR, NB, RF, K-Nearest Neighbour (KNN), DT, and SVM. Reddy et al. ¹⁴ used a voting-based ensemble model combining LR, KNN, RF, DT, and AdaBoost for diabetic retinopathy classification, which outperformed the base models. Habib and Tasnim ¹⁵ used the hard voting technique to form an ensemble model by combining LR, NB, RF, and MLP to classify cardiovascular diseases and outperformed the base models. For brain tumor classification, Al Amin et al. ¹⁶ used majority voting over ResNet-50, DenseNet121, InceptionV3, VGG19, and VGG16, achieving a higher validation accuracy. El-Sappagh et al. ¹⁷ used different ensemble strategies such as majority voting, weighted majority voting, and stacking on top of different base models such as SVM, MLP, RF, DT, KNN, LR, and XGB for Alzheimer's disease classification, concluding that stacking with XGBoost is most effective. De Souza et al. ¹⁸ applied stacking with CNN, LSTM, and CNN-LSTM to anxiety classification. For tuberculosis classification, Osamor and Okezie ¹⁹ used a weighted voting ensemble of NB and SVM with PCA and RFE-CV feature selection, achieving notable accuracy despite its simplicity.

Using a federated learning-based setup, Subashchandrabose et al. ²⁰ proposed a decentralized ensemble model for lung cancer classification and compared its performance with other ML models in both centralized and decentralized architectures, concluding that the proposed model works better in decentralized architecture. Abbas et al. ²¹ improved lung cancer classification using a weighted federated ensemble, optimizing the weights of the clients' ANNs using the Levenberg–Marquardt and Bayesian regularization techniques; their weighted sum was used in the server model for the final classification. Kotei and Thirunavukarasu ²² proposed a stacking-based ensemble of nine pre-trained CNNs for tuberculosis classification. Despite its high accuracy, the model size and resource utilization are significant. Regarding diabetes classification, Prakash et al. ²³ used hard voting with an ANN, RNN, DBN, Perceptron, and RDF, finding it better than Bagging, Boosting, and Stacking. EL-Rashidy et al. ²⁴ proposed a stacking-based ensemble for the classification of mortality using KNN, MLP, LDA, DT, and LR as base models and LR as a meta-learner, showing better results than other ensembles.

For Covid-19 classification based on chest X-ray images, Rajaraman et al. ²⁵ proposed an ensemble model using CNN and ImageNet and found that the weighted average is the most efficient. For TB classification, Rajaraman and Antani ²⁶ proposed a CNN-based stacking ensemble using pre-trained models, such as Inception-v3, CNN, VGG-16, InceptionResNet-V2, Xception, and Densenet-121. The model achieved good accuracy at the cost of an increased model size. Juraev et al. ²⁷ assessed different static and dynamic ensemble strategies and concluded that the DESKNN strategy yielded the best results when classic ML models were used as the base models. Anand et al. ²⁸ used a weighted average-based ensemble for the classification of brain tumors using VGG19 and variants of CNN, initializing weights using grid search and achieving better performance than base models.

Through the survey shown in Table 1, a significant difference in the usage of base models based on the type of data was observed. Researchers have mostly used statistical or other traditional models as base models when working with tabular data, where the data are arranged in rows and columns, such as medical records or diagnostic metrics, whereas CNNs are typically employed as base models for image datasets. The use of CNN-based models results in an increase in the complexity and computational footprint compared with other ensemble models. A range of ensemble techniques, such as averaging, aggregation, weighted averaging, voting, weighted voting, boosting, and bagging, have been used to integrate the results of diverse base models. Of all the mentioned techniques, weighted averaging is one of the most commonly used techniques for combining results owing to its effectiveness in improving ensemble performance.

Table 1. Literature review.

Author	Disease	Models	Ensemble technique	Accuracy (%)	Observation
Dutta et al. ¹²	Diabetes	GNB, BNB, RF, DT, XGB, LGB	Weighted Average	73.5	The AUC of the base models is taken as weight.
Al Amin et al. ¹⁶	Brain Tumor	ResNet-50, DenseNet121, InceptionV3, VGG19, VGG16	Voting	98	Proposed model outperforms base models.
Habib and Tasnim ¹⁵	Cardiovascular Disease	LR, GNB, RF, MLP	Voting	88.42	Proposed model outperforms base models.
Ihnaini et al. ¹³	Diabetes	Trees	Boosting	99.6	Proposed ensemble outperforms existing ML models.
Kotei and Thirunavukarasu ²²	Tuberculosis	VGG16, VGG19, InceptionV2, MobileNet, Xception, Densenet, EfficientNetB1, Resnet50, InceptionV3, CNN	Stacking	98.38	Too many CNN-based models increase the complexity and computational footprint.
Younas et al. ⁸	Colorectal Cancer	GoogleNet, ResNet-50	Weighted Average	96.3	Grid search is used for weight initialization.
Ali et al. ¹¹	Heart Disease	DNN, LogitBoost	Weighted Average	98.5	All the experimental models gave their respective highest accuracy at the same feature count.
Juraev et al. ²⁷	Mortality	DT, LR, Linear SVR, KNN, Ridge, Lasso, CB, XGB, RF, GB, LGBM	Voting	98.7	Using traditional ensemble models as base models gives better performance.
Reddy et al. ¹⁴	Diabetic Retinopathy	LR, DT, KNN, RF, Adaboost	Voting	82	Proposed model outperforms base models.
Marques et al. ¹⁰	Malaria	EfficientNetB0	Averaging	98.29	Average of 10-fold cross-validation is used.
Bhuiyan and Islam ⁹	Malaria	VGG16, VGG19, DenseNet201	Weighted Average, Max Voting	97.92	Proposed model outperforms base models.
EL-Rashidy et al. ²⁴	Mortality	KNN, MLP, LDA, DT, LR	Stacking	94.4	Performs better than existing ensembles.
Prakash et al. ²³	Diabetes	ANN, RNN, DBN, Perceptron, RDF	Voting	92	Voting gives better results than the other ensemble techniques.
Abbas et al. ²¹	Lung Cancer	ANN	Weighted Sum	96.3	Model works better only in distributed system.
El-Sappagh et al. ¹⁷	Alzheimer’s disease	SVM, MLP, RF, DT, KNN, LR, XGB	Majority voting, Weighted majority voting, Stacking	89.15	Stacking with XGB performs better than the other setups.
Rajaraman and Antani ²⁶	Tuberculosis	Xception, Densenet-121, CNN, VGG-16, Inception-v3, InceptionResNet-V2	Stacking	94.1	So many pre-trained models increase the computational footprint.
Rajaraman et al. ²⁵	Covid-19	CNN, ImageNet	Weighted Average	99.01	Weighted average with pruned CNN model performs better.
Subashchandrabose et al. ²⁰	Lung Cancer	NN	-	89.63	The centralized approach gave a better result than the decentralized approach.
De Souza et al. ¹⁸	Anxiety	CNN, LSTM	Stacking		A blend of CNN, LSTM, and CNN-LSTM with stacking gave less error than others.
Anand et al. ²⁸	Brain Tumor	VGG19, CNN	Weighted Average	98	Grid search is used for weight initialization.
Osamor and Okezie ¹⁹	Tuberculosis	NB, SVM	Weighted Voting	96	Simple ensemble model with better performance.

Proposed methodology

A new Gradient-Based Weight Optimized Ensemble Model (GBWOEM), which uses a variety of base models to improve prediction performance, has been introduced. Figure 1 provides a full overview of the GBWOEM’s design. A dataset is first obtained from the UCI repository, and then extensive Exploratory Data Analysis (EDA), pre-processing, data transformation, and normalization are performed to ensure that the data are ready for modelling. To enable extensive model evaluation and performance assessment, the dataset was divided into training, validation, and testing sets. We chose five diverse base models for the ensemble model, each representing a different modelling approach: Logistic Regression (LR) for statistical strength, Decision Tree Classifier (DTC) for interpretability, Random Forest Classifier (RFC) for robustness and ensemble capabilities, Multi-Layer Perceptron (MLP) for deep learning, and K-Nearest Neighbours (KNN) for distance-based methodology.

Figure 1. Proposed methodology.

Grid search is used to tune the hyperparameters of each of these base models, and K-fold cross-validation is used to promote generalization across different subsets of the training data. The end product of this step is five separately optimized models. Taking these five models as input, the GBWOEM iteratively selects combinations of base models ($\binom{5}{2}$, $\binom{5}{3}$, $\binom{5}{4}$, and $\binom{5}{5}$) to form Weighted Average Models (WAMs). Using the validation dataset, the GBWOEM adjusts the weights of the chosen base models during each iteration to ascertain their respective contributions to the ensemble. The effectiveness of the ensemble is then determined by evaluating its performance on a test dataset. Choosing subsets of two or more of the five base models yields $10 + 10 + 5 + 1 = 26$ candidate combinations. By assessing the performance of every combination on the test set, GBWOEM determines the ensemble configuration that gives the most accurate predictions. The final model, which may combine statistical, distance-based, tree-based, and neural network techniques, is the candidate combination with the best predictive performance.
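The count of 26 candidate ensembles can be checked with a short enumeration. This is a minimal sketch: the function name `candidate_subsets` and the model labels are illustrative, not taken from the authors' code.

```python
from itertools import combinations

# Labels for the five tuned base models (names are illustrative).
BASE_MODELS = ["LR", "DTC", "RFC", "MLP", "KNN"]


def candidate_subsets(models, min_size=2):
    """Return every subset of `models` with at least `min_size` members."""
    subsets = []
    for k in range(min_size, len(models) + 1):
        subsets.extend(combinations(models, k))
    return subsets


subsets = candidate_subsets(BASE_MODELS)
print(len(subsets))  # 10 + 10 + 5 + 1 = 26
```

Each tuple in `subsets` would then be passed to the weight optimization step as one candidate ensemble.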

Datasets

While developing a model, it is important to evaluate its performance on a variety of datasets. The datasets used here represent different disease domains such as diabetes, cancer, thyroid disease, and obesity. Each dataset has distinct properties, including differences in feature types, noise level, and distribution. Testing the model's performance across these broad domains helps determine its ability to generalize to new data. Using datasets with diverse qualities and dimensions makes it possible to evaluate the model's robustness and to determine whether it performs consistently well or is affected by the nature of the data. It can also reveal areas where the model can be improved, such as handling imbalanced data or dealing with noise. The analysis of performance inconsistencies can aid in model adjustment and fine-tuning. Information regarding the datasets used in this experiment is presented in Table 2.

Table 2. Information of the dataset.

Name of the datasets	Number of instances	Number of features	Target variable	Data distribution	Missing value present?	Feature type
Breast Cancer Dataset (BC DS) ²⁹	569	32	Diagnosis Malignant/M = 1 Benign/B = 0	1: 37.3% 0: 62.7%	No	Numerical
Diabetic Retinopathy Debrecen (DRD DS) ³⁰	1151	20	Diabetic Retinopathy (1, 0)	1: 52.9% 0: 47.1%	No	Numerical
Pima Indians Diabetes Database (PID DS) ³¹	768	9	Outcome (1, 0)	1: 34.9% 0: 65.1%	Yes	Numerical
Obesity Dataset (OL DS) ³²	2111	17	Obesity Level Normal (0), Obesity (1)	1: 46.6% 0: 53.4%	No	Numerical, Categorical
Thyroid Disease (TH DS) ³³	3772	30	Thyroid Disease (1, 0)	1: 91.8% 0: 8.2%	Yes	Numerical, Categorical

Base models

An ensemble model in machine learning makes predictions by combining several “base learners” or “base estimators”, each of which performs a classification or prediction task. In our proposed work, the GBWOEM is built from the Decision Tree Classifier (DTC), Logistic Regression (LR), Random Forest Classifier (RFC), K-Nearest Neighbour (KNN), and Multi-Layer Perceptron (MLP).

LR is a widely used empirical model in clinical analyses. It serves multiple purposes, including classification and feature selection, and as a meta-learner in ensemble models. ³⁴⁻³⁶ LR is a supervised ML algorithm whose primary application is binary classification. It evaluates the relationship between one or more independent variables and the outcome and categorizes data into distinct classes. Decision trees are used in complex decision-making processes or to predict patient outcomes based on features from large datasets. ³⁷⁻³⁹ The trees split the data on feature values and choose the best split with respect to a criterion such as Gini impurity, $G = 1 - \sum_{i=1}^{n} p_i^2$, or information gain, $IG = H(Y) - H(Y \mid X)$, where $H$ is the entropy and $p_i$ is the probability of class $i$. The k-NN algorithm assigns a data point the label of the majority class among its $k$ nearest neighbours, where $k$ is a user-specified hyperparameter. It finds these neighbours using a distance metric such as the Euclidean distance, $d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$. ⁴⁰ Methods such as bagging and boosting can be used with k-NN to develop more robust models that reduce noise sensitivity and improve accuracy. ⁴¹ Random forests combine multiple decision trees trained on bootstrap samples with a random subset of features, ⁴² significantly reducing overfitting. ⁴³ In healthcare, they can be used for tasks such as disease classification, risk prediction, and patient outcome forecasting. MLPs are deep learning models capable of performing complicated classification and regression tasks because of their ability to represent complex non-linear relationships using multiple layers of neurons. ⁴⁴ The network learns by minimizing a loss function using backpropagation and gradient descent; for classification tasks the loss might be the cross-entropy, $L = -\sum_{i=1}^{n} y_i \log(\hat{y}_i)$, and for regression tasks it might be the MSE. The activation functions used in the hidden layers, such as ReLU or sigmoid, introduce non-linearity, ⁴⁵ allowing MLPs to pick up complex patterns in the data.

LR, DT, kNN, RF, and MLP were chosen for our ensemble model because they complement one another. While LR models can only capture linear relationships, DTs describe non-linear interactions and variable importance. RFs combine many decision trees and are designed to generalize well over a wide range of datasets. MLPs can capture highly non-linear relationships in large datasets with relatively complex feature representations. KNN learns the local structure and variance, leading to fine-grained predictions based on data closeness.

GBWOEM

The Gradient-Based Weight Optimized Ensemble Model (GBWOEM) is a weighted average-based ensemble model that integrates five diverse base models: LR, DT, KNN, RF, and MLP. Figure 2 shows a flowchart of the algorithm. The ensemble approach harnesses the strengths of each base model, and the final prediction is calculated by averaging the outputs with appropriate weights. GBWOEM assigns and updates weights to each base model in a systematic manner based on the performance of the base models. The entire process begins by training the base models. Every base model is trained rigorously using a grid search CV along with k-folds to extract the best model out of all variations. The model with the best accuracy among the trained models is selected as the initial candidate that will undergo optimization through the GBWOEM algorithm.

Figure 2. Process flow of GBWOEM.

The first step in weight optimization is to initialize the weights of the base models in one of two modes. In GBWOEM-R, the weights are randomly initialized and normalized so that they sum to one. In GBWOEM-U, the weights are initialized uniformly. After the weight assignment, the other hyperparameters, including the learning rate, the number of iterations, and a patience parameter for early stopping, are initialized. With these initial weights, the ensemble produces its first predictions and is evaluated using a loss function, giving both training and validation losses. When the early stopping condition is met, that is, when the validation loss has not improved within a certain number of iterations, training is terminated to avoid overfitting. If the stopping condition is not satisfied, the next phase calculates the gradient of the loss function with respect to the weights of the base models. The weights are then updated by gradient descent to reduce the overall ensemble loss. This process continues until the maximum number of iterations is reached or the early stopping condition is met. The process is described in Algorithm 1.
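The two initialization modes can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `init_weights` and the `seed` argument are assumptions added for reproducibility.

```python
import random


def init_weights(n, method, seed=None):
    """Return n starting weights for the base models.

    'random'  -> GBWOEM-R: uniform draws from [0, 1), normalized to sum to 1
    'uniform' -> GBWOEM-U: every model starts with weight 1/n
    """
    if method == "random":
        rng = random.Random(seed)  # seed is an illustrative addition
        raw = [rng.random() for _ in range(n)]
        total = sum(raw)
        return [w / total for w in raw]
    if method == "uniform":
        return [1.0 / n] * n
    raise ValueError(f"unknown weight initializer: {method!r}")


w_random = init_weights(5, "random", seed=0)
w_uniform = init_weights(5, "uniform")
```

Both modes start from weights that sum to one; the gradient updates that follow are not constrained to keep that property.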

Algorithm 1. Gradient Based Weight Optimized Ensemble Model (GBWOEM).

Input:
- Learning rate ($\alpha$)
- Number of iterations (num_iterations)
- Patience for early stopping (patience)
- Weight initializer method (weight_initializer)

Output:
- Trained ensemble model with optimized weights

Parameters:
- Number of base models: $n$
- Base models: $\{M_1, M_2, \dots, M_n\}$
- Base-model predictions on training set: $\{\hat{Y}_{train}^{(1)}, \hat{Y}_{train}^{(2)}, \dots, \hat{Y}_{train}^{(n)}\}$
- Base-model predictions on validation set: $\{\hat{Y}_{val}^{(1)}, \hat{Y}_{val}^{(2)}, \dots, \hat{Y}_{val}^{(n)}\}$
- Base-model predictions on test set: $\{\hat{Y}_{test}^{(1)}, \hat{Y}_{test}^{(2)}, \dots, \hat{Y}_{test}^{(n)}\}$
- Weights: $W = \{w_1, w_2, \dots, w_n\}$

1. If weight_initializer == 'random':
2.     W = [n random numbers sampled uniformly from [0, 1)]
3.     $W_{normalized} = W / \sum W$
4.     W = W_normalized
5. Else If weight_initializer == 'uniform':
6.     W = ones_array / n
7. Else:
8.     Raise ValueError
9. For each base model $M_i$:
10.     Fit $M_i$ on training data $(X_{train}, Y_{train})$
11.     Predict probabilities for $X_{train}$, $X_{val}$, $X_{test}$
12.     Store predictions $\hat{Y}_{train}^{(i)}$, $\hat{Y}_{val}^{(i)}$, $\hat{Y}_{test}^{(i)}$
13. For iteration t in range(num_iterations):
14.     Compute ensemble predictions using current weights:
15.     $\hat{Y}_{ensemble,train}^{(t)} = \sum_{i=1}^{n} w_i^{(t)} \cdot \hat{Y}_{train}^{(i)}$
16.     $\hat{Y}_{ensemble,val}^{(t)} = \sum_{i=1}^{n} w_i^{(t)} \cdot \hat{Y}_{val}^{(i)}$
17.     $\hat{Y}_{ensemble,test}^{(t)} = \sum_{i=1}^{n} w_i^{(t)} \cdot \hat{Y}_{test}^{(i)}$
18.     Compute training and validation loss:
19.     $loss_{train}^{(t)} = \text{Loss\_Function}(y_{train}, \hat{Y}_{ensemble,train}^{(t)})$
20.     $loss_{val}^{(t)} = \text{Loss\_Function}(y_{val}, \hat{Y}_{ensemble,val}^{(t)})$
21.     Early stopping:
22.     If $loss_{val}^{(t)}$ < best_loss:
23.         best_loss = $loss_{val}^{(t)}$, best_weights = current_weights, count = 0
24.     Else:
25.         count = count + 1
26.         If count >= patience: Break
27.     Compute gradient of loss w.r.t. weights:
28.     $\nabla L = \frac{1}{N} \cdot (\hat{Y}_{ensemble,train}^{(t)} - y_{train}) \cdot \hat{Y}_{train}^{T}$
29.     Update weights using gradient descent:
30.     current_weights = current_weights - $\alpha \cdot \nabla L$
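The optimization loop of Algorithm 1 can be sketched in plain Python as follows. This is a hedged illustration rather than the authors' code: the function names, default hyperparameters, and toy probabilities are invented, the gradient is the one written in step 28, and the clipping of blended probabilities to [0, 1] is an added assumption, since the updated weights are not constrained.

```python
import math

EPS = 1e-10  # stabilizing constant quoted in the text


def blend(weights, preds):
    """Weighted sum of base-model probabilities; preds[i][k] is model i, sample k."""
    return [sum(w * p[k] for w, p in zip(weights, preds))
            for k in range(len(preds[0]))]


def bce(y_true, y_prob):
    """Mean binary cross-entropy with the epsilon term."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, 0.0), 1.0)  # assumption: clip, as weights are unconstrained
        total -= y * math.log(p + EPS) + (1 - y) * math.log(1 - p + EPS)
    return total / len(y_true)


def optimise_weights(weights, train_preds, y_train, val_preds, y_val,
                     lr=0.1, num_iterations=200, patience=10):
    """Gradient descent on the ensemble weights with early stopping."""
    current = list(weights)
    best_weights, best_loss, count = list(current), float("inf"), 0
    n = len(y_train)
    for _ in range(num_iterations):
        ens_train = blend(current, train_preds)
        val_loss = bce(y_val, blend(current, val_preds))
        if val_loss < best_loss:  # keep the best weights seen so far
            best_loss, best_weights, count = val_loss, list(current), 0
        else:
            count += 1
            if count >= patience:
                break
        # Step 28: (1/N) * (ensemble prediction - y) dotted with each model's output
        grad = [sum((e - y) * p_k for e, y, p_k in zip(ens_train, y_train, p)) / n
                for p in train_preds]
        current = [w - lr * g for w, g in zip(current, grad)]
    return best_weights, best_loss


# Toy probabilities: model 0 is informative, model 1 is close to chance.
y_train = [1, 0, 1, 0]
train_preds = [[0.9, 0.1, 0.8, 0.2], [0.4, 0.6, 0.5, 0.5]]
y_val = [1, 0]
val_preds = [[0.85, 0.15], [0.5, 0.5]]
best_w, best_loss = optimise_weights([0.5, 0.5], train_preds, y_train,
                                     val_preds, y_val)
```

On this toy input the informative model ends with the larger weight, which is the behaviour the text describes: better-performing base models gain influence while weaker ones lose it.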

Gradient descent is the key component of our proposed algorithm, which optimizes the ensemble model by minimizing the custom loss function. By penalizing inaccurate predictions more severely, particularly when the model is overconfident, the binary-cross-entropy-based loss function (Equation 1) ensures that the model can handle classification tasks effectively.

$$\text{loss}_i = -\left( y_{\text{true},i} \cdot \log(y_{\text{pred},i} + \varepsilon) + (1 - y_{\text{true},i}) \cdot \log(1 - y_{\text{pred},i} + \varepsilon) \right) \tag{1}$$

A small constant $\varepsilon = 1 \times 10^{-10}$ is introduced to ensure numerical stability in the logarithmic calculations when the prediction probability approaches 0 or 1. Gradient descent iteratively updates the weights of the base models by calculating the direction of the steepest descent of the loss function. By minimizing the average loss, which is the average binary cross-entropy across all data points (Equation 2), gradient descent allows the model to adjust its predictions in a way that maximizes accuracy while balancing the contributions of each base model in the ensemble.

$$\text{loss} = \frac{1}{N} \sum_{i=1}^{N} \text{loss}_i \tag{2}$$
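The role of ε can be seen in a few lines of Python. This sketch assumes the per-sample loss is the negative log-likelihood, as is standard for binary cross-entropy; the function names are illustrative.

```python
import math

EPS = 1e-10  # the constant quoted in the text


def sample_loss(y_true, y_pred, eps=EPS):
    """Per-sample binary cross-entropy with the stabilizing epsilon."""
    return -(y_true * math.log(y_pred + eps)
             + (1 - y_true) * math.log(1 - y_pred + eps))


def mean_loss(y_true, y_pred, eps=EPS):
    """Average of the per-sample losses over all data points."""
    losses = [sample_loss(y, p, eps) for y, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)


# A confident but wrong prediction: the true label is 1, predicted probability is 0.
worst_case = sample_loss(1, 0.0)
batch = mean_loss([1, 0, 1], [0.9, 0.2, 0.6])
```

Without ε the worst case would call `math.log(0)`, which raises a `ValueError` in Python; with ε the loss stays finite (about 23.03) while still penalizing the overconfident error heavily.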

This iterative procedure continues until the loss converges or training is stopped, so that the ensemble weights are tuned toward their best validation performance. The effectiveness of the algorithm also depends strongly on the learning rate, early stopping (patience), and the aggregation strategy via weighted averages. The learning rate is a critical factor in gradient descent optimization: it sets the trade-off between overshooting and slow convergence. Early stopping halts training when the model's validation performance stops improving, which prevents overfitting. The “patience” parameter controls the number of iterations the algorithm waits without improvement before stopping. Finally, the aggregation step uses weighted averaging with the weights that best increase the overall predictive power of the ensemble.

Evaluation metrics

In the context of imbalanced datasets, it is important to look beyond accuracy to understand performance comprehensively. Accuracy measures the proportion of correct predictions, but when the classes have different numbers of records, it can lead to misleading conclusions. Precision is the percentage of the positive predictions made by the model that are correct, so it indicates how well the model avoids false positives. Recall, or sensitivity, measures how many of the actual positives the model identifies and is important when attempting to capture a minority class. The F1-score is the harmonic mean of precision and recall, which prevents either false positives or false negatives from dominating the evaluation. An ROC curve shows the trade-off between true-positive and false-positive rates at various thresholds, which aids in understanding the discriminative ability of a model. The AUC is the area under the ROC curve and summarizes the capability of the model to distinguish between positive and negative classes. Given the imbalanced nature of our datasets, these metrics are crucial for ensuring a more nuanced and reliable assessment of the model performance. The formulas used to compute the performance metrics are given in Table 3.

Table 3. Performance metrics and their mathematical notations.

Performance metric	Mathematical notation
Accuracy (Acc)	$\frac{1}{n}\sum_{i=1}^{n} \mathbb{1}(y_i = \hat{y}_i)$
Precision (P)	$\frac{\sum_{i=1}^{n} \mathbb{1}(y_i = 1 \land \hat{y}_i = 1)}{\sum_{i=1}^{n} \mathbb{1}(\hat{y}_i = 1)}$
Recall (R)	$\frac{\sum_{i=1}^{n} \mathbb{1}(y_i = 1 \land \hat{y}_i = 1)}{\sum_{i=1}^{n} \mathbb{1}(y_i = 1)}$
F1-Score	$\frac{2 \times P \times R}{P + R}$
Area Under Curve (AUC)	$\mathrm{AUC} \approx \sum_{i=1}^{n-1} \frac{(\mathrm{TPR}_i + \mathrm{TPR}_{i+1}) \times (\mathrm{FPR}_{i+1} - \mathrm{FPR}_i)}{2}$

Note: n is the total number of observations in the dataset, $y_i$ and $\hat{y}_i$ are the true and predicted labels for the $i$-th sample, respectively, and $\mathbb{1}(\cdot)$ is the indicator function.
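The formulas in Table 3 can be checked on a small example. The sketch below implements them directly in plain Python. The helper names, the toy labels and scores, and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration, not part of the study's code.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    # Assumed convention: 0.0 when no positive predictions exist.
    return tp / (tp + fp) if tp + fp else 0.0


def recall(y_true, y_pred):
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    # Assumed convention: 0.0 when no actual positives exist.
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(y_true, y_pred):
    p, r = precision(y_true, y_pred), recall(y_true, y_pred)
    return 2 * p * r / (p + r) if p + r else 0.0


def roc_points(y_true, scores):
    """FPR and TPR at each distinct score threshold, starting from (0, 0)."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    fpr, tpr = [0.0], [0.0]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Trapezoidal approximation of the area under the ROC curve."""
    return sum((tpr[i] + tpr[i + 1]) * (fpr[i + 1] - fpr[i]) / 2
               for i in range(len(fpr) - 1))


# Toy example (hypothetical labels, predictions, and scores).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
scores = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]

fpr, tpr = roc_points(y_true, scores)
auc = trapezoid_auc(fpr, tpr)
```

Note that in the AUC sum the index runs over consecutive points of the ROC curve, whereas in the other formulas it runs over samples.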

Experimental results

This section presents the experimental results of our ensemble model and analyses its performance in depth over several datasets. The effectiveness of GBWOEM is evaluated for both variants, GBWOEM-R (random initialization) and GBWOEM-U (uniform initialization), on five datasets with various dimensionalities and characteristics. Given the diversity of the base models used in the ensemble (LR, DTC, KNN, RFC, and MLP), we methodically assessed combinations of two, three, four, and all five base models. We quantified the quality of these results using standard metrics and paid particular attention to how the different weight initialization strategies affect the generalizability of an ensemble across datasets with different dimensionalities.
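As a brief illustration of the size of this search, the combinations of two to five base models can be enumerated with the standard library. The function name `model_subsets` is hypothetical; only the five model abbreviations come from the text.

```python
from itertools import combinations

BASE_MODELS = ["LR", "DTC", "KNN", "RFC", "MLP"]


def model_subsets(models, min_size=2):
    """All combinations containing at least `min_size` base models."""
    return [subset
            for size in range(min_size, len(models) + 1)
            for subset in combinations(models, size)]


subsets = model_subsets(BASE_MODELS)

# Number of combinations per size: 5C2, 5C3, 5C4, 5C5.
counts = {size: sum(1 for s in subsets if len(s) == size)
          for size in range(2, 6)}
```

This yields 10 pairs, 10 triples, 5 quadruples, and 1 full ensemble, that is, 26 combinations per variant and dataset.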

Breast cancer dataset

This dataset has 369 entries and 31 features in total, derived from fine-needle aspirate (FNA) images of breast tumors, with diagnosis as the target variable. A performance analysis of both variants of GBWOEM is presented in Figure 3. In GBWOEM-R, the LR (0.058) and RFC (1.002) pair achieves the highest test accuracy and AUC values, with RFC having a significantly higher weight, indicating its stronger contribution. In GBWOEM-U, although the LR (0.5) and DTC (0.505) pair gave the highest accuracy, the LR (0.5) and MLP (0.501) pair performs best when both accuracy and AUC are considered. For both pairs, the weights of the base models are almost equal, suggesting a more balanced contribution from each base model. Pairs such as (LR, DTC), (LR, RFC), and (LR, KNN) perform consistently across both variants, underlining their robustness. However, in both variants, increasing the number of base models led to overfitting: the training accuracy rose to 100%, but the test accuracy fell. This highlights the importance of careful base model selection and weight optimization to avoid overfitting when scaling up the ensemble.

Figure 3. Performance metrics of GBWOEM (R&U) on Breast Cancer Dataset.

Pima Indians diabetes database

This is a common dataset for research on diabetes and machine learning. It contains eight medical predictor variables, one target variable, and 768 entries. During data pre-processing, we noticed that some columns contain invalid 0’s, which were replaced with the min or median of those columns. The performances of both variants of our proposed ensemble model are presented in Figure 4. In GBWOEM-R, the (LR, RFC) (0.695, 1.0) and (DTC, RFC, KNN, MLP) (0.102, 0.727, 0.145, 0.027) combinations showed the highest accuracy, but (LR, RFC) achieved a higher AUC, with RFC having the highest weight and contributing most to the final ensemble result. In GBWOEM-U, the combination of all five base models yielded the highest accuracy but a low AUC value. Again, (LR, RFC) (0.521, 1.102) achieves a balance of accuracy and AUC, with RFC dominating.
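The zero-replacement step can be sketched as follows for the median case. This is a minimal illustration under two assumptions that the text does not state: the median is computed over the non-zero entries only, and a zero is invalid for the column in question. The function name and the sample values are hypothetical.

```python
from statistics import median


def replace_invalid_zeros(values):
    """Replace zeros with the median of the non-zero entries (assumed rule)."""
    valid = [v for v in values if v != 0]
    if not valid:
        return list(values)  # nothing to impute from
    fill = median(valid)
    return [fill if v == 0 else v for v in values]


# Hypothetical column in which 0 cannot be a real measurement.
column = [72, 0, 66, 64, 0, 80]
cleaned = replace_invalid_zeros(column)
```

Computing the fill value from the valid entries only keeps the invalid zeros from pulling the median downward.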

Figure 4. Performance metrics of GBWOEM (R&U) on Pima Indians Diabetes Database.

Diabetic Retinopathy Debrecen Database

The Diabetic Retinopathy Debrecen Database, comprising 1,151 instances and 19 features, helps predict diabetic retinopathy using image-driven features. The performances of GBWOEM-R and GBWOEM-U are presented in Figure 5. In GBWOEM-R, (LR, MLP) (0.877, 0.542), (LR, RFC, MLP) (0.341, 0.433, 0.346), and (LR, KNN, MLP) (0.322, 0.25, 0.515) achieved the highest accuracy, whereas (LR, KNN, MLP) had the best AUC. The final weights show that every base model in these combinations contributes substantially to the final result. The GBWOEM-U variant achieved slightly higher accuracy than GBWOEM-R but had a lower AUC value. In this variant, (KNN, MLP) (0.501, 0.501) showed higher accuracy, with each model contributing equally. In terms of AUC, however, the (LR, KNN, MLP) combination provides balanced results for both variants.

Figure 5. Performance metrics of GBWOEM (R&U) on Diabetic Retinopathy Database.

Obesity database

This dataset includes 2,111 instances and 17 features related to lifestyle and dietary habits. The target column is divided into several classes, including normal weight, underweight, overweight, and obese (I, II, and III). Because the goal is to detect obesity, we recoded the target variable as binary for our experiment, assigning all obesity levels to class 1 and all other classes to class 0. The performance analyses of GBWOEM-R and GBWOEM-U on this dataset are presented in Figure 6.
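The binary recoding can be sketched in a few lines. The label strings below are assumed to match those of the public UCI release of the dataset and should be checked against the actual file; the function name `to_binary` is hypothetical.

```python
# Assumed label strings from the public dataset; verify before use.
LABELS = [
    "Insufficient_Weight",
    "Normal_Weight",
    "Overweight_Level_I",
    "Overweight_Level_II",
    "Obesity_Type_I",
    "Obesity_Type_II",
    "Obesity_Type_III",
]


def to_binary(label):
    """Map every obesity level to class 1 and every other class to class 0."""
    return 1 if label.startswith("Obesity") else 0


binary_targets = [to_binary(label) for label in LABELS]
```

Under this mapping the overweight levels fall into class 0, in line with the stated goal of detecting obesity rather than excess weight in general.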

Figure 6. Performance metrics of GBWOEM (R&U) on Obesity Database.

Both variants performed exceptionally well, with accuracy and AUC values reaching or exceeding 99% across all combinations. In all GBWOEM-U pairs, the weights remain evenly distributed and require minimal adjustment. Pairs such as (LR, DTC), (LR, KNN), and (LR, RFC) continue to perform well, consistent with their performance on the other datasets. For this dataset, the higher-order combinations (e.g., 5C3-5C5) also perform well in the testing phase, and the training-testing accuracy gap is minimal, suggesting better generalization, possibly due to the nature of the attributes or the binary classification approach we employed.

Thyroid disease

The Hypothyroid Disease dataset includes 3,711 entries and 30 features for the detection of hypothyroidism. Both variants, GBWOEM-R and GBWOEM-U, perform well: almost all combinations achieve accuracy over 93% and AUC over 96%, as presented in Figure 7. Similar to the Obesity dataset, the difference between the training and testing results is minimal, and higher-order combinations also perform well, likely as a result of the large sample size of both datasets. In GBWOEM-R, models such as DTC and RFC contribute more than the other models in their respective combinations. In GBWOEM-U, almost all base models have equal weights in their respective combinations and require minimal weight correction during optimization. This balance suggests a uniform contribution from all models, which may contribute to the strong performance across multiple metrics.

Figure 7. Performance metrics of GBWOEM (R&U) on Thyroid Disease.

Comparison with existing ensemble model

This section compares GBWOEM with existing ensemble models: AdaBoost, CatBoost, GradientBoost, LightGBM, and XGBoost. The comparison helps establish the potential benefits of our method in terms of precision, AUC, and generalization capability. Our model improved test accuracy over the existing ensemble models by between 0.48 and 8.26 percentage points across the five datasets. The specific accuracy gains over the best-performing existing model for each dataset were as follows: Breast Cancer (5.32%), Pima Indians Diabetes Database (2.60%), Diabetic Retinopathy Debrecen (8.26%), Obesity Level estimation based on physical condition and eating habits (0.48%), and Thyroid Disease (2.32%). Although the size of the improvement differs between datasets, the results indicate that GBWOEM can handle varied datasets and tasks efficiently.

Table 4 gives a detailed quantitative comparison of the test accuracy and test AUC of GBWOEM and the existing models. In addition, ROC curves were drawn for each dataset for the five existing ensemble models and the two variants of GBWOEM (GBWOEM-R and GBWOEM-U). ROC curves are particularly useful for imbalanced datasets and help us better understand how a model separates the classes. From Figure 9, we can see that both variants of the proposed GBWOEM distinguish between classes well for all datasets except PID DS. The models show comparable results (80-90%) for performance metrics such as precision, recall, and F1-score, which argues against overfitting and is evident in Figures 3, 6, and 7. A bar chart (Figure 8) is also included to compare the training and test accuracies of the baseline models with those of the proposed GBWOEM variants.

Table 4. Test accuracy and test AUC comparison between GBWOEM and existing ensemble models.

Dataset	Test accuracy						Test AUC
Dataset	AB	CB	GB	LGBM	XGB	GBWOEM	AB	CB	GB	LGBM	XGB	GBWOEM
BC DS	70.43	69.57	71.74	62.61	74.68	80	70.38	69.98	72.14	61.41	69.85	89.65
PID DS	75.32	71.43	74.68	74.03	74.68	77.92	71.2	67.35	70.28	69.78	69.85	83.69
DRD DS	70.43	70.43	69.57	71.74	62.61	80	71.12	70.38	69.98	72.14	61.41	89.65
OL DS	99.28	99.52	99.28	99.52	99.52	100	99.26	99.49	99.26	99.49	99.49	100
TH DS	95.79	94.73	95.54	95.79	94.79	98.11	82.35	87.11	85.94	82.35	84.47	99.74
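The reported accuracy gains can be reproduced from the test-accuracy columns of Table 4 as the difference, in percentage points, between GBWOEM and the best-performing existing model on each dataset. The short sketch below copies the values from the table; the variable and function names are hypothetical.

```python
# Test accuracy (%) from Table 4: AB, CB, GB, LGBM, XGB, then GBWOEM.
TEST_ACCURACY = {
    "BC DS": ([70.43, 69.57, 71.74, 62.61, 74.68], 80.0),
    "PID DS": ([75.32, 71.43, 74.68, 74.03, 74.68], 77.92),
    "DRD DS": ([70.43, 70.43, 69.57, 71.74, 62.61], 80.0),
    "OL DS": ([99.28, 99.52, 99.28, 99.52, 99.52], 100.0),
    "TH DS": ([95.79, 94.73, 95.54, 95.79, 94.79], 98.11),
}


def gain_over_best_baseline(baselines, proposed):
    """Percentage-point gain of the proposed model over the best baseline."""
    return round(proposed - max(baselines), 2)


gains = {name: gain_over_best_baseline(baselines, proposed)
         for name, (baselines, proposed) in TEST_ACCURACY.items()}
```

The resulting gains are 5.32, 2.60, 8.26, 0.48, and 2.32 percentage points for BC DS, PID DS, DRD DS, OL DS, and TH DS, respectively, which matches the figures quoted in the text.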

Figure 8. Comparison of training and testing accuracy between existing and proposed ensemble model for all the datasets.

Figure 9. ROC Plot of existing and proposed GBWOEM (R&U) model (a) BC DS, (b) PID DS, (c) DRD DS, (d) OL DS and (e) TH DS.

Conclusion and future work

In this study, we proposed the Gradient-Based Weight Optimized Ensemble Model (GBWOEM), consisting of two variants, GBWOEM-R (random initialization) and GBWOEM-U (uniform initialization), designed to improve classification performance through dynamic weight optimisation for the LR, DTC, KNN, RFC, and MLP base models. Using the weighted average approach, the model’s weights are treated as real-valued variables and optimized using gradient descent. The model was evaluated on five diverse datasets: Breast Cancer, Pima Indians Diabetes Database, Diabetic Retinopathy Debrecen, Obesity Level estimation based on physical condition and eating habits, and Thyroid Disease, each with unique characteristics and dimensions. We observed significant improvements in test accuracy across all datasets, with gains of 0.48% to 8.26% compared with existing ensemble models, namely AdaBoost, CatBoost, GradientBoost, LightGBM, and XGBoost. GBWOEM achieved its highest increase in accuracy on the Diabetic Retinopathy Debrecen dataset, suggesting that GBWOEM effectively addresses complex, feature-rich datasets. While both GBWOEM variants showed similar functional behaviour, GBWOEM-R favoured certain base models such as RFC through uneven weight distribution, whereas GBWOEM-U distributed weights evenly and delivered more balanced and stable results across the datasets. In addition, the ROC curves and AUC values confirmed the robustness of GBWOEM on various datasets. Notably, increasing the number of base models increases the training accuracy (up to 100% in some cases) but not the test performance, emphasizing the risk of overfitting in ensemble models. The dynamic weight optimization of GBWOEM was shown to be a key strength, allowing flexibility across datasets with different dimensions and class distributions.
Future work aims to incorporate advanced weight optimization techniques, such as adaptive learning rates or metaheuristic approaches, and to test the model on multi-class and large-scale datasets for broader validation while maintaining lower computational complexity.

Data availability

The datasets used in this research are publicly available and can be accessed through the following links and DOIs:

• Diabetic Retinopathy Debrecen (https://archive.ics.uci.edu/dataset/329/diabetic+retinopathy+debrecen; doi:10.24432/C5XP4P)

• Estimation of Obesity Levels Based On Eating Habits and Physical Condition (https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition; doi:10.24432/C5H31Z)

• Thyroid Disease (https://archive.ics.uci.edu/dataset/102/thyroid+disease; doi:10.24432/C5D010)

• The Pima Indians Diabetes Dataset, originally hosted on the UCI ML Repository, is no longer available there. However, it can be accessed via Kaggle at https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database.

Reviewer response for version 2

Referee: Anjum John (https://orcid.org/0000-0002-9873-8550), Pushpagiri Medical College Hospital, Thiruvalla, Kerala, India

Report DOI: 10.5256/f1000research.195371.r458348

Date: 14 February 2026

Competing interests: No competing interests were disclosed.

This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Recommendation: Approved with reservations

Thank you for allowing me to review this paper, which is technically sound and is indexable material. There is novelty and a domain of technical contribution, with acceptable experimental benchmarking, but there are a few concerns about its theoretical depth, clinical translation, and external validation.

Major Strengths

1. Novel Methodological Idea: The core idea — gradient-based dynamic weight optimization across heterogeneous base models — is relevant. The model’s combination of LR, DT, RF, MLP, KNN uses gradient descent to optimize ensemble weights dynamically and shows measurable performance gain (0.48–8.26%) vs standard ensembles. This novelty in applied ML is indexable though theoretical novelty may be insufficient.

2. Multi-Dataset Evaluation: Five healthcare datasets across domains were used, which improves perceived generalizability.

3. Proper ML Pipeline Elements: The pipeline included grid search CV, train/validation/test splits, early stopping, and multiple metrics (AUC, F1, precision, recall), which meets baseline reproducibility expectations.

4. Transparency Elements: Because the public datasets used are listed and the algorithm pseudocode is provided, there is an element of transparency.

Major Concerns

1. Weak Theoretical Foundation: Although the model uses gradient descent, its convergence behaviour, stability conditions, complexity scaling, and sensitivity to hyperparameters are not explained in detail. Other reviewers have also flagged the lack of a detailed theoretical rationale for the use of this model. Clarity on the theoretical model is necessary to deepen understanding of the topic, given that it is new. Suggested additions are a discussion of convexity, empirical convergence plots, and sensitivity analyses for learning rate, patience, and finalization.

2. Interpretability: As it stands, the manuscript shows weight values without demonstrating their clinical meaning. Previous reviews have stated that the study focuses on accuracy but lacks interpretability; add SHAP/LIME. Healthcare ML publications require transparency in contributions, clinician usability, and decision pathway explanations. Please provide at least one of the following: SHAP global + local explanations, feature importance stability across folds, or case-based explanation examples.

3. External Validation Missing: For clinical ML credibility after publication, at least one of external dataset testing, nested CV, or leave-one-dataset-out validation might be performed.

4. Possible Overfitting Signals: The authors show 100% training accuracy in some models with test performance dropping when model complexity increases. The authors can provide learning curves, calibration plots, or bootstrapped CI for AUC.

5. Class Imbalance Handling Is Not Methodologically Strong: The authors rely mostly on the natural distribution, avoid SMOTE, and argue that performance metrics are stable. PR-AUC reporting, cost-sensitive learning, and threshold optimization are not provided.

6. Clinical Relevance Is Mostly Theoretical: The paper claims that this method improves patient outcomes and diagnostic consistency, but there is no clinical deployment scenario and no decision threshold analysis.

7. Dataset Choice Is Safe but Not Challenging: The datasets used for the study are good for method comparison but weak for real clinical impact assessment.

8. Computational costs that matter for regular use of the model, such as training time, memory, and scalability to large EHR datasets, are not assessed.

9. The inconsistencies in grammar, repetition of text, density of figures, and claims of broader use than the data support are minor flaws in the manuscript.

Recommendations: The manuscript presents a technically sound and novel ensemble optimization approach with consistent improvements across benchmark healthcare datasets. However, before strong endorsement, the work would benefit from enhanced theoretical grounding of the optimization procedure, additional validation on external datasets, and improved interpretability analyses to support real-world clinical applicability.

Is the work clearly and accurately presented and does it cite the current literature?

Yes

If applicable, is the statistical analysis and its interpretation appropriate?

Yes

Are all the source data underlying the results available to ensure full reproducibility?

Yes

Is the study design appropriate and is the work technically sound?

Partly

Are the conclusions drawn adequately supported by the results?

Yes

Are sufficient details of methods and analysis provided to allow replication by others?

Partly

Reviewer Expertise:

Qualitative Research Methods, Medical Students, Teaching, Learning, Public health

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Reviewer response for version 1

Referee: Helen, SRM Institute of Science and Technology, Kattankulathur, Tamil Nadu, India

Report DOI: 10.5256/f1000research.186779.r442725

Date: 7 January 2026

Competing interests: No competing interests were disclosed.

Recommendation: Approved with reservations

Novelty and Contribution:

The development of GBWOEM represents a meaningful advancement in ensemble learning methods, with a clear focus on weight optimization. The proposed model is compared with standard algorithms and achieves improved accuracy, which highlights its potential contribution.

Methodology:

The implementation of two different weight initialization methods—random (GBWOEM-R) and uniform (GBWOEM-U)—and the comparative analysis effectively show the impact of weight initialization on ensemble performance.

Interpretability:

The paper could benefit from a more detailed explanation of how the optimized weights of base models contribute to the final predictions. This would enhance the interpretability and practical relevance of the model, especially in healthcare applications.

Class Imbalance:

The datasets used are reported to have class imbalance. The authors should clarify how this imbalance was addressed during training and evaluation to ensure the results are reliable.

Overfitting Concern:

In Figure 9, ROC curves with AUC values near 1.00 may indicate potential overfitting. The authors should verify these results, possibly by evaluating the models on independent or cross-validation datasets to ensure generalizability.

Is the work clearly and accurately presented and does it cite the current literature?

Partly

If applicable, is the statistical analysis and its interpretation appropriate?

Partly

Are all the source data underlying the results available to ensure full reproducibility?

Partly

Is the study design appropriate and is the work technically sound?

Yes

Are the conclusions drawn adequately supported by the results?

Partly

Are sufficient details of methods and analysis provided to allow replication by others?

Partly

Reviewer Expertise:

Machine Learning, Deep Learning


Author response: Satyananda Champati Rai, School of Computer Engineering, Kalinga Institute of Industrial Technology, Bhubaneswar, Odisha, India

Date: 16 January 2026

Competing interests: There are no competing interests.

Q1. The paper could benefit from a more detailed explanation of how the optimized weights of base models contribute to the final predictions. This would enhance the interpretability and practical relevance of the model, especially in healthcare applications.

Response:

We thank the reviewer for this valuable suggestion regarding interpretability. To address this, we have expanded the Results section to explicitly detail how the optimized weights reflect the contribution of each base model.

For example, as noted in the revised manuscript, in the GBWOEM-R variant, the Random Forest Classifier (RFC) was assigned a significantly higher weight (1.002) compared to Logistic Regression (0.058), quantitatively demonstrating RFC's stronger contribution to the decision boundary for this specific dataset. Additionally, as requested, we have updated the relevant performance comparison figures to include the final weights of Random Weight (RW) and Uniform Weight (UW) initialization strategies. This visual comparison highlights how our optimization method specifically identifies and prioritizes the most reliable classifiers for healthcare diagnostics, rather than relying on static or arbitrary averaging.

Q2. The datasets used are reported to have class imbalance. The authors should clarify how this imbalance was addressed during training and evaluation to ensure the results are reliable.

Response:

We appreciate the reviewer highlighting this important aspect of data validity. We have analysed the class distribution for all five datasets used in the study. Four of the datasets exhibit a ratio approximating 60:40, which falls within the range of practical balance where standard classifiers perform effectively without intervention.

For the dataset exhibiting an imbalance, we empirically evaluated the model's performance without synthetic balancing (such as SMOTE). We have clarified in the Methodology section that a portion of each dataset is held out as a test set to evaluate model performance on unseen data. We observed that the model converged successfully with a negligible gap between training and testing performance metrics (specifically, Recall and F1-score). This indicates that the feature space was sufficiently distinct to allow the model to learn the minority class boundaries effectively. We have clarified that we avoided synthetic oversampling for this specific case to prevent the introduction of noise and to ensure the evaluation reflects the model's robustness on naturally occurring data distributions.

Q3. In Figure 9, ROC curves with AUC values near 1.00 may indicate potential overfitting. The authors should verify these results, possibly by evaluating the models on independent or cross-validation datasets to ensure generalizability.

Response:

We agree that AUC values approaching 1.00 warrant careful scrutiny. We have re-verified the results presented in Figure 9. The high AUC values are attributed to the high separability of the datasets, which is consistent with existing literature on the respective datasets.

Due to strict page length constraints in the submitted manuscript, we could not include the exhaustive Train-vs-Test comparison data for every model variant. However, to address the reviewer’s concern, we have provided this detailed analysis below (see Table 1). As illustrated in Table 1, the performance metrics (especially F1-score, recall, and precision) for both the training and testing phases fall within a consistent range with minimal deviation. The absence of a significant drop in performance from training to testing indicates that the model generalizes well to unseen data and is not memorizing the training set. We have added a statement to the Results section clarifying that this stability across splits supports the high AUC scores.

Table 1: Performance metrics of GBWOEM-R on the Breast Cancer dataset (up to 3 model pairs)

https://docs.google.com/spreadsheets/d/1BFQRoqDtJme5ZrmqXKYFwuNuKaGeyq7vQyu6L3k_PYo/edit?usp=sharing
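As a rough illustration of the train-versus-test comparison discussed in this response, the sketch below computes AUC on two splits and reports the gap. It assumes binary labels and uses made-up scores; `roc_auc` is a hypothetical helper that uses the rank interpretation of AUC (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one), not a function from the authors' code.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up scores: perfectly separated on train, one misranked pair on test.
y_train = [1, 1, 1, 0, 0, 0]
s_train = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
y_test = [1, 1, 0, 0]
s_test = [0.9, 0.4, 0.5, 0.1]

auc_train = roc_auc(y_train, s_train)
auc_test = roc_auc(y_test, s_test)
print("train AUC:", auc_train, "test AUC:", auc_test, "gap:", auc_train - auc_test)
```

In this toy case the gap is large, which is the pattern that would point to overfitting; the response argues that the reported results show no such drop.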

10.5256/f1000research.186779.r427169

Reviewer response for version 1

PRIYA K

HEMA

Referee, Easwari Engineering College, Chennai, Tamil Nadu, India

Competing interests: No competing interests were disclosed.

20 11 2025

2025

recommendation

approve

This manuscript presents a novel ensemble learning framework, the Gradient-Based Weight Optimized Ensemble Model (GBWOEM), which introduces a gradient-based weight optimization strategy for improving predictive accuracy in healthcare classification tasks. Two variants (GBWOEM-R and GBWOEM-U) are developed and evaluated across five diverse healthcare datasets.

The model is compared with well-established ensemble methods such as AdaBoost, CatBoost, GradientBoost, LightGBM, and XGBoost. The study is technically sound and methodologically comprehensive, offering meaningful improvements in prediction accuracy.

However, certain aspects require deeper theoretical justification, statistical validation, and refinement to enhance its scientific robustness, clarity, and clinical relevance.

Observation 1: The paper outlines the use of gradient descent for weight optimization but lacks a detailed theoretical rationale for convergence, stability, or hyperparameter selection.

Recommendation 1: Include a mathematical discussion or citation to establish convergence guarantees, sensitivity analysis, and computational complexity details.

Observation 2: The study focuses on accuracy but lacks interpretability, crucial for healthcare AI.

Recommendation 2: Add analyses using SHAP, LIME, or feature importance to show clinical relevance.

The manuscript demonstrates strong potential and novelty but requires theoretical, statistical, and interpretive enhancements. Addressing these revisions will significantly improve its scholarly rigor and readiness for indexing.

Is the work clearly and accurately presented and does it cite the current literature?

Yes

If applicable, is the statistical analysis and its interpretation appropriate?

Partly

Are all the source data underlying the results available to ensure full reproducibility?

Yes

Is the study design appropriate and is the work technically sound?

Yes

Are the conclusions drawn adequately supported by the results?

Yes

Are sufficient details of methods and analysis provided to allow replication by others?

Yes

Reviewer Expertise:

Thyroid Disorders, Deep Learning, Machine Learning, RNN, DNN, CNN, AI

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

References

1. Intelligent Fusion of Heuristically Optimized 1DCNN with Weighted Optimized DNN for Thyroid Disorder Prediction Framework. International Journal of Information Technology & Decision Making. 2025;24(05):1397-1433. doi:10.1142/S0219622025500105

2. doi:10.1007/s10878-025-01304-4

3. Innovative Framework for Thyroid Disease Detection by Leveraging Hybrid AGTEO Feature Selection and GRU Classification Model. International Research Journal of Multidisciplinary Technovation. 2024:112-127. doi:10.54392/irjmt2439

Input:
- Learning rate ($\alpha$)
- Number of iterations (num_iterations)
- Patience for early stopping (patience)
- Weight initializer method (weight_initializer)

Output:
- Trained ensemble model with optimized weights

Parameters:
- Number of base models: $n$
- Base models: $\{M_1, M_2, \ldots, M_n\}$
- Ensemble predictions on training set: $\{\hat{Y}_{\text{train}}^{(1)}, \hat{Y}_{\text{train}}^{(2)}, \ldots, \hat{Y}_{\text{train}}^{(n)}\}$
- Ensemble predictions on validation set: $\{\hat{Y}_{\text{val}}^{(1)}, \hat{Y}_{\text{val}}^{(2)}, \ldots, \hat{Y}_{\text{val}}^{(n)}\}$
- Ensemble predictions on test set: $\{\hat{Y}_{\text{test}}^{(1)}, \hat{Y}_{\text{test}}^{(2)}, \ldots, \hat{Y}_{\text{test}}^{(n)}\}$
- Weights: $W = \{w_1, w_2, \ldots, w_n\}$

Steps:
1. If weight_initializer == 'random':
2.     W = [n random numbers sampled uniformly from [0, 1)]
3.     $W_{\text{normalized}} = W / \sum W$
4.     W = $W_{\text{normalized}}$
5. Else If weight_initializer == 'uniform':
6.     W = ones_array / n
7. Else:
8.     Raise ValueError
9. For each base model $M_i$:
10.     Fit $M_i$ on training data $(X_{\text{train}}, Y_{\text{train}})$
11.     Predict probabilities for $X_{\text{train}}$, $X_{\text{val}}$, $X_{\text{test}}$
12.     Store predictions $\hat{Y}_{\text{train}}^{(1)}, \hat{Y}_{\text{train}}^{(2)}, \ldots, \hat{Y}_{\text{train}}^{(n)}$
13. For iteration t in range(num_iterations):
14.     Compute ensemble predictions using current weights:
15.     $\hat{Y}_{\text{ensemble,train}}^{(t)} = \sum_{i=1}^{n} w_i^{(t)} \cdot \hat{Y}_{\text{train}}^{(i)}$
16.     $\hat{Y}_{\text{ensemble,val}}^{(t)} = \sum_{i=1}^{n} w_i^{(t)} \cdot \hat{Y}_{\text{val}}^{(i)}$
17.     $\hat{Y}_{\text{ensemble,test}}^{(t)} = \sum_{i=1}^{n} w_i^{(t)} \cdot \hat{Y}_{\text{test}}^{(i)}$
18.     Compute training and validation loss:
19.     $\text{loss}_{\text{train}}^{(t)} = \text{Loss\_Function}(y_{\text{train}}, \hat{Y}_{\text{ensemble,train}}^{(t)})$
20.     $\text{loss}_{\text{val}}^{(t)} = \text{Loss\_Function}(y_{\text{val}}, \hat{Y}_{\text{ensemble,val}}^{(t)})$
21.     Early stopping:
22.     If $\text{loss}_{\text{val}}^{(t)}$ < best_loss:
23.         best_loss = $\text{loss}_{\text{val}}^{(t)}$, best_weights = current_weights, count = 0
24.     Else:
25.         count = count + 1
26.         If count >= patience: Break
27.     Compute gradient of loss w.r.t. weights:
28.     $\nabla L = \frac{1}{N} \cdot (\hat{Y}_{\text{ensemble,train}}^{(t)} - y_{\text{train}}) \cdot (\hat{Y}_{\text{train}}^{(t)})^{T}$
29.     Update weights using gradient descent:
30.     current_weights = current_weights $- \alpha \cdot \nabla L$
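The weight-optimization loop in the algorithm above can be sketched in plain Python as follows. This is a minimal illustration under several assumptions, not the authors' implementation: labels are binary, each base model's positive-class probabilities are already available (steps 9 to 12 are taken as done), and the unspecified `Loss_Function` is assumed to be half the mean squared error, chosen because its gradient has the form given in step 28. The function names, default hyperparameters, fixed random seed, and toy data are all hypothetical.

```python
import random


def init_weights(n, method, seed=0):
    """Steps 1-8: normalized random weights or uniform weights."""
    if method == "random":
        rng = random.Random(seed)  # fixed seed is an assumption, for repeatability
        w = [rng.random() for _ in range(n)]
        total = sum(w)
        return [x / total for x in w]
    if method == "uniform":
        return [1.0 / n] * n
    raise ValueError(f"unknown weight initializer: {method!r}")


def combine(weights, preds):
    """Weighted sum of base-model probabilities; preds[i][j] is model i on sample j."""
    return [sum(w * p[j] for w, p in zip(weights, preds)) for j in range(len(preds[0]))]


def half_mse(y, y_hat):
    """Assumed loss: half the mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y_hat, y)) / (2 * len(y))


def optimize_weights(p_train, y_train, p_val, y_val, alpha=0.1,
                     num_iterations=500, patience=20, weight_initializer="uniform"):
    weights = init_weights(len(p_train), weight_initializer)
    best_loss, best_weights, count = float("inf"), list(weights), 0
    n_samples = len(y_train)
    for _ in range(num_iterations):
        ens_train = combine(weights, p_train)
        val_loss = half_mse(y_val, combine(weights, p_val))
        # Early stopping on the validation loss (steps 21-26).
        if val_loss < best_loss:
            best_loss, best_weights, count = val_loss, list(weights), 0
        else:
            count += 1
            if count >= patience:
                break
        # Gradient of the assumed loss w.r.t. each weight (step 28).
        residual = [e - y for e, y in zip(ens_train, y_train)]
        grad = [sum(r * p for r, p in zip(residual, p_i)) / n_samples for p_i in p_train]
        # Gradient-descent update (step 30).
        weights = [w - alpha * g for w, g in zip(weights, grad)]
    return best_weights, best_loss


# Toy data: model A is informative, model B hovers around 0.5.
y_train = [1, 0, 1, 0, 1, 0]
p_train = [[0.9, 0.1, 0.8, 0.2, 0.9, 0.1],
           [0.6, 0.5, 0.4, 0.6, 0.5, 0.5]]
y_val = [1, 0, 1, 0]
p_val = [[0.8, 0.2, 0.9, 0.1],
         [0.5, 0.6, 0.5, 0.4]]

w, loss = optimize_weights(p_train, y_train, p_val, y_val)
print("weights:", w, "best validation loss:", loss)
```

On this toy data the informative model ends up with the larger weight. Note that, as in the listing, the weights are not renormalized after each update, so they need not sum to 1 or remain non-negative.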