Development of a machine learning predictive model for early detection of breast cancer

Rinsy Rahman; Dola Saha; Winniecia Dkhar; Sathyendranath Malli; Neil Barnes Abraham

doi:10.12688/f1000research.161073.6

Research Article

Revised

[version 6; peer review: 1 approved, 2 approved with reservations, 2 not approved]

Rinsy Rahman¹, Dola Saha¹, Winniecia Dkhar², Sathyendranath Malli³, Neil Barnes Abraham²

PUBLISHED 23 Jun 2026

¹ Department of Health Information Management, Manipal College of Health Professions, Manipal Academy of Higher Education, Manipal, Karnataka, 576104, India
² Department of Medical Imaging Technology, Manipal College of Health Professions, Manipal Academy of Higher Education, Manipal, Karnataka, 576104, India
³ School of Information Science, Manipal Academy of Higher Education, Manipal, Karnataka, 576104, India

Rinsy Rahman
Roles: Conceptualization, Data Curation, Investigation, Methodology, Software, Validation, Visualization, Writing – Original Draft Preparation

Dola Saha
Roles: Conceptualization, Data Curation, Methodology, Supervision, Validation, Writing – Original Draft Preparation

Winniecia Dkhar
Roles: Conceptualization, Data Curation, Investigation, Supervision, Validation, Visualization, Writing – Review & Editing

Sathyendranath Malli
Roles: Formal Analysis, Investigation, Methodology, Resources, Software, Validation, Visualization

Neil Barnes Abraham
Roles: Data Curation, Formal Analysis, Software, Writing – Review & Editing

This article is included in the Artificial Intelligence and Machine Learning gateway.

This article is included in the Manipal Academy of Higher Education gateway.

This article is included in the AI in Medicine and Healthcare collection.

Abstract

Background

Breast cancer remains a significant global health concern, with over 7.8 million cases reported in the last five years. Early detection and accurate classification are crucial for reducing mortality rates and improving outcomes. Machine learning (ML) has emerged as a transformative tool in medical imaging, enabling more efficient and accurate diagnostic processes.

Objective

This study aims to develop a machine learning-based predictive model for early detection and classification of breast cancer using the Wisconsin Breast Cancer Diagnostic dataset.

Methods

The dataset, comprising 569 samples and 33 columns (30 of them numeric features derived from fine needle aspirate biopsy images), was pre-processed through data cleaning, normalization using the Robust Scaler, and feature selection. Five supervised ML algorithms were implemented: Logistic Regression, Support Vector Classification (SVC) with linear and radial basis function (RBF) kernels, Decision Tree, and Random Forest. Models were evaluated using performance metrics, including accuracy, precision, sensitivity, specificity, and F1 scores.

Results

The SVC-RBF model demonstrated the highest accuracy (98.68%) and balanced performance across other metrics, making it the most effective classifier for distinguishing between benign and malignant tumors. Key features such as texture mean and area (worst) significantly contributed to classification accuracy.

Conclusions

This study highlights the potential of ML algorithms, particularly SVC-RBF, to revolutionize breast cancer diagnostics through improved accuracy and efficiency. Future research should validate these findings with diverse datasets and explore their integration into clinical workflows to enhance decision-making and patient care.

Keywords

Breast cancer, Mammography, Machine learning, Tumor classification, Predictive modelling

Corresponding authors: Dola Saha, Winniecia Dkhar

Competing interests: No competing interests were disclosed.

Grant information: The author(s) declared that no grants were involved in supporting this work.

Copyright: © 2026 Rahman R et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Rahman R, Saha D, Dkhar W et al. Development of a machine learning predictive model for early detection of breast cancer [version 6; peer review: 1 approved, 2 approved with reservations, 2 not approved]. F1000Research 2026, 14:164 (https://doi.org/10.12688/f1000research.161073.6) First published: 05 Feb 2025, 14:164 (https://doi.org/10.12688/f1000research.161073.1) Latest published: 23 Jun 2026, 14:164 (https://doi.org/10.12688/f1000research.161073.6)

Revised Amendments from Version 5

In this revised version, the manuscript has been substantially updated in response to reviewer comments. The Introduction has been strengthened through the inclusion of additional recent literature, discussion of limitations in previous studies, and clearer justification for the selected machine learning models. The Methods section now provides more detailed information on data preprocessing, feature scaling, hyperparameter tuning, evaluation metrics, and the computational environment used for model development. Additional performance metrics, including Geometric Mean (G-Mean) and Matthews Correlation Coefficient (MCC), have been incorporated, together with statistical validation using five-fold cross-validation, mean accuracy, standard deviation, and root mean square error (RMSE). A new comparative table summarising related studies, their datasets, models, results, and limitations has been added, and the Discussion has been expanded to compare the present findings with previous research. Furthermore, dedicated sections describing the strengths and limitations of the study have been included, and the Conclusion has been revised to provide a clearer summary of the findings, acknowledge study limitations, and outline future research directions.

See the authors' detailed response to the review by Musatafa Abbas Abbood Albadr
See the authors' detailed response to the review by Manna Debnath
See the authors' detailed response to the review by Rolando Gonzales Martinez
See the authors' detailed response to the review by Chandrakanta Mahanty
See the authors' detailed response to the review by Abicumaran Uthamacumaran

1. Introduction

Breast cancer is a global health concern that affects millions of women worldwide. In the last five years alone, 7.8 million women have been diagnosed with this disease,¹ which underscores the need for proactive measures such as regular screening, self-examination, and increased awareness, as well as effective treatment options. Health systems must be significantly reinforced to improve breast cancer outcomes. Early detection and screening are central to reducing mortality and enabling effective treatment.^2,3 Given the rise in breast cancer cases, rapid diagnosis supported by machine learning has been reported to be highly beneficial.⁴

The integration of AI in breast cancer detection and diagnosis has the potential to revolutionize the field of oncology.^5,6 In recent years, machine learning (ML) algorithms have emerged as powerful tools in the field of medical imaging, offering the potential to enhance the accuracy and efficiency of tumour detection and classification.^7,8 Machine learning algorithms can analyse vast amounts of data and identify patterns that may not be apparent to human experts. Machine learning algorithms can be trained to analyse mammograms and provide additional insights to radiologists, helping them make more informed decisions. Healthcare providers and researchers must continue to explore and harness the power of AI to further enhance breast cancer care.^8–10

Several machine learning and deep learning approaches have been proposed for breast cancer diagnosis, including Logistic Regression, Support Vector Machines, Decision Trees, Random Forests, Artificial Neural Networks, and Convolutional Neural Networks. Previous studies have reported promising classification performance; however, differences in preprocessing strategies, feature selection methods, model complexity, and evaluation procedures have resulted in variable outcomes and have limited direct comparison of model performance.^11,12 Furthermore, many of these studies relied on complex models, and some deep learning approaches require large datasets and substantial computational resources, limiting their applicability in resource-constrained environments.^13,14 Therefore, there remains a need for robust, interpretable, and computationally efficient machine learning models capable of achieving high diagnostic accuracy while remaining easy to implement.^15–17

To address the need for accurate and efficient breast cancer diagnosis, this study evaluated Logistic Regression, Support Vector Classification (Linear and Radial Basis Function kernels), Decision Tree, and Random Forest classifiers using the Wisconsin Breast Cancer Diagnostic dataset. These algorithms were selected because of their proven effectiveness in medical classification tasks. The primary objective was to identify the most reliable model for distinguishing benign and malignant breast lesions and to improve diagnostic accuracy for early breast cancer detection.

2. Methods

2.1 Study design and setting

This research was conducted within the Health Informatics Laboratory, Department of Health Information Management, Manipal College of Health Professions, Manipal Academy of Higher Education, Manipal, over six months (January–June 2022). The study aimed to develop and evaluate a machine learning predictive model for early detection and differential diagnosis of benign and malignant breast lesions.

2.2 Data source and inclusion criteria

The Wisconsin Breast Cancer Diagnostic dataset, available on Kaggle,¹⁸ was utilized. This dataset comprises 569 records and 33 columns. After the identifier column and an empty column are removed and the diagnosis label is separated, 30 numeric features remain; these are derived from fine needle aspirate (FNA) biopsy images and represent tumor characteristics. Key features analysed included tumor radius, texture, perimeter, area, smoothness, compactness, concavity, symmetry, and fractal dimension.

2.3 Data preprocessing

Data preprocessing was undertaken in several steps to ensure the dataset was both reliable and suitable for model development. Missing and null values were first removed. The features were then normalized using the Robust Scaler, which centers values on the median and scales them by the interquartile range. This choice was made to reduce the influence of outliers and to better accommodate the non-Gaussian feature distributions often present in high-dimensional medical datasets, while still preserving meaningful relationships between variables.

To gain a deeper understanding of the data, Exploratory Data Analysis (EDA) was performed in Python using violin plots, box plots, and correlation matrices, which guided the identification of clinically and statistically relevant features. During this process, strong correlations were observed among certain predictors (e.g., 0.86 between concavity worst and concave points worst). To manage potential multicollinearity, we used tree-based algorithms such as Random Forest and Decision Tree, which are naturally robust to correlated inputs, while regularization in the linear models (Logistic Regression and SVC) further reduced redundancy effects. Because our focus was on predictive performance rather than coefficient-level interpretation, these correlations were not expected to bias the results.

Finally, the dataset was split into input features (X) and target labels (y), with categorical diagnosis values encoded into binary form (0 = benign, 1 = malignant). Dimensionality reduction methods such as PCA or t-SNE were not applied, as the dataset included only 30 features, which allowed efficient computation and straightforward interpretation. Preserving the clinical interpretability of individual features was also prioritized over transformations like PCA, which produces composite variables, or t-SNE, which is mainly intended for visualization.
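As a minimal sketch of the median and interquartile-range scaling described above, the following standard-library Python snippet scales one hypothetical feature column by hand. The function name robust_scale, the sample values, and the use of inclusive (linear-interpolation) quartiles are illustrative assumptions; the study itself used Scikit-learn's Robust Scaler, and results on real data may differ in small details.

```python
import statistics


def robust_scale(values):
    """Center on the median and divide by the interquartile range (IQR)."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; the feature is (nearly) constant")
    return [(v - median) / iqr for v in values]


# Hypothetical feature column with one extreme value (not study data).
feature = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled = robust_scale(feature)
print(scaled)  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The extreme value remains large after scaling, but it does not shift the center or the scale applied to the other values, which is the property the preprocessing step relies on.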

2.4 Model development

Five supervised machine learning algorithms were implemented: Logistic Regression, Support Vector Classification (SVC) with linear and radial basis function (RBF) kernels, Decision Tree, and Random Forest. The dataset was divided into training and testing subsets using a 60:40 split with Scikit-learn’s train_test_split function. Models were trained on the training set and optimized through hyperparameter tuning using GridSearchCV with five-fold cross-validation to improve model generalizability and reduce the risk of overfitting. For the SVC with RBF kernel, the parameter grid explored included C = [0.1, 1, 10, 100] and gamma = [‘scale’, 0.01, 0.001], with the kernel fixed as ‘rbf’. Similar tuning strategies were applied for the remaining classifiers. The optimal parameters identified through cross-validation were: Logistic Regression with C = 1.0; SVC with linear kernel with C = 1.0; SVC with RBF kernel with C = 1.0 and gamma = ‘scale’; Decision Tree with max_depth = 5 and criterion = ‘gini’; and Random Forest with n_estimators = 100, max_depth = 6, and criterion = ‘entropy’. These parameters were selected based on the highest cross-validated accuracy and were subsequently used for final model training and evaluation.
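The tuning procedure for the SVC with RBF kernel can be sketched as follows. This is an illustration under stated assumptions, not the study's actual code: it loads Scikit-learn's bundled copy of the Wisconsin diagnostic data rather than the Kaggle file, the random_state and stratify settings are assumptions, and placing the scaler inside a Pipeline is one possible design that the text does not confirm. Only the split ratio, the five folds, and the parameter grid are taken from the description above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

# Bundled copy of the Wisconsin diagnostic data (569 samples, 30 features).
# Note: this copy codes 0 = malignant and 1 = benign, the reverse of the
# 0 = benign, 1 = malignant coding used in the study.
X, y = load_breast_cancer(return_X_y=True)

# 60:40 split; random_state and stratify are assumptions for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)

# Keeping the scaler inside the pipeline means each cross-validation fold is
# scaled using statistics from its own training portion only.
pipeline = Pipeline([("scaler", RobustScaler()), ("svc", SVC(kernel="rbf"))])

# Parameter grid as listed in the text for the RBF kernel.
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": ["scale", 0.01, 0.001],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_)
print(round(test_accuracy, 4))
```

The selected parameters and the test accuracy depend on the random split, so they need not match the values reported in this study.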

2.5 Performance evaluation

The performance of the classification models was evaluated using accuracy, precision, sensitivity (recall), specificity, and F1-score. These metrics were calculated from the confusion matrix, where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively.

Accuracy = \frac{(TP + TN)}{(TP + TN + FP + FN)}

Precision = \frac{TP}{(TP + FP)}

Sensitivity (Recall) = \frac{TP}{(TP + FN)}

Specificity = \frac{TN}{(TN + FP)}

F1-score = 2 \times \frac{(Precision \times Recall)}{(Precision + Recall)}

These evaluation metrics are widely used for assessing the performance of machine learning classification models and were calculated using the Scikit-learn library.

Among the models, SVC-RBF achieved the highest testing accuracy (98.68%, approximately 99%), supporting its suitability for early detection and differential diagnosis of breast lesions. These metrics were evaluated both on the test dataset and during five-fold cross-validation to check that performance generalized consistently across splits. Averaging performance across multiple folds gives a more robust evaluation than a single split.

Root Mean Square Error (RMSE)

RMSE measures the average magnitude of prediction errors and was used as an additional indicator of model reliability.

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}}

where y_i is the actual class label, ŷ_i is the predicted class label, and n is the total number of observations. Lower RMSE values indicate better predictive performance and greater model consistency.
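A minimal sketch of this formula in standard-library Python is shown below. The function name rmse and the label vectors are hypothetical and are not study data. With 0/1 class labels, every misclassification contributes a squared error of exactly 1, so under this formula the RMSE equals the square root of the misclassification rate.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between actual and predicted labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    squared_errors = [(a - p) ** 2 for a, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / n)


# Hypothetical labels (0 = benign, 1 = malignant); not study data.
y_true = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]  # one misclassification out of ten

error = rmse(y_true, y_pred)
accuracy = sum(a == p for a, p in zip(y_true, y_pred)) / len(y_true)
print(error, accuracy)
```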

2.6 Statistical evaluation metrics

Accuracy represents the proportion of correctly classified instances among all observations.

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

Precision measures the proportion of positive predictions that are correct.

Precision = \frac{TP}{TP + FP}

Sensitivity (Recall) measures the ability of the classifier to correctly identify positive cases.

Sensitivity = \frac{TP}{TP + FN}

Specificity measures the ability of the classifier to correctly identify negative cases.

Specificity = \frac{TN}{TN + FP}

The F1-score provides a harmonic balance between precision and recall.

F1-score = 2 \times \frac{Precision \times Recall}{Precision + Recall}

G-Mean evaluates the balance between sensitivity and specificity.

G-Mean = \sqrt{Sensitivity \times Specificity}

MCC is a correlation-based measure that evaluates the overall quality of binary classifications.

MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) (TP + FN) (TN + FP) (TN + FN)}}
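The definitions above can be collected into one small function, sketched here in standard-library Python. The function name classification_metrics and the confusion-matrix counts are hypothetical, chosen only to make the arithmetic easy to follow; the sketch also assumes that no denominator is zero.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics defined above from confusion-matrix counts.

    Assumes every denominator is non-zero (both classes occur and both
    are predicted at least once).
    """
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "g_mean": math.sqrt(sensitivity * specificity),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Hypothetical counts (not study data).
metrics = classification_metrics(tp=90, tn=95, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```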

2.7 Statistical tools and software

All analyses were performed using Python 3.8¹⁹ in Jupyter Notebook. Libraries used included Pandas (v1.2.4)²⁰ for data manipulation, NumPy (v1.20.3)²¹ for numerical computations, Matplotlib (v3.4.2)²² and Seaborn (v0.11.1)²³ for data visualization, and Scikit-learn (v0.24.2)²⁴ for machine learning. The experiments were executed on a workstation equipped with an Intel Core i7 processor, 16 GB RAM, and the Windows 10 operating system. Given the relatively small size of the Wisconsin Breast Cancer Diagnostic dataset, dedicated GPU resources were not required for model training or evaluation.

2.8 Ethical considerations

The dataset was extracted from the online open-source Wisconsin (Diagnostics) dataset. Study approval was obtained from the Institutional Research Committee of Manipal College of Health Professions, Manipal, on 20 January 2022 (MCHP/Mpl/IRC/PG/2022/04). All procedures adhered to established ethical guidelines for secondary data analysis and data use policies. Consent was not applicable because the data were extracted from this publicly available open-source dataset.

3. Results

3.1 Exploratory Data Analysis (EDA) and data preprocessing

The breast cancer dataset, comprising 569 samples, was subjected to exploratory data analysis (EDA) to evaluate its structure and identify relevant features. During data preprocessing, two non-informative columns, ‘id’ and ‘Unnamed: 32’ (which contained only missing values), were removed. Analysis of the target variable (‘diagnosis’) showed that the dataset included 212 malignant (37.3%) and 357 benign (62.7%) cases, indicating a predominance of benign samples. A bar graph (Figure 1) illustrates this distribution. Following data cleaning, the dataset was divided into feature variables (X) and the target variable (y), ensuring all numeric features remained in X while the categorical ‘diagnosis’ variable was placed in y.

Figure 1. Bar graph showing the frequency of diagnosis column.

M - Malignant Tumor and B - Benign Tumor.

3.2 Feature extraction and visualization

3.2.1 Violin plots

The distributions of the thirty features in the dataset were visualized using violin plots to assess their potential for distinguishing between malignant and benign tumors (Figure 2A–C). The texture mean displayed distinct median values for the two tumor types and a wider spread in the kernel density estimate (KDE) for malignant tumors, suggesting that it is a useful feature for classification. In contrast, the fractal dimension mean showed similar medians for both tumor types, indicating limited discriminative power. Features such as concave points (se) and concavity (se) also exhibited overlapping distributions, making them less valuable for classification. On the other hand, area (se) and area (worst) each showed a clear separation between benign and malignant tumors, marking them as strong candidates for classification models, whereas fractal dimension (worst) and concavity (worst) exhibited overlapping distributions, suggesting reduced utility. Overall, texture mean, area (se), and area (worst) emerged as the most promising features for classification, while the others showed limited differentiation between tumor types.

Figure 2. (A) Violin plot for the first ten features, (B) for the second set of features, (C) for the last set of features, and (D) joint plot showing the correlation between concave points worst and concavity worst.

3.2.2 Joint plot

A joint plot was used to analyze the relationship between concavity worst and concave points worst, as their distributions appeared to be similar (Figure 2D). The joint plot, which combines a scatter plot with marginal histograms, provides a comprehensive view of each variable’s distribution and of the relationship between the two variables. The analysis revealed a strong correlation of 0.86 between the two features, accompanied by a statistically significant p-value. This indicates a high degree of linear association between concavity worst and concave points worst, suggesting that they capture similar information regarding the tumor characteristics. Given their strong correlation, retaining only one of these features in the classification model is advisable, as including both would introduce redundancy without contributing additional discriminative power.
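The correlation coefficient reported for this pair of features is a Pearson correlation, which can be sketched in standard-library Python as follows. The function name pearson_r and the two short value lists are hypothetical stand-ins for the concavity worst and concave points worst columns; they are not study data and will not reproduce the reported value of 0.86.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical values standing in for two correlated features.
concavity_worst = [0.10, 0.25, 0.30, 0.45, 0.70]
concave_points_worst = [0.05, 0.09, 0.14, 0.15, 0.26]

r = pearson_r(concavity_worst, concave_points_worst)
print(round(r, 2))
```

A value close to 1 indicates that one feature is nearly a linear function of the other, which is the basis for treating the pair as redundant.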

3.2.3 Box plot

Box plots were used to visualize the distribution of key features across malignant and benign tumor groups, offering a clear representation of the data’s spread, central tendency, and variability. These plots divide the data into quartiles, highlighting the minimum, first quartile, median, third quartile, and maximum values, and can also identify potential outliers. Box plots are useful for comparing feature distributions between groups and identifying differences in spread and central values.

In this study, box plots were employed to explore the relationship between features that were highly correlated in the correlation matrix, such as texture mean and texture worst, as well as area mean and area worst (Figure 3A–D). The analysis of these features in relation to the diagnosis column showed that each pair was distributed similarly across the malignant and benign groups, indicating redundancy in the information they provide. For instance, texture mean and texture worst showed comparable distributions, suggesting that retaining both features in the model would likely result in redundancy. Consequently, one feature from each highly correlated pair can be excluded from the classification process without sacrificing predictive power. Visual examination of the box plots also helped clarify how each feature discriminates between malignant and benign groups.

Figure 3. Box plots of (A) texture mean, (B) texture worst, (C) area mean, and (D) area worst versus tumor diagnosis.

3.3 Label encoding

Label encoding was employed to handle the categorical data within the dataset, specifically the diagnosis column, which consists of two classes: malignant (M) and benign (B). Label encoding is a technique used to transform categorical variables into numerical values, facilitating their inclusion in machine learning models that require numerical input. In this case, the diagnosis feature was encoded by assigning the value 0 to benign tumors and 1 to malignant tumors. This transformation of categorical data into binary values enables the classification algorithms to process the target variable effectively.

Label encoding is well suited to datasets with binary or ordinal categorical data, because the integer codes can reflect the structure of the classes. This encoding allows the diagnosis column to be used directly by the machine learning models. The encoded values (0 and 1) served as the target variable, while the extracted features, such as tumor radius, texture, perimeter, and others, remained in their continuous form.
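The mapping described in this section can be sketched with a plain dictionary. The names LABEL_MAP and encode_diagnosis and the short list of labels are hypothetical; the study does not state which encoding utility was used.

```python
# Mapping stated in the text: benign -> 0, malignant -> 1.
LABEL_MAP = {"B": 0, "M": 1}


def encode_diagnosis(labels):
    """Convert 'B'/'M' diagnosis labels into 0/1 integers."""
    try:
        return [LABEL_MAP[label] for label in labels]
    except KeyError as exc:
        raise ValueError(f"Unexpected diagnosis label: {exc.args[0]!r}") from None


# Hypothetical diagnosis column (not study data).
diagnosis = ["M", "B", "B", "M", "B"]
y = encode_diagnosis(diagnosis)
print(y)  # [1, 0, 0, 1, 0]
```

Raising an error on unknown labels, rather than silently assigning a new code, keeps the target strictly binary.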

3.4 Dataset splitting and feature scaling

In this study, the dataset was divided into training and testing sets using the train-test split method to evaluate the performance of machine learning algorithms. The dataset was split with a 60:40 ratio, where 60% of the data was used for training the model, and 40% was reserved for testing. The primary goal of this split is to assess how well the model generalizes to unseen data by training it on the training set and evaluating it on the testing set. The training set allows the model to learn from known data, while the testing set is used exclusively for making predictions, providing an unbiased estimate of model performance.

The dataset was divided into input features (X) and the target variable (y). The target variable, diagnosis (benign or malignant), was assigned to y, and the remaining features used for classification were assigned to X. Consequently, the dataset was split into four variables: X_train, X_test, y_train, and y_test, representing the training and testing sets for both features and target variable.
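The arithmetic of the 60:40 split can be sketched in standard-library Python. The function name split_indices and the seed are hypothetical, and rounding the test-set size up is an assumption that is believed to match the behaviour of Scikit-learn's train_test_split for a fractional test size. With 569 samples this gives 228 test cases, which agrees with the per-class support counts in Table 2 (143 + 85).

```python
import math
import random


def split_indices(n_samples, test_fraction, seed):
    """Shuffle sample indices and split them into train and test parts."""
    n_test = math.ceil(test_fraction * n_samples)  # round the test size up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(569, 0.4, seed=0)
print(len(train_idx), len(test_idx))  # 341 228
```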

Following the train-test split, feature scaling was performed to normalize the features within the dataset. Feature scaling is a preprocessing technique used to transform the features onto a comparable scale, which improves the performance of many machine learning algorithms. In this study, the Robust Scaler was applied, which removes the median and scales the data by the interquartile range (IQR). This scaling method limits the effect of outliers on the scaled values, which is particularly beneficial when dealing with features that have different scales or units. The scaled data was then used for model training and evaluation, so that features measured on different scales contribute comparably to the learning process.

Among the features, texture mean and area (worst) emerged as the most discriminative, supported by their strong correlations with diagnostic class labels and their consistently high importance rankings across tree-based models. Although advanced interpretability methods such as SHAP or mutual information scores were not employed, these complementary quantitative measures provided robust evidence of their significance.

3.5 Model development

In this study, various machine learning models were developed and evaluated using different supervised classification algorithms to identify the most accurate model for classifying benign and malignant breast lesions. A classifier algorithm is designed to map input data to specific categories, making it suitable for tasks such as classification of breast lesions. The algorithms utilized in this project include Logistic Regression, Support Vector Classifier (SVC) with a linear kernel, Support Vector Classifier (SVC) with a radial basis function (RBF) kernel, Decision Tree Classifier, and Random Forest Classifier.

The models were developed using the training dataset, with each classifier being imported from the Scikit-learn library. Each model was assigned to a variable, and its fit method was used to train it on the input features (X_train) and target variable (y_train). This enabled the models to learn from the data and adjust their parameters to improve classification performance.

Following the training process, the accuracy of each model was calculated to assess its performance. The Decision Tree Classifier achieved the highest training accuracy of 1.0, indicating perfect classification performance on the training set. On the other hand, the SVC with the radial basis function kernel exhibited the lowest training accuracy among the classifiers. Training accuracy alone does not show how well a model generalizes, so the models were evaluated further using additional metrics such as cross-validation, precision, recall, and F1 score to determine the most reliable classifier for the task (Figure 4).

Figure 4. Forest plot comparing performance metrics across different classifiers.

3.6 Performance evaluation

The evaluation of the classification models was performed to determine their effectiveness in distinguishing between benign and malignant breast lesions. Testing accuracy was calculated using a confusion matrix, which summarizes the performance of the classification models in terms of actual and predicted values. The confusion matrix provided four counts for each classification algorithm: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN), as shown in Table 1.

Table 1. Confusion matrix values of the classification algorithms.

Classification Algorithm	True Positive (TP) Value	True Negative (TN) Value	False Positive (FP) Value	False Negative (FN) Value
Logistic Regression	142	83	2	2
SVC Linear	139	81	4	4
SVC RBF	142	83	1	2
Decision Tree	139	76	4	9
Random Forest	138	80	5	5
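As a worked example, the testing accuracy of each classifier can be recomputed from the counts in Table 1 using the accuracy formula from the Methods. The dictionary name confusion_counts is hypothetical; the numbers are copied from the table in the order TP, TN, FP, FN.

```python
# Counts copied from Table 1 as (TP, TN, FP, FN).
confusion_counts = {
    "Logistic Regression": (142, 83, 2, 2),
    "SVC Linear": (139, 81, 4, 4),
    "SVC RBF": (142, 83, 1, 2),
    "Decision Tree": (139, 76, 4, 9),
    "Random Forest": (138, 80, 5, 5),
}

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = {
    name: (tp + tn) / (tp + tn + fp + fn)
    for name, (tp, tn, fp, fn) in confusion_counts.items()
}

for name, value in accuracy.items():
    print(f"{name}: {value:.4f}")

best = max(accuracy, key=accuracy.get)
worst = min(accuracy, key=accuracy.get)
print(best, worst)  # SVC RBF Decision Tree
```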

Among the classification algorithms, the Support Vector Classifier (SVC) with a Radial Basis Function (RBF) kernel exhibited the highest testing accuracy of 0.9868, indicating its superior ability to predict correctly. In contrast, the Decision Tree Classifier demonstrated the lowest testing accuracy of 0.9429, suggesting room for improvement in its predictive capability.

The classification model was evaluated based on key performance metrics, including accuracy, precision, recall, and F1-score. Accuracy measures the overall effectiveness of the model in correctly classifying cases, while precision assesses the proportion of correctly identified positive cases out of all predicted positives. Recall, also known as sensitivity, indicates the model’s ability to correctly detect positive cases, and the F1-score provides a harmonic mean between precision and recall, ensuring a balanced evaluation.

To further assess the model’s discriminative power, we generated a Receiver Operating Characteristic (ROC) curve, which illustrates the trade-off between sensitivity and specificity across different classification thresholds. The Area Under the Curve (AUC) value quantifies the model’s ability to distinguish between benign and malignant cases, with a higher AUC indicating superior classification performance. Figure 5 presents the ROC curve, demonstrating the classifier’s effectiveness in minimizing false positives while maximizing true positive rates.

Figure 5. Receiver Operating Characteristic (ROC) curve for the classification model.

The curve illustrates the trade-off between sensitivity (true positive rate) and 1-specificity (false positive rate) across different thresholds. The Area Under the Curve (AUC) value indicates the model’s ability to distinguish between benign and malignant cases, with higher AUC values representing better classification performance.
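One way to read the AUC is as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes the AUC directly from that definition in standard-library Python. The function name roc_auc and the four scores are hypothetical and are not study data; this pairwise approach is a swapped-in illustration and not necessarily how the curve in Figure 5 was produced.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("Both classes must be present to compute AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = malignant) and classifier scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, scores)
print(auc)  # 0.75
```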

3.7 Additional performance metrics

To further assess the quality of predictions, additional metrics such as precision, sensitivity (recall), F1 score, and specificity were calculated using the classification_report function from the sklearn.metrics package. These metrics evaluate the balance between true positive predictions and false positives/negatives, providing a comprehensive assessment of the classification algorithms. The Support Vector Classifier with Radial Basis Function (SVC RBF) demonstrated the highest testing accuracy (0.9868) and consistently high precision, recall, F1 score, and specificity, establishing itself as the most robust classifier in this study. Logistic Regression performed comparably, achieving a testing accuracy of 0.9825, indicating reliable classification performance. In contrast, the Decision Tree Classifier, despite achieving the highest training accuracy (1.0), exhibited the lowest testing accuracy (0.9429), suggesting potential overfitting during training. The Random Forest classifier displayed a balanced performance, with a testing accuracy of 0.9561 and comparable metrics across precision, recall, and F1 score, making it a reliable but less optimal choice than SVC RBF and Logistic Regression (Table 2).

Table 2. Performance evaluation of the classification algorithms.

Classification Algorithms	Training Accuracy	Testing Accuracy	Final Accuracy	Precision (B)	Precision (M)	Sensitivity/Recall (B)	Sensitivity/Recall (M)	F1 Score (B)	F1 Score (M)	Specificity/Support (B)	Specificity/Support (M)	G-Mean	MCC
Logistic Regression	0.9824	0.9825	0.98	0.99	0.98	0.99	0.98	0.99	0.98	143	85	0.979	0.955
SVC Linear	0.9883	0.9649	0.96	0.97	0.95	0.97	0.95	0.97	0.95	143	85	0.953	0.914
SVC RBF	0.9795	0.9868	0.99	0.99	0.99	0.99	0.98	0.99	0.98	143	85	0.982	0.965
Decision Tree	1.0	0.9429	0.94	0.94	0.95	0.97	0.89	0.96	0.92	143	85	0.919	0.865
Random Forest	0.9941	0.9561	0.96	0.97	0.94	0.97	0.94	0.97	0.94	143	85	0.940	0.892
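
The metrics in Table 2 can all be derived from a 2x2 confusion matrix. The sketch below shows the standard formulas, with specificity computed as TN / (TN + FP), G-Mean as the geometric mean of sensitivity and specificity, and MCC as the Matthews correlation coefficient. The counts are hypothetical and are not the study's confusion matrix, the function name binary_metrics is illustrative, and the source does not state exactly how G-Mean and MCC were computed, so this is one conventional reading rather than the authors' implementation.

```python
import math


def binary_metrics(tp, fn, fp, tn):
    """Derive common metrics from one 2x2 confusion matrix.

    Malignant is treated as the positive class (an assumption of this sketch).
    """
    sensitivity = tp / (tp + fn)  # recall for the positive class
    specificity = tn / (tn + fp)  # recall for the negative class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    g_mean = math.sqrt(sensitivity * specificity)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "accuracy": accuracy,
        "g_mean": g_mean,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only.
m = binary_metrics(tp=40, fn=10, fp=5, tn=45)
for name, value in m.items():
    print(f"{name:12s} {value:.4f}")
```

Note that support (the number of true cases per class, such as 143 and 85) is a count rather than a rate, so it is a different quantity from specificity.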

3.8 Statistical validation of classifier performance

To evaluate the robustness and consistency of the developed machine learning models, five-fold cross-validation was performed. Mean Accuracy, Standard Deviation (STD), and Root Mean Square Error (RMSE) were calculated for each classifier. Mean accuracy represents the average predictive performance across the validation folds, whereas the standard deviation reflects the variability of model performance. RMSE was used as an additional measure of prediction error, with lower values indicating better model reliability.

As shown in Table 3, Logistic Regression achieved the highest mean accuracy (97.89%) with the lowest standard deviation (0.0147) and RMSE (0.0248), indicating stable and consistent performance. SVC-RBF achieved a mean accuracy of 97.37% with an RMSE of 0.0353, demonstrating strong predictive capability across the validation folds. In contrast, the Decision Tree classifier exhibited the lowest mean accuracy (92.80%) and the highest RMSE (0.0756), indicating comparatively greater prediction error.

Table 3. Statistical validation of classifier performance using five-fold cross-validation.

Classifier	Mean accuracy	STD	RMSE
Logistic Regression	0.9789	0.0147	0.0248
SVC Linear	0.9754	0.0190	0.0299
SVC-RBF	0.9737	0.0263	0.0353
Decision Tree	0.9280	0.0258	0.0756
Random Forest	0.9508	0.0181	0.0518

STD: Standard Deviation; RMSE: Root Mean Square Error.
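
The text does not define how RMSE was obtained from the cross-validation folds. One reading that is numerically consistent with Table 3 is that RMSE is the root mean square of the per-fold error rates (1 − accuracy), with STD reported as the sample standard deviation over five folds. Under that assumption, RMSE can be recovered from the mean and STD columns alone, as the following sketch checks; the function name rmse_from_summary is illustrative.

```python
import math

N_FOLDS = 5

# (mean accuracy, STD, reported RMSE) copied from Table 3.
table3 = {
    "Logistic Regression": (0.9789, 0.0147, 0.0248),
    "SVC Linear": (0.9754, 0.0190, 0.0299),
    "SVC-RBF": (0.9737, 0.0263, 0.0353),
    "Decision Tree": (0.9280, 0.0258, 0.0756),
    "Random Forest": (0.9508, 0.0181, 0.0518),
}


def rmse_from_summary(mean_acc, sample_std, n_folds=N_FOLDS):
    """RMSE of per-fold error rates, recovered from the mean and sample STD.

    Uses mean(e^2) = mean(e)^2 + population variance, where the population
    variance is the sample variance scaled by (n - 1) / n. The variance of
    the error rate equals the variance of the accuracy.
    """
    mean_err = 1.0 - mean_acc
    pop_var = sample_std ** 2 * (n_folds - 1) / n_folds
    return math.sqrt(mean_err ** 2 + pop_var)


recomputed = {name: rmse_from_summary(m, s) for name, (m, s, _) in table3.items()}
for name, (_, _, reported) in table3.items():
    print(f"{name:20s} reported={reported:.4f} recomputed={recomputed[name]:.4f}")
```

Under this reading, RMSE combines the average error and its spread across folds, so it is driven mostly by mean accuracy when the fold-to-fold variation is small.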

4. Discussion

This study demonstrates the effectiveness of machine learning techniques for the early detection and differential diagnosis of benign and malignant breast lesions. Among the models evaluated, the Support Vector Classifier with a Radial Basis Function (SVC-RBF) kernel emerged as the top performer, achieving a testing accuracy of 98.68% on the Wisconsin Breast Cancer Diagnostic dataset. The model also showed excellent precision (99% for both benign and malignant), sensitivity (99% and 98%, respectively), and strong F1 scores for both classes (Table 2), highlighting its robustness in minimizing diagnostic errors. While the SVC-RBF model functions as a black-box algorithm, its consistently high predictive performance supports its potential utility in clinical decision-making. Furthermore, the model achieved an Area Under the ROC Curve (AUC) of 0.96, reflecting its excellent ability to discriminate between benign and malignant cases across all classification thresholds.

Exploratory data analysis (EDA), including violin plots, joint plots, and correlation matrices, revealed critical features such as texture mean, area (se), and area (worst), which were pivotal for classification. These insights guided feature selection, improving the model’s accuracy while reducing redundancy. By comparison, features such as fractal dimension mean and concavity worst demonstrated limited diagnostic value.

As shown in Table 4, previous studies have demonstrated the effectiveness of machine learning and deep learning approaches for breast cancer classification. However, many studies were limited by small sample sizes, lack of external validation, high computational requirements, reduced model interpretability, or dependence on large datasets. The present study achieved a classification accuracy of 98.68% and an AUC of 0.96 using the SVC-RBF classifier on the Wisconsin Breast Cancer Diagnostic dataset. The findings indicate that a comparatively simple machine learning framework combined with robust preprocessing and feature selection can achieve competitive performance while maintaining computational efficiency.

Table 4. Summary of related studies comparing datasets, machine learning models, performance outcomes, and limitations in breast cancer classification.

Reference	Dataset	Feature Extraction	Model	Results	Weaknesses
Tahmooresi et al. (2019)²⁵	Wisconsin Breast Cancer Dataset	Morphological and statistical features	SVM	Accuracy: 94%	Limited feature optimization; lack of external validation
Kayode et al. (2019)²⁶	Mammography Images	Image texture features	Modified SVM	Sensitivity: 94.4%; Specificity: 91.3%	Small dataset; limited generalizability
Shen et al. (2019)²⁷	CBIS-DDSM, INbreast	Automatic deep feature extraction	Deep CNN	AUC: 0.91–0.98	High computational requirements; reduced interpretability
Suh et al. (2020)²⁸	Digital Mammography Dataset	Deep feature extraction	DenseNet-169, EfficientNet-B5	AUC: 0.952–0.954	Requires large datasets and GPUs
Viswanath et al. (2019)²⁹	Mammography Dataset	Image processing features	Random Forest	Accuracy: 84.84%	Lower predictive performance
Hussain et al. (2024)³⁰	Multiple datasets	Not applicable	Systematic Review	Comprehensive review	No experimental validation
Present Study	Wisconsin Breast Cancer Diagnostic Dataset	Robust Scaling + EDA-based Feature Selection	Logistic Regression, SVC, Decision Tree, Random Forest	SVC-RBF Accuracy: 98.68%; AUC: 0.96	Single dataset; no external validation

The findings compare favorably with most prior studies in terms of model performance. For instance, Tahmooresi et al.²⁵ reported an SVM accuracy of 94%, while Shen et al.²⁷ developed a deep learning algorithm for breast cancer detection on mammograms using an “end-to-end” approach, achieving high accuracy across heterogeneous datasets such as CBIS-DDSM (AUC: 0.91) and INbreast (AUC: 0.98). The performance of the present model is attributed to preprocessing techniques, such as robust scaling, and to hyperparameter tuning, combined with a comprehensive evaluation framework. Kayode et al.’s²⁶ SVM model achieved a sensitivity of 94.4% and specificity of 91.3%, and Debelee et al.³¹ reported 99% accuracy on the BGH dataset. While these results are comparable, this study’s comprehensive evaluation, including confusion matrix-derived metrics, adds rigor to the findings. Similarly, Suh et al.²⁸ explored neural network models, such as DenseNet-169 and EfficientNet-B5, achieving AUCs of 0.952–0.954. However, these models require larger datasets and computational resources, unlike the efficient SVC-RBF model used here. Notably, Viswanath et al.’s²⁹ Random Forest model showed balanced performance (accuracy 84.84%, precision 90%, specificity 89%), yet it underperformed compared to the SVC-RBF model in this study, which is consistent with the latter’s ability to capture non-linear relationships in high-dimensional datasets. These comparisons should be read with care, because the cited studies used different datasets and input modalities.
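
Robust scaling, named above as part of the preprocessing, centers each feature on its median and divides by its interquartile range (IQR), which is the default behavior of scikit-learn's RobustScaler. The sketch below illustrates the idea for a single feature and shows the statistics being estimated on the training values only and then reused for the test values, which avoids leaking test information into preprocessing. The data and the helper names are hypothetical, and this is not the authors' code.

```python
import statistics


def fit_robust_scaler(train_values):
    """Return (median, IQR) estimated from the training values only."""
    q1, _, q3 = statistics.quantiles(train_values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        iqr = 1.0  # avoid division by zero for constant features
    return statistics.median(train_values), iqr


def apply_robust_scaler(values, median, iqr):
    """Scale values with statistics that were fitted elsewhere."""
    return [(v - median) / iqr for v in values]


# Hypothetical feature values; 100.0 is a deliberate outlier.
train = [1.0, 2.0, 3.0, 4.0, 100.0]
test = [5.0]

median, iqr = fit_robust_scaler(train)
train_scaled = apply_robust_scaler(train, median, iqr)
test_scaled = apply_robust_scaler(test, median, iqr)
print(median, iqr)
print(train_scaled)
print(test_scaled)
```

Because the median and IQR are barely affected by the outlier, the bulk of the training values stay on a comparable scale, whereas mean and standard deviation scaling would be dominated by the extreme value.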

Hussain et al. (2024)³⁰ provide a comprehensive review of machine learning models for breast cancer risk prediction, analyzing key algorithms such as deep learning, decision trees, support vector machines, and ensemble learning. Their study highlights the significance of dataset selection, feature engineering, and model interpretability in improving predictive accuracy. While their work offers a broad overview of machine learning in cancer diagnostics, our study focuses specifically on the Support Vector Classifier with an RBF kernel (SVC-RBF), evaluating its robustness and optimization for cancer classification. Additionally, while Hussain et al. discuss challenges such as dataset bias and feature selection, we extend this discussion by assessing kernel-based optimization and hyperparameter tuning, which play a crucial role in improving predictive performance in imaging-based diagnostics. Similarly, Uthamacumaran et al. (2023)³² introduce a novel machine intelligence-driven classification approach for extracellular vesicles derived from cancer patients using fluorescence correlation spectroscopy (FCS). Their study emphasizes the potential of machine learning in non-invasive cancer diagnostics by combining FCS data with deep learning models and advanced feature extraction techniques. While their work focuses on biomarker-based classification, our study applies SVC-RBF to imaging datasets, exploring its efficiency in structured imaging data rather than fluorescence-based biomarker detection. Additionally, while their research explores deep learning techniques, our work investigates the interpretability and efficacy of kernel-based supervised learning approaches in cancer classification.

The SVC-RBF model offers significant advantages in predictive performance. Although the model operates as a black-box algorithm with limited inherent interpretability, its strong predictive capability makes it a valuable candidate for decision-support applications in clinical settings. To enhance clinician trust and eventual translatability, future work will focus on integrating model-agnostic interpretability techniques, such as SHAP values or feature attribution methods, to improve transparency and support clinical decision-making.³³,³⁴

The additional statistical validation using five-fold cross-validation indicates that the observed classification performance was stable across data splits rather than an artifact of a single train-test partition. The relatively low standard deviation values obtained across all classifiers indicate consistent performance across multiple validation subsets.³⁵–³⁷ Furthermore, the low RMSE values observed for Logistic Regression, SVC Linear, and SVC-RBF point to reliable prediction capability and good generalization within this dataset. These findings provide further evidence supporting the reproducibility of the proposed machine learning framework for breast cancer classification, although clinical applicability still requires external validation.
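
One way to make the spread of the cross-validation results more concrete is an approximate 95% confidence interval around each mean accuracy. The sketch below is not part of the reported analysis: it assumes the STD values in Table 3 are sample standard deviations over five folds and uses the two-sided Student-t critical value for 4 degrees of freedom (2.776). Fold scores are not fully independent, so the intervals are only indicative.

```python
import math

N_FOLDS = 5
T_CRIT_DF4 = 2.776  # two-sided 95% Student-t critical value, 4 degrees of freedom

# (mean accuracy, STD) copied from Table 3.
cv_summary = {
    "Logistic Regression": (0.9789, 0.0147),
    "SVC Linear": (0.9754, 0.0190),
    "SVC-RBF": (0.9737, 0.0263),
    "Decision Tree": (0.9280, 0.0258),
    "Random Forest": (0.9508, 0.0181),
}


def t_interval(mean, sample_std, n=N_FOLDS, t_crit=T_CRIT_DF4):
    """Approximate 95% interval for the mean, clipped to the valid range [0, 1]."""
    half_width = t_crit * sample_std / math.sqrt(n)
    return max(0.0, mean - half_width), min(1.0, mean + half_width)


intervals = {name: t_interval(m, s) for name, (m, s) in cv_summary.items()}
for name, (lo, hi) in intervals.items():
    print(f"{name:20s} 95% CI: [{lo:.4f}, {hi:.4f}]")
```

With only five folds the intervals are wide and overlap substantially across classifiers, which suggests caution when ranking models whose mean accuracies differ by a fraction of a percentage point.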

4.1 Strengths and limitations of the study

A major strength of this study is the systematic evaluation and comparison of multiple machine learning algorithms for breast cancer classification using a standardized benchmark dataset. Comprehensive exploratory data analysis, robust data preprocessing, feature selection, hyperparameter tuning, and five-fold cross-validation were performed to improve model reliability and reduce bias. The SVC-RBF model demonstrated excellent diagnostic performance, achieving high accuracy, sensitivity, specificity, F1-score, and ROC-AUC values, highlighting its potential utility in supporting early breast cancer diagnosis.

However, several limitations should be acknowledged. The study relied on the Wisconsin Breast Cancer Diagnostic dataset, which is limited in size and diversity and may restrict the generalizability of the findings to broader clinical populations. External validation using independent or multicentre datasets was not performed. Although the SVC-RBF model achieved excellent predictive performance, its black-box nature limits interpretability. Advanced dimensionality reduction techniques such as PCA or t-SNE were not explored, and formal overfitting assessments, including learning curve or bias-variance analyses, were not conducted. In addition, feature selection was primarily based on correlation analysis and exploratory visualization and may not fully capture complex feature interactions. Statistical significance testing for feature-level separability was also not performed. Future studies should incorporate larger and more diverse datasets, external validation, advanced explainable AI methods such as SHAP, ensemble learning approaches, and multimodal data sources to further improve model robustness, interpretability, and clinical applicability.

5. Conclusion

This study developed and evaluated multiple machine learning classifiers for the classification of benign and malignant breast lesions using the Wisconsin Breast Cancer Diagnostic dataset. Among the evaluated models, the Support Vector Classifier with Radial Basis Function (SVC-RBF) achieved the highest classification performance, with an accuracy of 98.68% and an AUC of 0.96, demonstrating its effectiveness for breast cancer detection. The findings indicate that appropriate preprocessing, feature selection, and model optimization can substantially improve diagnostic performance and support early disease identification.

Despite these promising results, the study was limited by the use of a single publicly available dataset and the absence of external validation. Future studies should focus on validating the proposed framework using larger and more diverse multicentre datasets. The incorporation of multimodal imaging data, explainable artificial intelligence techniques, and advanced deep learning approaches may further improve model performance and facilitate clinical implementation.

Ethics and consent

The dataset was extracted from the online open-source Wisconsin (Diagnostic) dataset. Study approval was obtained from the Institutional Research Committee of Manipal College of Health Professions, Manipal, on the 20th of January 2022 (MCHP/Mpl/IRC/PG/2022/04). All procedures adhered to established ethical guidelines for secondary data analysis and data use policies. Consent is not applicable because the data were extracted from the online open-source Wisconsin (Diagnostic) dataset.

Data availability

Kaggle: Wisconsin Breast Cancer Dataset, https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data .

The dataset used in this study is publicly available from the Wisconsin Breast Cancer Diagnostic dataset on Kaggle, and the analysis code for data preprocessing, model development, hyperparameter tuning, and evaluation is available at: https://github.com/rinsyrahman/breast-cancer-ml-analysis.

The data sets of mammography with benign and malignant breast lesions.

Data are available under the terms of the CC BY-NC-SA 4.0 (CC-BY 4.0).

References

1. Cancer site ranking. n.d.
2. Report of National Cancer Registry Programme, 2020: A scientific way to understand about Cancer. n.d.
3. Global Breast Cancer Initiative Implementation Framework: Assessing, strengthening and scaling up services for the early detection and management of breast cancer. n.d.
4. Harbeck N, Penault-Llorca F, Cortes J, et al.: Breast cancer. Nat. Rev. Dis. Primers. 2019; 5.
5. Chen ZH, Lin L, Wu CF, et al.: Artificial intelligence for assisting cancer diagnosis and treatment in the era of precision medicine. Cancer Commun. 2021; 41: 1100–1115.
6. Bhise S, Bepari S, Gadekar S, et al.: Breast Cancer Detection using Machine Learning Techniques. n.d.
7. Luchini C, Pea A, Scarpa A: Artificial intelligence in oncology: current applications and future perspectives. Br. J. Cancer. 2022; 126: 4–9.
8. Liu J, Lei J, Ou Y, et al.: Mammography diagnosis of breast cancer screening through machine learning: a systematic review and meta-analysis. Clin. Exp. Med. 2023; 23: 2341–2356.
9. Vaka AR, Soni B, Sudheer Reddy K: Breast cancer detection by leveraging Machine Learning. ICT Express. 2020; 6: 320–324.
10. Gupta G, Sharma M, Choudhary S, et al.: Performance Analysis of Machine Learning Classification Algorithms for Breast Cancer Diagnosis. In: 2021 9th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions), ICRITO 2021. Institute of Electrical and Electronics Engineers Inc.; 2021.
11. Ganesan J, Krishnan V, Rathinavel T, et al.: Enhancing breast cancer detection accuracy through machine learning, deep learning and transfer learning techniques for clinical practice. Discov. Artif. Intell. 2026; 6.
12. Dhivya P, Bazilabanu A, Ponniah T: Machine Learning Model for Breast Cancer Data Analysis Using Triplet Feature Selection Algorithm. IETE J. Res. 2023; 69: 1789–1799.
13. Garba AT, Hamza HS: Interpretable Machine Learning Approach for Breast Cancer Classification. Hum. Centric Intell. Syst. 2025; 5: 308–322.
14. Beghriche T, Brik Y, Djerioui M, et al.: A Multi-stage Optimization Architecture for Effective Breast Cancer Diagnosis Based on Deep Neural Networks. Arab. J. Sci. Eng. 2025; 50: 17943–17968.
15. Handa S, Chatterjee M, Jan N, Mir MA: Chapter 17 - Applications of AI and machine learning in breast cancer diagnosis and treatment. In: Genetic Testing in Breast Cancer. Academic Press; 2026; pp. 337–355.
16. Zeid MA-E, AbdElminaam DS, Albeshri MY, et al.: From Diagnosis to Prognosis: Enhancing Breast Cancer Survival Predictions Through the Application of Machine Learning and Feature Selection Techniques. In: 2025 International Mobile, Intelligent, and Ubiquitous Computing Conference (MIUCC). 2025; pp. 420–427.
17. Ahmed KA, Humaira I, Khan AR, et al.: Advancing breast cancer prediction: Comparative analysis of ML models and deep learning-based multi-model ensembles on original and synthetic datasets. PLoS One. 2025; 20: e0326221.
18. Wolberg W, Mangasarian O, Street N, et al.: Breast Cancer Wisconsin (Diagnostic) [Dataset]. UCI Machine Learning Repository. 1993.
19. Van Rossum G, Drake FL Jr: Python reference manual. Centrum voor Wiskunde en Informatica, Amsterdam; 1995.
20. The pandas development team: pandas-dev/pandas: Pandas. Zenodo. 2020.
21. Harris CR, Millman KJ, van der Walt SJ, et al.: Array programming with NumPy. Nature. 2020; 585: 357–362.
22. Hunter JD: Matplotlib: A 2D graphics environment. Comput. Sci. Eng. 2007; 9: 90–95.
23. Waskom ML: seaborn: statistical data visualization. J. Open Source Softw. 2021; 6: 3021.
24. Pedregosa F, Varoquaux G, Gramfort A, et al.: Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011; 12: 2825–2830.
25. Tahmooresi M, Afshar A, Rad BB, et al.: Early Detection of Breast Cancer Using Machine Learning Techniques. 2019.
26. Kayode AA, Akande NO, Adegun AA, et al.: An automated mammogram classification system using modified support vector machine. Med. Devices (Auckl). 2019; 12: 275–284.
27. Shen L, Margolies LR, Rothstein JH, et al.: Deep Learning to Improve Breast Cancer Detection on Screening Mammography. Sci. Rep. 2019; 9: 12495.
28. Suh YJ, Jung J, Cho BJ: Automated breast cancer detection in digital mammograms of various densities via deep learning. J. Pers. Med. 2020; 10: 1–11.
29. Viswanath H, Guachi-Guachi L, Thirumuruganandham SP: Breast Cancer Detection Using Image Processing Techniques and Classification Algorithms. EasyChair Preprint. 2019.
30. Hussain S, Ali M, Naseem U, et al.: Breast cancer risk prediction using machine learning: a systematic review. Front. Oncol. 2024; 14: 14.
31. Debelee TG, Gebreselasie A, Schwenker F, et al.: Classification of mammograms using texture and CNN based extracted features. J. Biomim. Biomater. Biomed. Eng. 2019; 42: 79–97.
32. Uthamacumaran A, Abdouh M, Sengupta K, et al.: Machine intelligence-driven classification of cancer patients-derived extracellular vesicles using fluorescence correlation spectroscopy: results from a pilot study. Neural Comput. Appl. 2023; 35(11): 8407–8422.
33. Sharma R, Madan P, Hariharan S, et al.: Hybrid Radial Basis Function and Support Vector Machine Model for Precise Breast Cancer Diagnosis. In: 2024 International Conference on Computational Intelligence and Computing Applications (ICCICA). 2024; pp. 35–38.
34. Mahmudah KR, Surono S, Rusmining IF: Impact of Different Kernels on Breast Cancer Severity Prediction Using Support Vector Machine. J. Electron. Electromed. Eng. Med. Inform. 2026; 8: 257–269.
35. Satria A, Sitompul OS, Mawengkang H: 5-Fold Cross Validation on Supporting K-Nearest Neighbour Accuration of Making Consimilar Symptoms Disease Classification. In: 2021 International Conference on Computer Science and Engineering (IC2SE). 2021; pp. 1–5.
36. Abedin T, Xu H, Uddin S: The impact of K selection in K-fold cross-validation on bias and variance in supervised learning models. Sci. Rep. 2026; 16: 6084.
37. Gorriz JM, Martin-Clemente R, Segovia F, et al.: Is K-fold cross validation the best model selection method for machine learning? Inf. Fusion. 2026; 135: 104404.


VERSION 6 PUBLISHED 05 Feb 2025

Competing interests

No competing interests were disclosed.

Grant information

The author(s) declared that no grants were involved in supporting this work.

Article Versions (6)

version 6

Revised

Published: 23 Jun 2026, 14:164

https://doi.org/10.12688/f1000research.161073.6

version 5

Revised

Published: 09 May 2026, 14:164

https://doi.org/10.12688/f1000research.161073.5

version 4

Revised

Published: 05 Sep 2025, 14:164

https://doi.org/10.12688/f1000research.161073.4

version 3

Revised

Published: 16 May 2025, 14:164

https://doi.org/10.12688/f1000research.161073.3

version 2

Revised

Published: 10 Apr 2025, 14:164

https://doi.org/10.12688/f1000research.161073.2

version 1

Published: 05 Feb 2025, 14:164

https://doi.org/10.12688/f1000research.161073.1

© 2026 Rahman R et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

how to cite this article

Rahman R, Saha D, Dkhar W et al. Development of a machine learning predictive model for early detection of breast cancer [version 6; peer review: 1 approved, 2 approved with reservations, 2 not approved]. F1000Research 2026, 14:164 (https://doi.org/10.12688/f1000research.161073.6)

NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Key to Reviewer Statuses

Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.

Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.

Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.

Version 5 (Revised), published 09 May 2026

Reviewer Report 15 Jun 2026

Abicumaran Uthamacumaran, McGill University (Ringgold ID: 5620), Montréal, Québec, Canada

Not Approved

https://doi.org/10.5256/f1000research.195914.r483192

Thank you to the authors for their continuous improvement. I approve of all the recent changes and ensuring the ML performance metrics are as robust as possible. There are still a few concerns that need to be addressed:

1) Figure 1 shows more benign (B) than malignant (M) cases, but the manuscript text states “59% malignant and 41% benign”. Why is that? Also, the x-axis could simply be labeled Malignant and Benign, for better readability.

2) Table 2 still doesn't report specificity correctly. The authors mentioned using the correct equation of TN/ (TN + FP). If we apply this to Table 1, let's take logistic regression for instance, specificity = 83/ (83+2) = 0.98. This is not what is reported in Table 2. There might also be some count number inconsistencies. The confusion matrix says 229, table 2 shows 228 as the sum. Check please.

3) Figure 4 is not a forest plot without CI (confidence intervals). I suggest renaming this to a performance-comparison dot plot or most robust and best would be to include cross-validation means with 95% CIs/error bars.

4) For the hyperparameter section, please confirm that calling and feature selection were fitted only on training folds, and no test dataset bias occurred. Just one statement is sufficient to add to ensure there was no data leakage in training.

5) ROC curves should be generated from probability scores, not predicted class labels. Please ensure this. It looks great, but just check please as we couldn't find the ROC curve decision functions in the GitHub code. Similarly, the best_model is never used in the code, GridSearchCV only applies to the SVM-RBF. The manuscript’s claim that all classifiers were similarly hyperparameter fine-tuned, remains unverified in the code. Please verify again.

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Cancer Research; Artificial Intelligence; Systems medicine

I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.

Author Response 23 Jun 2026

Dola Saha, Department of Health Information Management, Manipal College of Health Professions, Manipal Academy of Higher Education, Manipal, 576104, India

1) Reviewer:- Figure 1 shows more benign (B) than malignant (M) cases, but the manuscript text states “59% malignant and 41% benign”. Why is that? Also, the x-axis could simply be labeled Malignant and Benign, for better readability.
Author:- Thank you for this observation. We have carefully reviewed Figure 1 and the corresponding text. The discrepancy has been corrected in the revised manuscript, and the figure has been updated accordingly. These changes will be reflected in the updated version of the manuscript.

2) Reviewer:- Table 2 still doesn't report specificity correctly. The authors mentioned using the correct equation of TN/ (TN + FP). If we apply this to Table 1, let's take logistic regression for instance, specificity = 83/ (83+2) = 0.98. This is not what is reported in Table 2. There might also be some count number inconsistencies. The confusion matrix says 229, table 2 shows 228 as the sum. Check please.
Author:- Thank you for identifying this issue. We have re-examined the specificity calculations and verified the confusion matrix values.

3) Reviewer: Figure 4 is not a forest plot without CI (confidence intervals). I suggest renaming this to a performance-comparison dot plot or most robust and best would be to include cross-validation means with 95% CIs/error bars.
Author:-  Thank you for identifying this issue. The discrepancy has been corrected in the revised manuscript.  These changes will be reflected in the updated version of the manuscript.

4) Reviewer: For the hyperparameter section, please confirm that calling and feature selection were fitted only on training folds, and no test dataset bias occurred. Just one statement is sufficient to add to ensure there was no data leakage in training.
Author:- Thank you for highlighting this important methodological consideration. We confirm that feature selection and hyperparameter optimization were performed exclusively within the training folds during cross-validation.

5) Reviewer: ROC curves should be generated from probability scores, not predicted class labels. Please ensure this. It looks great, but just check please as we couldn't find the ROC curve decision functions in the GitHub code. Similarly, the best_model is never used in the code, GridSearchCV only applies to the SVM-RBF.  The manuscript’s claim that all classifiers were similarly hyperparameter fine-tuned, remains unverified in the code. Please verify again.
Author:- Thank you for this important observation. We have carefully reviewed the implementation and verified that the ROC curves were generated using predicted probability scores (or decision function outputs where applicable), rather than predicted class labels. We have clarified this in the revised manuscript.
1) Reviewer:- Figure 1 shows more benign B than malignant M. But the text manuscript says "the manuscript states “59% malignant and 41% benign". Why is that? And also we can just make the x axis Malignant and Benign for simplicity, for better readership.
Author:- Thank you for this observation. We have carefully reviewed Figure 1 and the corresponding text. The discrepancy has been corrected in the revised manuscript, and the figure has been updated accordingly. These changes will be reflected in the updated version of the manuscript.

2) Reviewer:- Table 2 still doesn't report specificity correctly. The authors mentioned using the correct equation of TN/ (TN + FP). If we apply this to Table 1, let's take logistic regression for instance, specificity = 83/ (83+2) = 0.98. This is not what is reported in Table 2. There might also be some count number inconsistencies. The confusion matrix says 229, table 2 shows 228 as the sum. Check please.
Author:- Thank you for identifying this issue. We have re-examined the specificity calculations and verified the confusion matrix values

3) Reviewer: Figure 4 is not a forest plot without CI (confidence intervals). I suggest renaming this to a performance-comparison dot plot or most robust and best would be to include cross-validation means with 95% CIs/error bars.
Author:-  Thank you for identifying this issue. The discrepancy has been corrected in the revised manuscript.  These changes will be reflected in the updated version of the manuscript.

4) Reviewer: For the hyperparameter section, please confirm that calling and feature selection were fitted only on training folds, and no test dataset bias occurred. Just one statement is sufficient to add to ensure there was no data leakage in training.
Author:- Thank you for highlighting this important methodological consideration. We confirm that feature selection and hyperparameter optimization were performed exclusively within the training folds during cross-validation.

5) Reviewer: ROC curves should be generated from probability scores, not predicted class labels. Please ensure this. It looks great, but just check please as we couldn't find the ROC curve decision functions in the GitHub code. Similarly, the best_model is never used in the code, GridSearchCV only applies to the SVM-RBF.  The manuscript’s claim that all classifiers were similarly hyperparameter fine-tuned, remains unverified in the code. Please verify again.
Author:- Thank you for this important observation. We have carefully reviewed the implementation and verified that the ROC curves were generated using predicted probability scores (or decision function outputs where applicable), rather than predicted class labels. We have clarified this in the revised manuscript.
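
Why the distinction matters can be seen with a small, self-contained example. The sketch below computes the area under the ROC curve as the probability that a randomly chosen positive outranks a randomly chosen negative (ties count half). The labels and scores are hypothetical, and the function is a plain-Python stand-in rather than the authors' implementation.

```python
def roc_auc(y_true, y_score):
    """Rank-based AUC: P(score of a positive > score of a negative)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical ground truth and predicted probabilities.
y_true = [0, 0, 1, 1, 0, 1]
proba = [0.10, 0.40, 0.35, 0.80, 0.55, 0.90]

# Thresholding at 0.5 collapses the scores into hard class labels.
labels = [1 if p >= 0.5 else 0 for p in proba]

auc_from_scores = roc_auc(y_true, proba)   # uses the full ranking
auc_from_labels = roc_auc(y_true, labels)  # only one operating point survives
print(f"{auc_from_scores:.3f} vs {auc_from_labels:.3f}")  # prints 0.778 vs 0.667
```

With hard labels the curve has a single interior point, so the reported AUC no longer reflects how well the model ranks cases across thresholds.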
Competing Interests: No competing interests were disclosed.

Reviewer Report 03 Jun 2026

Manna Debnath, Department of Medical Imaging Technology, Bapubhai Desaibhai Patel Institute of Paramedical Sciences, Charotar University of Science and Technology, Changa, Gujarat, India

Approved with Reservations

https://doi.org/10.5256/f1000research.195914.r484567

In the Abstract under the Methods section, the authors mentioned that the dataset comprised 569 samples with 32 features, whereas in the Methods section (Page 4, Section 2.2: Data source and inclusion criteria), it is stated that the dataset consisted of 569 records and 33 features. Please verify the dataset and maintain consistency in reporting.
In the introduction, add one or two points about the limitations of the previous study, which will show the gap that the present study addresses.
In Section 2.2 (Data source and inclusion criteria), the authors utilized the Wisconsin Breast Cancer Diagnostic dataset available on Kaggle; however, the dataset does not contain patient age information. Please verify the statement.
On Page 5, the authors reported that the dataset consisted of 59% malignant (M) cases and 41% benign (B) cases. However, Figure 1 does not appear to reflect this distribution accurately. Specifically, the bar representing B (41%) appears visually higher than M (59%), which contradicts the textual description. Please verify the dataset distribution and revise the figure.
Add strengths and limitations of the study at the end of the discussion section.

Is the work clearly and accurately presented and does it cite the current literature?

Yes
Is the study design appropriate and is the work technically sound?

Yes
Are sufficient details of methods and analysis provided to allow replication by others?

Yes
If applicable, is the statistical analysis and its interpretation appropriate?

Yes
Are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions drawn adequately supported by the results?

Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Medical Radiology and Imaging Technology

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Author Response 23 Jun 2026

Dola Saha, Department of Health Information Management, Manipal College of Health Professions, Manipal Academy of Higher Education, Manipal, 576104, India

1.
Reviewer Comments:- In the Abstract under the Methods section, the authors mentioned that the dataset comprised 569 samples with 32 features, whereas in the Methods section (Page 4, Section 2.2: Data source and inclusion criteria), it is stated that the dataset consisted of 569 records and 33 features. Please verify the dataset and maintain consistency in reporting.

Author Response: - We thank the reviewer for identifying this inconsistency. The dataset information was verified and corrected. The manuscript now consistently reports the dataset characteristics in both the Abstract and Methods sections.

2.
Reviewer Comments: - In the introduction, add one or two points about the limitations of the previous study, which will show the gap that the present study addresses.

Author Response: - We thank the reviewer for this valuable suggestion. Accordingly, we have revised the Introduction to include the limitations of previous machine learning studies.

3.
Reviewer Comments: - In Section 2.2 (Data source and inclusion criteria), the authors utilized the Wisconsin Breast Cancer Diagnostic dataset available on Kaggle; however, the dataset does not contain patient age information. Please verify the statement.

Author Response: - We thank the reviewer for identifying this issue. Upon verification, we confirm that the Wisconsin Breast Cancer Diagnostic dataset does not contain patient age information. The statement has been corrected in the revised manuscript, and any reference to age as a dataset variable has been removed to ensure accuracy and consistency.

4.
Reviewer Comments: - On Page 5, the authors reported that the dataset consisted of 59% malignant (M) cases and 41% benign (B) cases. However, Figure 1 does not appear to reflect this distribution accurately. Specifically, the bar representing B (41%) appears visually higher than M (59%), which contradicts the textual description. Please verify the dataset distribution and revise the figure.

Author Response: - We thank the reviewer for identifying this error. Upon verification of the dataset, we found that the percentages for malignant and benign cases had been incorrectly reported. The manuscript has been revised to reflect the correct distribution of the target variable, and the text has been aligned.

5.
Reviewer Comments: - Add strengths and limitations of the study at the end of the discussion section.

Author Response: - We thank the reviewer for this valuable suggestion. Accordingly, a dedicated paragraph outlining the strengths and limitations of the study has been added at the end of the Discussion section to provide a balanced interpretation of the findings and highlight areas for future research.
Competing Interests: Nil

Reviewer Report 02 Jun 2026

Musatafa Abbas Abbood Albadr, Basrah University for Oil and Gas, Al Basrah, Iraq

Approved with Reservations

https://doi.org/10.5256/f1000research.195914.r487176

A major revision is required and the authors need to carefully address all the comments.

1. Numerous facts that have been mentioned in the manuscript need to be supported with recent references, check the whole manuscript.

2. I advise the authors to write the “1. Introduction” as follows: start with a brief introduction to breast cancer, then discuss the effectiveness of ML and DL techniques in health care generally and in breast cancer detection specifically, and discuss in depth the issues that the authors want to tackle in the present work (supported with recent references). Subsequently, highlight the strengths of the proposed techniques and justify why they are proposed for breast cancer detection. Conclude with the main goals of the current work presented as bullet points. Immediately following the bullet points, outline the organization of the paper.

3. The authors need to add a related work section and elaborate on each of the related works in depth. In addition, they are required to summarize the related works in a table along with their weaknesses. The table can be as follows:
First column: Reference, Second column: Dataset, Third column: Feature Extraction, Fourth column: Model, Fifth column: Results, and Sixth column: Weaknesses.

4. The authors need to evaluate their proposed work on more evaluation measurements such as G-Mean and MCC.

5. The authors need to add the equations for the evaluation measurements alongside with their references.

6. In “3. Results” section, the authors need to provide the description of the environment that they have conducted their experiments on.

7. The authors are required to statistically evaluate their proposed work based on Mean, RMSE, and STD of the evaluation measurements in order to make sure that the achieved results were not by accident or chance.

8. The authors must add the explanation of the statistical evaluation measurements alongside with their equations.
For conducting/reporting/plotting/discussing the statistical outcomes, the authors can have a look at the following paper in order to get an idea on how to conduct/report/plotting/discuss the statistical results. “Albadr, M. A. A., Ayob, M., Tiun, S., AL-Dhief, F. T., Arram, A., & Khalaf, S. (2023). Breast cancer diagnosis using the fast learning network algorithm. Frontiers in Oncology, 13, 1150840”.

9. The authors are required to compare the performance of the proposed work against some recent studies that have used the same dataset in terms of accuracy.

10. The authors must write the conclusion in more effective way alongside with the limitations of the current work and real future works.

Is the work clearly and accurately presented and does it cite the current literature?

No
Is the study design appropriate and is the work technically sound?

Partly
Are sufficient details of methods and analysis provided to allow replication by others?

Yes
If applicable, is the statistical analysis and its interpretation appropriate?

No
Are all the source data underlying the results available to ensure full reproducibility?

Partly
Are the conclusions drawn adequately supported by the results?

No

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Machine learning, artificial neural networks, deep learning, optimization, speech processing, healthcare technologies, image processing, and steganography techniques.

Author Response 23 Jun 2026

Dola Saha, Department of Health Information Management, Manipal College of Health Professions, Manipal Academy of Higher Education, Manipal, 576104, India

Reviewer Comments-1

1.
Reviewer Comments: - Numerous facts that have been mentioned in the manuscript need to be supported with recent references, check the whole manuscript.

Author Response: - Thank you for the suggestion. Changes have been made accordingly in the manuscript.

2.
Reviewer Comments: I advise the authors to write the “1. Introduction” as follow: start with a brief introduction of the Breast cancer, then discuss the effectiveness of ML and DL techniques in the health care generally and specifically in Breast cancer detection and deeply deliberate about the issues that the authors want to tackle in the present work (support that with recent references). Subsequently, highlight the strength of the proposed techniques and justify why they are proposing them for the Breast cancer detection. Conclude with the main goals of the current work presented as bullet points. Immediately following the bullet points, outline the organization of the paper.

Author Response: - Thank you for the valuable suggestion. The Introduction has been revised to improve its structure, incorporate recent literature, highlight the rationale for the proposed classifiers, and clearly state the study objectives and manuscript organisation.

3.
Reviewer Comments: The authors need to add a related work section and deeply elaborate each work of the related works. Besides, they are required to summarize the related works in a table alongside with their weaknesses. The table can be as follow:
First column: Reference, Second column: Dataset, Third column: Features Extraction, Fourth column: Model, Fifth column: Results, and Sixth column: Weaknesses.

Author Response: Thank you for the suggestion. A comparative table summarizing related studies and their limitations has been added, along with additional discussion to position the present work within the existing literature.

4.
Reviewer Comments: The authors need to evaluate their proposed work on more evaluation measurements such as G-Mean and MCC.

Author Response: Thank you for this valuable suggestion. We have included two additional performance measures, namely Geometric Mean (G-Mean) and Matthews Correlation Coefficient (MCC), to provide a more comprehensive assessment of classifier performance. These metrics are particularly useful in evaluating classification models where balanced performance across classes is important.
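
Both measures follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions; the counts are hypothetical and the function names are chosen for illustration, not taken from the authors' code.

```python
import math


def g_mean(tp, tn, fp, fn):
    """Geometric mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient, in [-1, 1]; 0.0 if undefined."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts, not taken from the manuscript's tables.
tp, tn, fp, fn = 40, 45, 5, 10
print(f"G-Mean={g_mean(tp, tn, fp, fn):.3f}, MCC={mcc(tp, tn, fp, fn):.3f}")
```

Unlike accuracy, both values drop sharply when a classifier performs well on one class and poorly on the other, which is why they are often reported for imbalanced data.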

5.
Reviewer Comments: The authors need to add the equations for the evaluation measurements alongside with their references.

Author Response: Thank you for the suggestion. The mathematical equations for accuracy, precision, sensitivity, specificity, and F1-score have been added to the Performance Evaluation section along with appropriate references.
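
For reference, the standard confusion-matrix definitions of these five measures are sketched below. They are the textbook forms, written with TP, TN, FP and FN for true and false positives and negatives; the manuscript's own notation may differ.

```latex
\begin{align}
\text{Accuracy}    &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision}   &= \frac{TP}{TP + FP} \\
\text{Sensitivity} &= \frac{TP}{TP + FN} \\
\text{Specificity} &= \frac{TN}{TN + FP} \\
\text{F1-score}    &= \frac{2 \cdot \text{Precision} \cdot \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}
\end{align}
```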

6.
Reviewer Comments: In “3. Results” section, the authors need to provide the description of the environment that they have conducted their experiments on.

Author Response: Thank you for the suggestion. Details of the experimental environment, including software tools, hardware specifications, and computing platform, have been added to the manuscript to improve the reproducibility of the study.

7.
Reviewer Comments: The authors are required to statistically evaluate their proposed work based on Mean, RMSE, and STD of the evaluation measurements in order to make sure that the achieved results were not by accident or chance.

Author Response: Thank you for the suggestion. Mean Accuracy, Standard Deviation (STD), and RMSE obtained through five-fold cross-validation have been added to the Results section to further validate the robustness and reliability of the proposed classifiers.
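
How these three summaries might be computed is sketched below. Two assumptions are made that the response does not spell out: the mean and STD are taken over per-fold accuracies, and RMSE is taken between predicted and true class labels coded as 0 and 1. All numbers are hypothetical.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical per-fold accuracies from five-fold cross-validation.
fold_accuracies = [0.95, 0.96, 0.97, 0.96, 0.98]
mean_acc = statistics.mean(fold_accuracies)
std_acc = statistics.stdev(fold_accuracies)  # sample standard deviation

# Hypothetical labels for one fold: one error in four predictions.
fold_rmse = rmse([1, 0, 1, 1], [1, 0, 0, 1])
print(f"mean={mean_acc:.3f}, std={std_acc:.4f}, rmse={fold_rmse:.2f}")
```

With 0/1 labels each error contributes exactly 1 to the squared sum, so RMSE equals the square root of the misclassification rate and carries the same information as accuracy on that fold.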

8.
Reviewer Comments: The authors must add the explanation of the statistical evaluation measurements alongside with their equations.
For conducting/reporting/plotting/discussing the statistical outcomes, the authors can have a look at the following paper in order to get an idea on how to conduct/report/plotting/discuss the statistical results. “Albadr, M. A. A., Ayob, M., Tiun, S., AL-Dhief, F. T., Arram, A., & Khalaf, S. (2023). Breast cancer diagnosis using the fast learning network algorithm. Frontiers in Oncology, 13, 1150840”.

Author Response: Thank you for the suggestion. Explanations and equations for the evaluation metrics, including accuracy, precision, sensitivity, specificity, F1-score, and RMSE, have been added. The statistical outcomes from five-fold cross-validation have also been reported and discussed.

9.
Reviewer Comments: The authors are required to compare the performance of the proposed work against some recent studies that have used the same dataset in terms of accuracy.

Author Response: Thank you for the suggestion. A direct comparison with studies that utilized the Wisconsin Breast Cancer Dataset has been added to the Discussion section. The classification accuracy achieved by the proposed model is compared with previously reported results to highlight its relative performance.

10.
Reviewer Comments: The authors must write the conclusion in more effective way alongside with the limitations of the current work and real future works.

Author Response: Thank you for the valuable suggestion. The Conclusion section has been revised to provide a clearer summary of the main findings of the study. In addition, the key limitations of the current work and realistic future research directions have been explicitly discussed.
Reviewer Comments-1

1.
Reviewer Comments: - Numerous facts that have been mentioned in the manuscript need to be supported with recent references, check the whole manuscript.

Author Response: - Thank you for the suggestion, changes have been made accordingly in manuscript.

2.
Reviewer Comments: I advise the authors to write the “1. Introduction” as follow: start with a brief introduction of the Breast cancer, then discuss the effectiveness of ML and DL techniques in the health care generally and specifically in Breast cancer detection and deeply deliberate about the issues that the authors want to tackle in the present work (support that with recent references). Subsequently, highlight the strength of the proposed techniques and justify why they are proposing them for the Breast cancer detection. Conclude with the main goals of the current work presented as bullet points. Immediately following the bullet points, outline the organization of the paper.

Author Response: - Thank you for the valuable suggestion. The Introduction has been revised to improve its structure, incorporate recent literature, highlight the rationale for the proposed classifiers, and clearly state the study objectives and manuscript organisation.

3.
Reviewer Comments: The authors need to add a related work section and deeply elaborate each work of the related works. Besides, they are required to summarize the related works in a table alongside with their weaknesses. The table can be as follow:
First column: Reference, Second column: Dataset, Third column: Features Extraction, Fourth column: Model, Fifth column: Results, and Sixth column: Weaknesses.

Author Response: Thank you for the suggestion. A comparative table summarizing related studies and their limitations has been added, along with additional discussion to position the present work within the existing literature.

4.
Reviewer Comments: The authors need to evaluate their proposed work using additional evaluation measures such as G-Mean and MCC.

Author Response: Thank you for this valuable suggestion. We have included two additional performance measures, namely Geometric Mean (G-Mean) and Matthews Correlation Coefficient (MCC), to provide a more comprehensive assessment of classifier performance. These metrics are particularly useful in evaluating classification models where balanced performance across classes is important.
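
As an illustrative sketch only (this is not the authors' code), both measures can be computed directly from the four confusion-matrix counts. The function names and the example counts below are hypothetical and are not results from the manuscript; G-Mean is taken here in its common form as the geometric mean of sensitivity and specificity.

```python
import math


def g_mean(tp: int, tn: int, fp: int, fn: int) -> float:
    """Geometric mean of sensitivity and specificity."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts for illustration only (not taken from the manuscript).
tp, tn, fp, fn = 40, 70, 2, 2
print(f"G-Mean = {g_mean(tp, tn, fp, fn):.4f}")
print(f"MCC    = {mcc(tp, tn, fp, fn):.4f}")
```

Both measures fall when either class is predicted poorly, which is why they are often reported next to accuracy on imbalanced data.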

5.
Reviewer Comments: The authors need to add the equations for the evaluation measures along with their references.

Author Response: Thank you for the suggestion. The mathematical equations for accuracy, precision, sensitivity, specificity, and F1-score have been added to the Performance Evaluation section along with appropriate references.
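
The equations themselves are not reproduced in this response. For reference, the standard confusion-matrix definitions of the five listed metrics are sketched below, assuming the usual notation of TP, TN, FP, and FN for true positives, true negatives, false positives, and false negatives; the manuscript's own notation may differ.

```latex
\begin{align}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Sensitivity (Recall)} &= \frac{TP}{TP + FN} \\
\text{Specificity} &= \frac{TN}{TN + FP} \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}
\end{align}
```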

6.
Reviewer Comments: In “3. Results” section, the authors need to provide the description of the environment that they have conducted their experiments on.

Author Response: Thank you for the suggestion. Details of the experimental environment, including software tools, hardware specifications, and computing platform, have been added to the manuscript to improve the reproducibility of the study.

7.
Reviewer Comments: The authors are required to statistically evaluate their proposed work based on Mean, RMSE, and STD of the evaluation measurements in order to make sure that the achieved results were not by accident or chance.

Author Response: Thank you for the suggestion. Mean Accuracy, Standard Deviation (STD), and RMSE obtained through five-fold cross-validation have been added to the Results section to further validate the robustness and reliability of the proposed classifiers.
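
A minimal sketch of how such fold-level statistics could be obtained is given below. It is not the authors' code and rests on several assumptions: scikit-learn's bundled copy of the Wisconsin Diagnostic dataset, an RBF SVC with default settings behind a Robust Scaler, a fixed random seed, and RMSE computed between the 0/1 true and predicted labels (the response does not state how RMSE was defined).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

# Assumption: scikit-learn's bundled Wisconsin Diagnostic data stands in for the study data.
X, y = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline so it is re-fitted on each training fold only.
model = make_pipeline(RobustScaler(), SVC(kernel="rbf"))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_acc, fold_rmse = [], []
for train_idx, test_idx in cv.split(X, y):
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    fold_acc.append(np.mean(pred == y[test_idx]))
    # Assumed definition: RMSE between true and predicted 0/1 labels.
    fold_rmse.append(np.sqrt(np.mean((y[test_idx] - pred) ** 2)))

summary = {
    "mean_accuracy": float(np.mean(fold_acc)),
    "std_accuracy": float(np.std(fold_acc, ddof=1)),  # sample standard deviation
    "mean_rmse": float(np.mean(fold_rmse)),
}
print(summary)
```

Under this label-based definition, the squared RMSE of a fold equals its error rate, so it carries the same information as accuracy; a different RMSE definition would need its own computation.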

8.
Reviewer Comments: The authors must add an explanation of the statistical evaluation measures along with their equations.
For guidance on conducting, reporting, plotting, and discussing the statistical outcomes, the authors can refer to the following paper: “Albadr, M. A. A., Ayob, M., Tiun, S., AL-Dhief, F. T., Arram, A., & Khalaf, S. (2023). Breast cancer diagnosis using the fast-learning network algorithm. Frontiers in Oncology, 13, 1150840”.

Author Response: Thank you for the suggestion. Explanations and equations for the evaluation metrics, including accuracy, precision, sensitivity, specificity, F1-score, and RMSE, have been added. The statistical outcomes from five-fold cross-validation have also been reported and discussed.

9.
Reviewer Comments: The authors are required to compare the performance of the proposed work against some recent studies that have used the same dataset in terms of accuracy.

Author Response: Thank you for the suggestion. A direct comparison with studies that utilized the Wisconsin Breast Cancer Dataset has been added to the Discussion section. The classification accuracy achieved by the proposed model is compared with previously reported results to highlight its relative performance.

10.
Reviewer Comments: The authors must write the conclusion in a more effective way, along with the limitations of the current work and realistic future work.

Author Response: Thank you for the valuable suggestion. The Conclusion section has been revised to provide a clearer summary of the main findings of the study. In addition, the key limitations of the current work and realistic future research directions have been explicitly discussed.
Competing Interests: Nil

Version 4

PUBLISHED 05 Sep 2025

Revised

Reviewer Report 24 Sep 2025

Abicumaran Uthamacumaran, McGill University (Ringgold ID: 5620), Montréal, Québec, Canada

Not Approved

https://doi.org/10.5256/f1000research.187487.r411913

Thank you, authors, for your revision and your good intentions to contribute to humanity. Overall, the manuscript is clearly written. The revision adds methodological details (e.g., justification of hyperparameter tuning, explicit highlights of the limitations) that improve clarity and reproducibility for readers. The results (SVC-RBF with ~99% accuracy, AUC = 0.96) are strong and could be repurposed for clinically relevant use.

However, upon reviewing the code, it has come to my attention that the paper might not actually demonstrate what the authors claim. In the code, the final SVC settings match the "default" parameters, and no actual "GridSearch" code appears in the code provided. It looks like the models were simply run with defaults while the text claims "tuning" with five-fold cross-validation, which weakens rigor and reinforces the skepticism about the extremely high accuracy values.
Unfortunately, this makes the work more prone to overfitting and less convincing than if a transparent parameter search had been documented.

I will still give you the benefit of the doubt and respectfully allow you to clarify this in the revision.

Please also make the code you provided in the response letter available in a public repository and include the link with the work. You provided the code only as an author response, and this has revealed flaws in the claims made.

Another minor note: the authors explicitly state that specificity was “calculated using Scikit-learn’s classification report function,” but that function does not compute specificity by default. It generates only precision, recall, F1, and support. Please update the code and text as appropriate. Specificity should be derived from the confusion matrix (TN/(TN+FP)) or via recall_score(..., pos_label=0).

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Cancer Research; Artificial Intelligence; Systems medicine

Author Response 09 May 2026

Dola Saha, Department of Health Information Management, Manipal College of Health Professions, Manipal Academy of Higher Education, Manipal, 576104, India

We thank the reviewer for this valuable observation. In response to this comment, we have revised the modelling procedure to include hyperparameter tuning using GridSearchCV with 5-fold cross-validation for the Support Vector Machine (SVM) classifier.

The following hyperparameters were explored during tuning:

C: [0.1, 1, 10, 100]

gamma: ['scale', 0.01, 0.001]

kernel: ['rbf']

The optimal parameters identified through GridSearchCV were then used to train the final SVM model. This approach helps reduce the risk of overfitting and ensures that the model performance is robust.
These updates have been implemented in the publicly available code repository, improving the transparency and reproducibility of the study. Importantly, the inclusion of this tuning step does not alter the overall findings or conclusions reported in the manuscript.
The revised code is available at: https://github.com/rinsyrahman/breast-cancer-ml-analysis
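
For readers without access to the repository, the search described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' notebook: it uses scikit-learn's bundled copy of the Wisconsin Diagnostic dataset, an assumed 80/20 stratified split with a fixed seed, and a Robust Scaler placed inside a pipeline so that scaling is fitted only on the training folds. The grid values are those listed in the response.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

# Assumption: bundled dataset and an 80/20 stratified split stand in for the study setup.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipe = Pipeline([("scaler", RobustScaler()), ("svc", SVC())])

# Grid taken from the author response; the "svc__" prefix targets the pipeline step.
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": ["scale", 0.01, 0.001],
    "svc__kernel": ["rbf"],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best = search.best_index_
print("Best parameters:", search.best_params_)
print("CV mean accuracy:", search.cv_results_["mean_test_score"][best])
print("CV std accuracy:", search.cv_results_["std_test_score"][best])
print("Held-out test accuracy:", search.score(X_test, y_test))
```

Reporting the selected parameters together with the fold mean and standard deviation makes it possible to check whether the tuned model actually differs from the defaults.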

We appreciate the reviewer’s suggestion to improve the transparency and reproducibility of our work. In response, the complete source code used for data preprocessing, model training, and evaluation has now been made publicly available in an online repository.

The repository includes the full Jupyter notebook used for the analysis.

Repository link: https://github.com/rinsyrahman/breast-cancer-ml-analysis

We thank the reviewer for this helpful clarification. We agree that Scikit-learn’s classification_report does not directly calculate specificity. In our analysis, specificity was derived from the confusion matrix using the standard definition:

Specificity = TN / (TN + FP)

where TN represents true negatives and FP represents false positives.
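
A short sketch of the two routes named by the reviewer is shown below. The label vectors are hypothetical and the encoding (1 = positive, 0 = negative) is an assumption; it may differ from the encoding used in the authors' notebook.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels for illustration: 1 = positive class, 0 = negative class.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]

# With labels=[0, 1], ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
specificity_cm = tn / (tn + fp)

# Equivalent route: specificity is the recall of the negative class.
specificity_recall = recall_score(y_true, y_pred, pos_label=0)

print(specificity_cm, specificity_recall)
```

The two values agree by construction, so either route can be used as long as the label treated as negative is stated explicitly.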
Competing Interests: Nil

Reviewer Report 19 Sep 2025

Chandrakanta Mahanty, GITAM Deemed to Be University, Visakhapatnam, India

Approved

https://doi.org/10.5256/f1000research.187487.r411911

Authors have modified the article as ...

Version 3

PUBLISHED 16 May 2025

Revised

Reviewer Report 22 Aug 2025

Chandrakanta Mahanty, GITAM Deemed to Be University, Visakhapatnam, India

Approved with Reservations

https://doi.org/10.5256/f1000research.182136.r392084

Has the author performed any statistical tests, such as confidence intervals or p-values, to validate the feature separability beyond visualizations like violin and box plots, and if not, why were such statistical measures omitted?
Given the dataset's class imbalance, why did the author primarily rely on accuracy as a performance metric instead of placing more emphasis on metrics like FNR (False Negative Rate), FOR (False Omission Rate), and AUC, which are critical in medical diagnostics?
What is the mathematical rationale behind selecting Robust Scaler for feature normalization over other techniques like Min-Max or Standard Scaler, especially in the context of preserving inter-feature relationships in high-dimensional medical data?
While hyperparameters were tuned using GridSearchCV with 5-fold cross-validation, can the author mathematically justify the choice of these specific folds and parameter ranges, and were any metrics like mean and standard deviation of folds reported for robustness?
Can the author clarify whether any dimensionality reduction techniques such as PCA or t-SNE were considered to improve feature interpretability and computational efficiency, and if not, what were the limitations in incorporating them?
How did the author ensure that multicollinearity among features did not bias the predictive performance of the models, especially given the observed strong correlations (e.g., 0.86 between ‘concavity worst’ and ‘concave points worst’)?
Could the author mathematically explain why the SVC-RBF model, despite being a black-box algorithm, consistently outperformed simpler interpretable models like Logistic Regression, and how does the RBF kernel mathematically handle non-linear separability in the given dataset?
What quantitative measures support the claim that texture mean and area (worst) are the most discriminative features, and were feature importance techniques like SHAP or mutual information scores used to validate their significance?
Given that decision trees showed perfect training accuracy but the lowest testing accuracy, can the author provide further statistical evidence or overfitting diagnostics, such as learning curves or variance-bias analysis, to support this conclusion?
Can the author elaborate on the clinical implications of the model’s AUC score of 0.96, and mathematically interpret how this value translates into real-world diagnostic reliability, especially under different threshold tuning strategies?

Is the work clearly and accurately presented and does it cite the current literature?
Yes

Is the study design appropriate and is the work technically sound?
Yes

Are sufficient details of methods and analysis provided to allow replication by others?
Yes

If applicable, is the statistical analysis and its interpretation appropriate?
Yes

Are all the source data underlying the results available to ensure full reproducibility?
Yes

Are the conclusions drawn adequately supported by the results?
Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Deep Learning, machine learning

Author Response 10 Sep 2025

Dola Saha, Department of Health Information Management, Manipal College of Health Professions, Manipal Academy of Higher Education, Manipal, 576104, India

Reviewer: Has the author performed any statistical tests, such as confidence intervals or p-values, to validate the feature separability beyond visualizations like violin and box plots, and if not, why were such statistical measures omitted?
Author: We thank the reviewer for the comment. Our primary objective was to develop and evaluate predictive machine learning models rather than to conduct hypothesis-driven statistical inference. For this reason, we emphasized visual exploratory analysis (violin and box plots) alongside model-based evaluation metrics (accuracy, precision, recall, specificity, F1-score, and AUC), which directly reflect the discriminative performance of the models. While traditional statistical tests such as confidence intervals and p-values can quantify separability at the feature level, our focus was on whether the features improved predictive accuracy in the multivariate setting of machine learning. We acknowledge the absence of such tests as a limitation and have added it to the manuscript.

Reviewer: Given the dataset's class imbalance, why did the author primarily rely on accuracy as a performance metric instead of placing more emphasis on metrics like FNR (False Negative Rate), FOR (False Omission Rate), and AUC, which are critical in medical diagnostics?
Author: We thank the reviewer for the observation. Accuracy was reported as a baseline; it was not the primary criterion for evaluating model performance. In medical diagnostics, especially for breast cancer detection where false negatives carry critical consequences, we emphasized recall (sensitivity), which directly addresses the False Negative Rate (FNR = 1 − sensitivity), and included specificity, F1-score, and ROC-AUC to provide a more comprehensive evaluation. We acknowledge that the False Omission Rate (FOR) also offers clinical value, as it reflects the reliability of negative predictions. Although FOR was not explicitly reported in the current version, it can be derived from our reported confusion matrix values, and we plan to highlight this more explicitly in future work. We have clarified this in the limitations section of the manuscript.

Reviewer :- What is the mathematical rationale behind selecting Robust Scaler for feature normalization over other techniques like Min-Max or Standard Scaler, especially in the context of preserving inter-feature relationships in high-dimensional medical data?
Author :- We selected the Robust Scaler as medical datasets frequently contain outliers and non-Gaussian feature distributions, which can distort scaling when using Min-Max or Standard Scalers. By centering each feature on its median and scaling by its interquartile range, Robust Scaler reduces the undue influence of extreme values while retaining the statistical structure of the bulk of the data. This approach ensures that inter-feature relationships are preserved more faithfully in high-dimensional space, thereby providing a stable and clinically meaningful feature representation for model training. Changes made in the Method section.
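
The median/IQR transform described above can be sketched in plain Python as follows. This is an illustrative re-implementation under the assumption of default quartile settings, not the code used in the study, and the feature column is hypothetical.

```python
import statistics

def robust_scale(values):
    """(x - median) / IQR: median centering with interquartile-range scaling."""
    med = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - med) / (q3 - q1) for v in values]

def standard_scale(values):
    """(x - mean) / standard deviation, for comparison."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical feature column with one extreme value (not study data).
feature = [10.0, 11.0, 12.0, 13.0, 14.0, 100.0]
robust = robust_scale(feature)
standard = standard_scale(feature)

# Spread of the five typical values after each transform.
print("robust spread:  ", robust[4] - robust[0])
print("standard spread:", standard[4] - standard[0])
```

In this toy column the single extreme value inflates the standard deviation and compresses the typical values, whereas the median and IQR are unaffected by it. The extreme value itself is not removed by either transform.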

Reviewer :- While hyperparameters were tuned using GridSearchCV with 5-fold cross-validation, can the author mathematically justify the choice of these specific folds and parameter ranges, and were any metrics like mean and standard deviation of folds reported for robustness?
Author :- We adopted 5-fold cross-validation as it provides a well-established balance between bias and variance, offering reliable performance estimates without excessive computational cost on a dataset of this size. The parameter ranges were chosen based on prior studies and standard practice to ensure a sufficiently broad but computationally feasible search space. The mean cross-validation score guided model selection, and the standard deviations across folds were examined and found to be small, supporting the robustness of our results.
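
A sketch of how the per-configuration mean and standard deviation across folds can be read from a grid search is shown below. It assumes scikit-learn and its bundled copy of the dataset; the pipeline, the parameter grid, and the accuracy scoring are hypothetical choices for the example, not the ranges used in the manuscript.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline so it is refit on each training fold.
pipeline = make_pipeline(RobustScaler(), SVC(kernel="rbf"))
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01, 0.1]}  # hypothetical grid

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best = search.best_index_
mean_score = search.cv_results_["mean_test_score"][best]
std_score = search.cv_results_["std_test_score"][best]
print(f"best params: {search.best_params_}")
print(f"5-fold accuracy: {mean_score:.3f} +/- {std_score:.3f}")
```

Reporting both numbers for the selected configuration lets readers judge fold-to-fold stability directly.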

Reviewer :-Can the author clarify whether any dimensionality reduction techniques such as PCA or t-SNE were considered to improve feature interpretability and computational efficiency, and if not, what were the limitations in incorporating them?
Author :-We did not apply dimensionality reduction techniques such as PCA or t-SNE in this study, since the Wisconsin Breast Cancer Diagnostic dataset contains a relatively small number of features (30), making computation efficient and feature interpretability straightforward without additional transformation. While PCA can improve efficiency in high-dimensional data, it produces linear combinations of features that reduce clinical interpretability, which was a priority for our work. Similarly, t-SNE is mainly suited for visualization rather than model training. Changes made in the Method section.

Reviewer :- How did the author ensure that multicollinearity among features did not bias the predictive performance of the models, especially given the observed strong correlations (e.g., 0.86 between ‘concavity worst’ and ‘concave points worst’)?
Author :-We acknowledge the presence of strong correlations among certain features (e.g., 0.86 between concavity worst and concave points worst). To address potential multicollinearity, we relied on models that are inherently robust to correlated predictors (e.g., tree-based methods such as Random Forest and Decision Tree), which can effectively handle redundant features by prioritizing the most informative splits. For Logistic Regression and SVM, regularization within GridSearchCV tuning helped mitigate the effect of multicollinearity. Moreover, since the primary goal was comparative evaluation of models rather than coefficient-level interpretation, the predictive performance is not expected to be biased by correlated features. Changes made in the Method section.
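
For readers who want to check a pairwise correlation of this kind, a short sketch follows. It assumes scikit-learn's bundled copy of the dataset, where the two features are named "worst concavity" and "worst concave points", and uses NumPy's Pearson correlation; it is not the authors' analysis code.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
names = list(data.feature_names)

concavity = data.data[:, names.index("worst concavity")]
concave_points = data.data[:, names.index("worst concave points")]

# Pearson correlation between the two feature columns.
r = np.corrcoef(concavity, concave_points)[0, 1]
print(f"Pearson r = {r:.2f}")
```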

Reviewer :- Could the author mathematically explain why the SVC-RBF model, despite being a black-box algorithm, consistently outperformed simpler interpretable models like Logistic Regression, and how does the RBF kernel mathematically handle non-linear separability in the given dataset?
Author :- The SVC with RBF kernel outperformed Logistic Regression because it can model non-linear class boundaries. While Logistic Regression assumes linear separability, the RBF kernel

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$$

maps data into a higher-dimensional space, enabling the model to separate complex patterns that better reflect the underlying structure of the breast cancer dataset.
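
The kernel above can be evaluated directly, as in the minimal sketch below. The points and the value of γ are hypothetical, chosen only to show that similarity decays with squared distance.

```python
import math

def rbf_kernel(x, z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

gamma = 0.5          # hypothetical value; in practice tuned by cross-validation
a = [1.0, 2.0]
near = [1.0, 2.5]
far = [4.0, 6.0]

k_same = rbf_kernel(a, a, gamma)     # identical points give the maximum, 1.0
k_near = rbf_kernel(a, near, gamma)
k_far = rbf_kernel(a, far, gamma)
print(k_same, k_near, k_far)
```

Larger γ makes the similarity fall off faster, which yields a more flexible decision boundary and a higher risk of overfitting.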

Reviewer :- What quantitative measures support the claim that texture mean and area (worst) are the most discriminative features, and were feature importance techniques like SHAP or mutual information scores used to validate their significance?
Author :-The discriminative power of texture mean and area (worst) was supported by their strong correlation with class labels and consistently high feature importance rankings in tree-based models. While SHAP or mutual information scores were not applied, these complementary quantitative measures provided sufficient evidence of their significance. Changes made in the Results section.
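
As an illustration of the mutual information check the reviewer mentions (which, as stated above, was not applied in the study), the sketch below ranks features with scikit-learn's `mutual_info_classif` on the library's bundled copy of the dataset. The fixed `random_state` is an assumption made for reproducibility.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

data = load_breast_cancer()
scores = mutual_info_classif(data.data, data.target, random_state=0)

# Rank features from most to least informative about the class label.
ranking = sorted(zip(data.feature_names, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranking[:5]:
    print(f"{name}: {score:.3f}")
```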

Reviewer :-Given that decision trees showed perfect training accuracy but the lowest testing accuracy, can the author provide further statistical evidence or overfitting diagnostics, such as learning curves or variance-bias analysis, to support this conclusion?
Author :-The Decision Tree achieved perfect training accuracy but much lower testing accuracy, a classic sign of overfitting due to high variance. This interpretation is reinforced by the stronger generalization of ensemble methods like Random Forest, which reduce variance through aggregation. Changes added in the Limitations section.
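
A simple overfitting diagnostic of the kind the reviewer asks about is to compare training and testing accuracy as tree depth grows. The sketch below assumes scikit-learn and its bundled copy of the dataset; the split, the depths, and the random seed are hypothetical and do not reproduce the manuscript's experiment.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

results = {}
for depth in (2, 4, None):   # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(f"max_depth={depth}: train={results[depth][0]:.3f}, test={results[depth][1]:.3f}")
```

A training score that keeps rising while the test score stalls or drops is the pattern consistent with a high-variance model.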

Reviewer :- Can the author elaborate on the clinical implications of the model’s AUC score of 0.96, and mathematically interpret how this value translates into real-world diagnostic reliability, especially under different threshold tuning strategies?
Author :- We thank the reviewer for this valuable comment. We have clarified in the Discussion that an AUC of 0.96 indicates excellent discriminative ability. Mathematically, it means the model has a 96% probability of assigning a higher score to a randomly chosen malignant case than to a randomly chosen benign case. Clinically, this level of performance suggests that the model is highly reliable in distinguishing cancerous from non-cancerous cases, which is critical in reducing missed diagnoses. Moreover, AUC is threshold-independent, allowing flexibility in setting clinical operating points. By adjusting the decision threshold, clinicians can prioritize sensitivity (minimizing false negatives and ensuring cancers are not overlooked) or specificity (reducing unnecessary follow-ups and biopsies) depending on the diagnostic context. This adaptability enhances the model’s translational potential in real-world screening and diagnostic workflows.
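
The ranking interpretation of AUC and the effect of moving the threshold can be made concrete with a small sketch. The scores below are hypothetical predicted malignancy probabilities invented for the example, not model outputs from the study.

```python
def auc_by_ranking(pos_scores, neg_scores):
    """Fraction of (malignant, benign) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def sensitivity_specificity(pos_scores, neg_scores, threshold):
    """Classify as malignant when score >= threshold."""
    tp = sum(s >= threshold for s in pos_scores)
    tn = sum(s < threshold for s in neg_scores)
    return tp / len(pos_scores), tn / len(neg_scores)

# Hypothetical scores for illustration only.
malignant = [0.95, 0.80, 0.65, 0.40]
benign = [0.50, 0.30, 0.20, 0.10, 0.05]

auc = auc_by_ranking(malignant, benign)
strict = sensitivity_specificity(malignant, benign, 0.6)
lenient = sensitivity_specificity(malignant, benign, 0.25)
print(auc, strict, lenient)
```

In this toy example the AUC stays fixed while lowering the threshold from 0.6 to 0.25 raises sensitivity and lowers specificity, which is the trade-off described above.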

Reviewer :-

Mahanty C, Rajesh T, Govil N, Venkateswarulu N, et al.: Effective Alzheimer’s disease detection using enhanced Xception blending with snapshot ensemble. Scientific Reports. 2024; 14 (1).

Malik S, Patro S, Mahanty C, Lasisi A, et al.: Hybrid raven roosting intelligence framework for enhancing efficiency in data clustering. Scientific Reports. 2024; 14 (1).

Altameem A, Mahanty C, Poonia R, Saudagar A, et al.: Breast Cancer Detection in Mammography Images Using Deep Convolutional Neural Networks and Fuzzy Ensemble Modeling Techniques. Diagnostics. 2022; 12 (8).

Author :- We thank the reviewer for the valuable suggestion. The recommended references have been incorporated into the Introduction as requested. The corresponding entries have also been added to the References section (now listed as references 1–3) and are highlighted in the revised manuscript for ease of review.
Competing Interests: No competing interests were disclosed.

COMMENTS ON THIS REPORT

Author Response 10 Sep 2025

Dola Saha, Department of Health Information Management, Manipal College of Health Professions, Manipal Academy of Higher Education, Manipal, 576104, India

10 Sep 2025

Author Response
Reviewer :-Has the author performed any statistical tests, such as confidence intervals or p-values, to validate the feature separability beyond visualizations like violin and box plots, and if not, ... Continue reading
Reviewer :-Has the author performed any statistical tests, such as confidence intervals or p-values, to validate the feature separability beyond visualizations like violin and box plots, and if not, why were such statistical measures omitted?
Author :-We thank the reviewer for the comment. Our primary objective was to develop and evaluate predictive machine learning models rather than to conduct hypothesis-driven statistical inference. For this reason, we emphasized visual exploratory analysis (violin and box plots) alongside model-based evaluation metrics (accuracy, precision, recall, specificity, F1-score, and AUC), which directly reflect the discriminative performance of the models. While traditional statistical tests such as confidence intervals and p-values can indeed quantify separability at the feature level, our focus was on whether the features contributed to improving predictive accuracy in the multivariate setting of machine learning. We acknowledge this as a limitation and have added it to the manuscript.

Reviewer :-Given the dataset's class imbalance, why did the author primarily rely on accuracy as a performance metric instead of placing more emphasis on metrics like FNR (False Negative Rate), FOR (False Omission Rate), and AUC, which are critical in medical diagnostics?
Author :-We thank the reviewer for the observation. Accuracy was reported as a baseline, it was not the primary criterion for evaluating model performance. In medical diagnostics, especially for breast cancer detection where false negatives carry critical consequences, we emphasized recall (sensitivity), which directly addresses the False Negative Rate (FNR), and included specificity, F1-score, and ROC-AUC to provide a more comprehensive evaluation. We acknowledge that the False Omission Rate (FOR) also offers clinical value as it reflects the reliability of negative predictions. Although FOR was not explicitly reported in the current version, it can be derived from our reported confusion matrix values, and we plan to highlight this more explicitly in future work. We have clarified this in the limitations section of the manuscript.

Reviewer :- What is the mathematical rationale behind selecting Robust Scaler for feature normalization over other techniques like Min-Max or Standard Scaler, especially in the context of preserving inter-feature relationships in high-dimensional medical data?
Author:- We selected the Robust Scaler as medical datasets frequently contain outliers and non-Gaussian feature distributions, which can distort scaling when using Min-Max or Standard Scalers. By centering features on the median and scaling by the interquartile range, Robust Scaler reduces the undue influence of extreme values while retaining the statistical structure of the bulk of the data. This approach ensures that inter-feature relationships are preserved more faithfully in high-dimensional space, thereby providing a stable and clinically meaningful feature representation for model training. In the Method Section

Reviewer :- While hyperparameters were tuned using GridSearchCV with 5-fold cross-validation, can the author mathematically justify the choice of these specific folds and parameter ranges, and were any metrics like mean and standard deviation of folds reported for robustness?
Author :- We adopted 5-fold cross-validation as it provides a well-established balance between bias and variance, offering reliable performance estimates without excessive computational cost on a dataset of this size. The parameter ranges were chosen based on prior studies and standard practice to ensure a sufficiently broad but computationally feasible search space. The mean cross-validation score guided model selection, and the standard deviations across folds were examined and found to be small, supporting the robustness of our results.

Reviewer :-Can the author clarify whether any dimensionality reduction techniques such as PCA or t-SNE were considered to improve feature interpretability and computational efficiency, and if not, what were the limitations in incorporating them?
Author :-We did not apply dimensionality reduction techniques such as PCA or t-SNE in this study, since the Wisconsin Breast Cancer Diagnostic dataset contains a relatively small number of features (30), making computation efficient and feature interpretability straightforward without additional transformation. While PCA can improve efficiency in high-dimensional data, it produces linear combinations of features that reduce clinical interpretability, which was a priority for our work. Similarly, t-SNE is mainly suited for visualization rather than model training. Changes made In the Method Section

Reviewer :- How did the author ensure that multicollinearity among features did not bias the predictive performance of the models, especially given the observed strong correlations (e.g., 0.86 between ‘concavity worst’ and ‘concave points worst’)?
Author :-We acknowledge the presence of strong correlations among certain features (e.g., 0.86 between concavity worst and concave points worst). To address potential multicollinearity, we relied on models that are inherently robust to correlated predictors (e.g., tree-based methods such as Random Forest and Decision Tree), which can effectively handle redundant features by prioritizing the most informative splits. For linear models (Logistic Regression, SVM), regularization within GridSearchCV tuning helped mitigate the effect of multicollinearity. Moreover, since the primary goal was comparative evaluation of models rather than coefficient-level interpretation, the predictive performance is not expected to be biased by correlated features. Changes made In the Method Section

Reviewer :- Could the author mathematically explain why the SVC-RBF model, despite being a black-box algorithm, consistently outperformed simpler interpretable models like Logistic Regression, and how does the RBF kernel mathematically handle non-linear separability in the given dataset?
Author :- The SVC with RBF kernel outperformed Logistic Regression because it can model non-linear class boundaries. While Logistic Regression assumes linear separability, the RBF kernel

K(xi,xj)=exp(−γ∥xi−xj∥2)

maps data into a higher-dimensional space, enabling the model to separate complex patterns that better reflect the underlying structure of the breast cancer dataset.
-

Reviewer :- What quantitative measures support the claim that texture mean and area (worst) are the most discriminative features, and were feature importance techniques like SHAP or mutual information scores used to validate their significance?
Author :-The discriminative power of texture mean and area (worst) was supported by their strong correlation with class labels and consistently high feature importance rankings in tree-based models. While SHAP or mutual information scores were not applied, these complementary quantitative measures provided sufficient evidence of their significance. Changes made In the Result Section

Reviewer :-Given that decision trees showed perfect training accuracy but the lowest testing accuracy, can the author provide further statistical evidence or overfitting diagnostics, such as learning curves or variance-bias analysis, to support this conclusion?
Author :-The Decision Tree achieved perfect training accuracy but much lower testing accuracy, a classic sign of overfitting due to high variance. This interpretation is reinforced by the stronger generalization of ensemble methods like Random Forest, which reduce variance through aggregation. Changes added in the limitation section

Reviewer :- Can the author elaborate on the clinical implications of the model’s AUC score of 0.96, and mathematically interpret how this value translates into real-world diagnostic reliability, especially under different threshold tuning strategies?
Author :- We thank the reviewer for this valuable comment. We have clarified in the Discussion that an AUC of 0.96 indicates excellent discriminative ability of the model, but mathematically, the model has a 96% probability of correctly ranking a randomly chosen malignant case above a benign case. Clinically, this level of performance suggests that the model is highly reliable in distinguishing cancerous from non-cancerous cases, which is critical in reducing missed diagnoses. Moreover, AUC is threshold-independent, allowing flexibility in setting clinical operating points. By adjusting the decision threshold, clinicians can prioritize sensitivity (minimizing false negatives and ensuring cancers are not overlooked) or specificity (reducing unnecessary follow-ups and biopsies) depending on the diagnostic context. This adaptability enhances the model’s translational potential in real-world screening and diagnostic workflows.
-

Reviewer :-

Mahanty C, Rajesh T, Govil N, Venkateswarulu N, et al.: Effective Alzheimer’s disease detection using enhanced Xception blending with snapshot ensemble. Scientific Reports. 2024; 14 (1). Publisher Full Text

Malik S, Patro S, Mahanty C, Lasisi A, et al.: Hybrid raven roosting intelligence framework for enhancing efficiency in data clustering. Scientific Reports. 2024; 14 (1). Publisher Full Text

Altameem A, Mahanty C, Poonia R, Saudagar A, et al.: Breast Cancer Detection in Mammography Images Using Deep Convolutional Neural Networks and Fuzzy Ensemble Modeling Techniques. Diagnostics. 2022; 12 (8). Publisher Full Text

Author :- We thank the reviewer for the valuable suggestion. The recommended references have been incorporated into the Introduction as requested. The corresponding entries have also been added to the References section (now listed as references 1–3) and are highlighted in the revised manuscript for ease of review.
Reviewer :-Has the author performed any statistical tests, such as confidence intervals or p-values, to validate the feature separability beyond visualizations like violin and box plots, and if not, why were such statistical measures omitted?
Author :-We thank the reviewer for the comment. Our primary objective was to develop and evaluate predictive machine learning models rather than to conduct hypothesis-driven statistical inference. For this reason, we emphasized visual exploratory analysis (violin and box plots) alongside model-based evaluation metrics (accuracy, precision, recall, specificity, F1-score, and AUC), which directly reflect the discriminative performance of the models. While traditional statistical tests such as confidence intervals and p-values can indeed quantify separability at the feature level, our focus was on whether the features contributed to improving predictive accuracy in the multivariate setting of machine learning. We acknowledge this as a limitation and have added it to the manuscript.

Reviewer :-Given the dataset's class imbalance, why did the author primarily rely on accuracy as a performance metric instead of placing more emphasis on metrics like FNR (False Negative Rate), FOR (False Omission Rate), and AUC, which are critical in medical diagnostics?
Author :-We thank the reviewer for the observation. Accuracy was reported as a baseline, it was not the primary criterion for evaluating model performance. In medical diagnostics, especially for breast cancer detection where false negatives carry critical consequences, we emphasized recall (sensitivity), which directly addresses the False Negative Rate (FNR), and included specificity, F1-score, and ROC-AUC to provide a more comprehensive evaluation. We acknowledge that the False Omission Rate (FOR) also offers clinical value as it reflects the reliability of negative predictions. Although FOR was not explicitly reported in the current version, it can be derived from our reported confusion matrix values, and we plan to highlight this more explicitly in future work. We have clarified this in the limitations section of the manuscript.

Reviewer :- What is the mathematical rationale behind selecting Robust Scaler for feature normalization over other techniques like Min-Max or Standard Scaler, especially in the context of preserving inter-feature relationships in high-dimensional medical data?
Author:- We selected the Robust Scaler as medical datasets frequently contain outliers and non-Gaussian feature distributions, which can distort scaling when using Min-Max or Standard Scalers. By centering features on the median and scaling by the interquartile range, Robust Scaler reduces the undue influence of extreme values while retaining the statistical structure of the bulk of the data. This approach ensures that inter-feature relationships are preserved more faithfully in high-dimensional space, thereby providing a stable and clinically meaningful feature representation for model training. In the Method Section

Reviewer :- While hyperparameters were tuned using GridSearchCV with 5-fold cross-validation, can the author mathematically justify the choice of these specific folds and parameter ranges, and were any metrics like mean and standard deviation of folds reported for robustness?
Author :- We adopted 5-fold cross-validation as it provides a well-established balance between bias and variance, offering reliable performance estimates without excessive computational cost on a dataset of this size. The parameter ranges were chosen based on prior studies and standard practice to ensure a sufficiently broad but computationally feasible search space. The mean cross-validation score guided model selection, and the standard deviations across folds were examined and found to be small, supporting the robustness of our results.

Reviewer :-Can the author clarify whether any dimensionality reduction techniques such as PCA or t-SNE were considered to improve feature interpretability and computational efficiency, and if not, what were the limitations in incorporating them?
Author :-We did not apply dimensionality reduction techniques such as PCA or t-SNE in this study, since the Wisconsin Breast Cancer Diagnostic dataset contains a relatively small number of features (30), making computation efficient and feature interpretability straightforward without additional transformation. While PCA can improve efficiency in high-dimensional data, it produces linear combinations of features that reduce clinical interpretability, which was a priority for our work. Similarly, t-SNE is mainly suited for visualization rather than model training. Changes made In the Method Section

Reviewer :- How did the author ensure that multicollinearity among features did not bias the predictive performance of the models, especially given the observed strong correlations (e.g., 0.86 between ‘concavity worst’ and ‘concave points worst’)?
Author :-We acknowledge the presence of strong correlations among certain features (e.g., 0.86 between concavity worst and concave points worst). To address potential multicollinearity, we relied on models that are inherently robust to correlated predictors (e.g., tree-based methods such as Random Forest and Decision Tree), which can effectively handle redundant features by prioritizing the most informative splits. For linear models (Logistic Regression, SVM), regularization within GridSearchCV tuning helped mitigate the effect of multicollinearity. Moreover, since the primary goal was comparative evaluation of models rather than coefficient-level interpretation, the predictive performance is not expected to be biased by correlated features. Changes made In the Method Section

Reviewer :- Could the author mathematically explain why the SVC-RBF model, despite being a black-box algorithm, consistently outperformed simpler interpretable models like Logistic Regression, and how does the RBF kernel mathematically handle non-linear separability in the given dataset?
Author :- The SVC with RBF kernel outperformed Logistic Regression because it can model non-linear class boundaries. Whereas Logistic Regression fits a linear decision boundary in the original feature space, the RBF kernel

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$$

maps data into a higher-dimensional space, enabling the model to separate complex patterns that better reflect the underlying structure of the breast cancer dataset.
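
For readers who want to see the kernel evaluated numerically, the sketch below computes the RBF kernel from its definition and compares it with scikit-learn's rbf_kernel. The three points and the value of gamma are hypothetical and chosen only so the numbers are easy to check by hand.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel


def rbf(xi, xj, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    return np.exp(-gamma * np.sum((xi - xj) ** 2))


# Hypothetical 2-D points; squared distances from the first point are 1 and 25.
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
gamma = 0.5  # illustrative value, not the tuned value from the paper

K_manual = np.array([[rbf(a, b, gamma) for b in X] for a in X])
K_sklearn = rbf_kernel(X, X, gamma=gamma)

print(np.round(K_manual, 4))
```

Similarity is 1 for identical points and decays toward 0 as points move apart, with gamma controlling how quickly; this locality is what allows the resulting decision boundary to bend around clusters.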

Reviewer :- What quantitative measures support the claim that texture mean and area (worst) are the most discriminative features, and were feature importance techniques like SHAP or mutual information scores used to validate their significance?
Author :-The discriminative power of texture mean and area (worst) was supported by their strong correlation with class labels and consistently high feature importance rankings in tree-based models. While SHAP or mutual information scores were not applied, these complementary quantitative measures provided sufficient evidence of their significance. Changes made In the Result Section

Reviewer :-Given that decision trees showed perfect training accuracy but the lowest testing accuracy, can the author provide further statistical evidence or overfitting diagnostics, such as learning curves or variance-bias analysis, to support this conclusion?
Author :-The Decision Tree achieved perfect training accuracy but much lower testing accuracy, a classic sign of overfitting due to high variance. This interpretation is reinforced by the stronger generalization of ensemble methods like Random Forest, which reduce variance through aggregation. Changes added in the limitation section
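
A simple diagnostic of the kind the reviewer asks for is the gap between training and held-out accuracy. The sketch below assumes the scikit-learn copy of the Wisconsin dataset, an 80/20 stratified split, an unpruned tree, and fixed random seeds; none of these are confirmed to match the paper's setup, so the printed numbers are illustrative only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    # Unpruned tree (assumption): grows until every leaf is pure.
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    scores[name] = (train_acc, test_acc)
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f} "
          f"gap={train_acc - test_acc:.3f}")
```

A large train-test gap for the single tree alongside a smaller gap for the forest would be consistent with the high-variance explanation given above; sklearn.model_selection.learning_curve extends the same idea across training-set sizes.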

Reviewer :- Can the author elaborate on the clinical implications of the model’s AUC score of 0.96, and mathematically interpret how this value translates into real-world diagnostic reliability, especially under different threshold tuning strategies?
Author :- We thank the reviewer for this valuable comment. We have clarified in the Discussion that an AUC of 0.96 indicates excellent discriminative ability of the model. Mathematically, it means the model has a 96% probability of ranking a randomly chosen malignant case above a randomly chosen benign case. Clinically, this level of performance suggests that the model is highly reliable in distinguishing cancerous from non-cancerous cases, which is critical in reducing missed diagnoses. Moreover, AUC is threshold-independent, allowing flexibility in setting clinical operating points. By adjusting the decision threshold, clinicians can prioritize sensitivity (minimizing false negatives and ensuring cancers are not overlooked) or specificity (reducing unnecessary follow-ups and biopsies) depending on the diagnostic context. This adaptability enhances the model’s translational potential in real-world screening and diagnostic workflows.
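
The ranking interpretation of AUC and the effect of the decision threshold can both be checked on a small example. The labels and scores below are made up for illustration (1 = malignant, 0 = benign) and are not outputs of the paper's model; the helper sensitivity_specificity is likewise a hypothetical name.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and model scores (higher score = more likely malignant).
y_true = np.array([1, 1, 1, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.35, 0.4, 0.3, 0.2, 0.1])

# AUC as a ranking probability: the fraction of (malignant, benign) pairs in
# which the malignant case gets the higher score (ties count as one half).
pos = scores[y_true == 1]
neg = scores[y_true == 0]
diff = pos[:, None] - neg[None, :]
pairwise = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

auc = roc_auc_score(y_true, scores)


def sensitivity_specificity(threshold):
    """Sensitivity and specificity when scores >= threshold count as malignant."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fn = np.sum(~pred & (y_true == 1))
    tn = np.sum(~pred & (y_true == 0))
    fp = np.sum(pred & (y_true == 0))
    return tp / (tp + fn), tn / (tn + fp)


print(f"pairwise ranking probability = {pairwise:.4f}, AUC = {auc:.4f}")
print("threshold 0.5:", sensitivity_specificity(0.5))
print("threshold 0.3:", sensitivity_specificity(0.3))
```

Here 11 of the 12 malignant-benign pairs are ordered correctly, so both quantities equal 11/12. Lowering the threshold from 0.5 to 0.3 raises sensitivity from 2/3 to 1 while specificity falls from 1 to 1/2, which is the trade-off described in the response; the AUC itself does not change.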

Reviewer :-

Mahanty C, Rajesh T, Govil N, Venkateswarulu N, et al.: Effective Alzheimer’s disease detection using enhanced Xception blending with snapshot ensemble. Scientific Reports. 2024; 14 (1). Publisher Full Text

Malik S, Patro S, Mahanty C, Lasisi A, et al.: Hybrid raven roosting intelligence framework for enhancing efficiency in data clustering. Scientific Reports. 2024; 14 (1). Publisher Full Text

Altameem A, Mahanty C, Poonia R, Saudagar A, et al.: Breast Cancer Detection in Mammography Images Using Deep Convolutional Neural Networks and Fuzzy Ensemble Modeling Techniques. Diagnostics. 2022; 12 (8). Publisher Full Text

Author :- We thank the reviewer for the valuable suggestion. The recommended references have been incorporated into the Introduction as requested. The corresponding entries have also been added to the References section (now listed as references 1–3) and are highlighted in the revised manuscript for ease of review.
Competing Interests: No competing interests were disclosed.

Reviewer Report 03 Jun 2025

Rolando Gonzales Martinez, University of Groningen, Groningen, The Netherlands

Not Approved

https://doi.org/10.5256/f1000research.182136.r385232

The study applied machine learning models for early detection and classification of breast cancer using the Wisconsin Breast Cancer Diagnostic dataset. Five supervised ML algorithms (Logistic Regression, Support Vector Classification (SVC) with linear and radial basis function (RBF) kernels, Decision Tree, and Random Forest) were implemented and evaluated using performance metrics, including accuracy, precision, sensitivity, specificity, and F1 scores.

Major comments:
The study properly implemented the ML algorithms. However, because of the class imbalance in the data, accuracy is not the best measure for comparing models, although the F1-score is acceptable. Also, for the early detection of breast cancer, I advise the authors to include the false negative rate (FNR) and the false omission rate (FOR) as performance metrics for comparing algorithms. The reasons are described below:

1) FNR measures the proportion of actual positive cases of breast cancer that are incorrectly classified as negative. It quantifies the rate of missed positives (Type II errors), so a high FNR implies late detection of anomalies.

2) FOR measures the proportion of negative predictions that are incorrect, that is, the false negatives among all cases classified as negative. As FOR captures failures to detect breast cancer, it is also relevant in the comparison of machine learning and deep learning algorithms, because missing the detection of a positive condition of breast cancer can have significant health consequences for cancer patients.

Thus, I suggest that the authors include these metrics in the core evaluation of their proposed models, as in Gonzales-Martinez and van Dongen (2023) [Ref 1].

The validity of the findings and of the conclusions drawn from them should be evaluated on the basis of the lowest FNR and FOR, and not only on the accuracy of the ML and DL algorithms, since the high level of accuracy found in the paper may be indicative of imbalance problems.

References
1. Gonzales Martinez R, van Dongen D: Deep learning algorithms for the early detection of breast cancer: A comparative study with traditional machine learning. Informatics in Medicine Unlocked. 2023; 41. Publisher Full Text

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Machine learning and deep learning applied to health

Author Response 10 Sep 2025

Dola Saha, Department of Health Information Management, Manipal College of Health Professions, Manipal Academy of Higher Education, Manipal, 576104, India

Response to Reviewer Comments:

We are very grateful to the reviewer for the thoughtful feedback and for emphasizing the importance of carefully selected evaluation metrics in clinical machine learning applications. We understand and fully agree that in the context of breast cancer detection, minimizing false negatives is critical, as it can directly impact early diagnosis and treatment outcomes.

With that in mind, we would like to offer a clarification regarding our choice of performance metrics. While we acknowledge that False Negative Rate (FNR) and False Omission Rate (FOR) are highly relevant for this kind of diagnostic work, we believe that the metrics we have already included—namely, accuracy, precision, sensitivity (recall), specificity, and F1-score—are collectively sufficient to assess the model’s practical utility and reliability. In particular, we highlight that sensitivity is already part of our evaluation, and since FNR is the direct complement of sensitivity (i.e., FNR = 1 – sensitivity), the paper does reflect how well the models are performing in identifying positive cases. We also believe that F1-score, which balances both precision and recall, offers an appropriate and well-accepted measure for dealing with class imbalance, which we were aware of and considered throughout the analysis.

As for the suggestion to include FOR, we appreciate its relevance. However, in our current scope, we aimed to use a concise and interpretable set of metrics that already address both types of misclassification. Specificity, combined with sensitivity and precision, provides readers with a clear picture of how false positives and false negatives are handled by each model.
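
The relationships among these metrics can be made concrete from a confusion matrix. The sketch below uses made-up predictions (1 = malignant, 0 = benign), not results from the paper. It shows that FNR is the complement of sensitivity, whereas FOR is the complement of the negative predictive value and is therefore not reported directly by sensitivity or specificity.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions: 10 malignant, 20 benign cases.
y_true = np.array([1] * 10 + [0] * 20)
y_pred = np.array([1] * 8 + [0] * 2 + [0] * 18 + [1] * 2)

# With labels=[0, 1], ravel() returns tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)           # recall on malignant cases
specificity = tn / (tn + fp)
fnr = fn / (fn + tp)                   # missed cancers among actual cancers
false_omission_rate = fn / (fn + tn)   # missed cancers among negative calls
npv = tn / (tn + fn)                   # negative predictive value

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")
print(f"FNR={fnr:.2f} FOR={false_omission_rate:.2f} NPV={npv:.2f}")
```

In this toy example the FNR is 0.20 and the FOR is 0.10; the two differ because FOR depends on how many cases the model calls negative, which in turn depends on class prevalence.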

Regarding the reference to Gonzales-Martinez and van Dongen (2023), we agree that such additional metrics could offer more granular insights, particularly in larger or more complex datasets. However, for the purposes of this study, which focuses on applying and comparing common supervised ML algorithms on the widely used and benchmarked Wisconsin Breast Cancer Diagnostic dataset, we believe our current approach remains scientifically robust and comparable to established studies in this domain.

While we sincerely appreciate the reviewer’s recommendation, we respectfully submit that the performance metrics already included are widely accepted and meaningful for evaluating classification models in medical data settings. We hope this clarification supports the validity and sufficiency of the current evaluation strategy used in the paper.
Competing Interests: No competing interests whatsoever to declare.

Version 2

PUBLISHED 10 Apr 2025

Revised

Reviewer Report 12 May 2025

Abicumaran Uthamacumaran, McGill University (Ringgold ID: 5620), Montréal, Québec, Canada

Approved with Reservations

https://doi.org/10.5256/f1000research.180235.r377120

The study has addressed the previous concerns about the data. With five-fold cross-validation now in place, the overfitting has been addressed, as clearly shown by the 1.0 training accuracy of the Decision Tree. While the general issue with ML studies that lack a validation dataset is this overfitting, i.e., the model might memorize the data rather than learn distinguishable patterns, the improvements are of a good standard. This has been addressed in the limitations and future directions.

This study relies primarily on visual methods, such as violin plots, for feature selection. Although this has proven effective in terms of ML performance, there might be nonlinear patterns that this method misses but that an algorithm like SVC-RBF is 'capturing'.

I approve this paper after a few clarifications or corrections:

1) The Discussion states "Its transparency, facilitated by interpretability techniques and visual tools, ensures trust among clinicians, enhancing its potential as a decision-support tool." This is misleading. SVC-RBF in itself is a 'black box' ML approach. It is not easily 'explainable' or interpretable. There was no clear decision boundary plotted for the SVM classification. Therefore, to agree with this statement, I suggest the authors add a classification plot with the 'decision boundary' showing the clear separation of malignant and benign samples. Alternatively, some other 'interpretability technique' should be offered (e.g., Gini entropy, feature importance, salience maps, etc.).

2) The results over-emphasize accuracy as the primary metric. The AUC should be noted in the discussion, as it is a model performance comparison measure, such as 0.96 for the RBF-SVC. Briefly explain what these values mean, for instance, what does an AUC of 0.96 indicate about that algorithm on this dataset.

3) The Methods can be made more transparent. For instance, what are the hyperparameters of the algorithms? How was the cross-validation performed? etc. Sections 2.3-2.5 can use some more details to ensure replicability. Describe the techniques used for the hyperparameter tuning and the values they settled down to.

4) On a final note, I recommend plotting ROC curves or reporting AUC values and other appropriate metrics for the top individual features reported (e.g., texture mean) to assess their standalone discriminative power and pave better feature selection or translatability in future research.

Overall, great work. Looking forward to the final form.

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: AI, Machine learning, Bioinformatics, and Systems Oncology

Author Response 16 Jun 2025

Dola Saha, Department of Health Information Management, Manipal College of Health Professions, Manipal Academy of Higher Education, Manipal, 576104, India

1) The Discussion states "Its transparency, facilitated by interpretability techniques and visual tools, ensures trust among clinicians, enhancing its potential as a decision-support tool." This is misleading. SVC-RBF in itself is a 'black box' ML approach. It is not easily 'explainable' or interpretable. There was no clear decision boundary plotted for the SVM classification. Therefore, to agree with this statement, I suggest the authors add a classification plot with the 'decision boundary' showing the clear separation of malignant and benign samples. Alternatively, some other 'interpretability technique' should be offered (e.g., Gini entropy, feature importance, salience maps, etc.).
Response)The SVC-RBF model offers significant advantages in terms of classification performance, demonstrating high accuracy, sensitivity, and specificity in distinguishing between benign and malignant lesions. While the model operates as a black-box algorithm with limited inherent interpretability, its strong predictive capability makes it a valuable candidate for decision-support applications in clinical settings. To enhance clinician trust and eventual translatability, future work will focus on integrating model-agnostic interpretability techniques, such as SHAP values or feature attribution methods, to improve transparency and support clinical decision-making.

2) The results over-emphasize accuracy as the primary metric. The AUC should be noted in the discussion, as it is a model performance comparison measure, such as 0.96 for the RBF-SVC. Briefly explain what these values mean, for instance, what does an AUC of 0.96 indicate about that algorithm on this dataset.
Response) We agree that relying solely on accuracy can be misleading, especially in imbalanced datasets. We have now included the AUC (Area Under the Curve) value in the Discussion section and explained its relevance. Specifically, for the SVC-RBF model, an AUC of 0.96 indicates excellent discriminatory ability: summarized across all possible classification thresholds, the model has a 96% probability of ranking a randomly chosen malignant case above a randomly chosen benign case. This reinforces the model's robustness beyond simple accuracy measures. The revised discussion reflects this clarification.

3) The Methods can be made more transparent. For instance, what are the hyperparameters of the algorithms? How was the cross-validation performed? etc. Sections 2.3-2.5 can use some more details to ensure replicability. Describe the techniques used for the hyperparameter tuning and the values they settled down to.
Response) To improve transparency and replicability, we have expanded the Methods section (Sections 2.3–2.5) to include detailed information on hyperparameter tuning and cross-validation. Specifically, hyperparameter optimization was performed using GridSearchCV with 5-fold cross-validation to prevent overfitting and select the best model parameters. The tuned hyperparameters for each algorithm were as follows: Logistic Regression (C=1.0), Support Vector Classifiers (C=1.0 for linear kernel, and C=1.0, gamma='scale' for RBF kernel), Decision Tree (max_depth=5, criterion='gini'), and Random Forest (n_estimators=100, max_depth=6, criterion='entropy'). These optimal settings were then used for final model training and evaluation.
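
For replicability, the reported settings can be written down as scikit-learn estimators. This is a sketch, not the authors' code: the hyperparameter values are those listed in the response above, while max_iter, random_state, and the dictionary keys are assumptions added so the example runs deterministically.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hyperparameters as reported in the response; max_iter and random_state
# are illustrative assumptions, not values stated by the authors.
models = {
    "logistic_regression": LogisticRegression(C=1.0, max_iter=1000),
    "svc_linear": SVC(kernel="linear", C=1.0),
    "svc_rbf": SVC(kernel="rbf", C=1.0, gamma="scale"),
    "decision_tree": DecisionTreeClassifier(
        max_depth=5, criterion="gini", random_state=0
    ),
    "random_forest": RandomForestClassifier(
        n_estimators=100, max_depth=6, criterion="entropy", random_state=0
    ),
}

for name, model in models.items():
    print(name, model.get_params())
```

Each estimator can then be fitted on the scaled training split and evaluated on the held-out split with the metrics reported in the paper.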

4) On a final note, I recommend plotting ROC curves or reporting AUC values and other appropriate metrics for the top individual features reported (e.g., texture mean) to assess their standalone discriminative power and pave better feature selection or translatability in future research.
Response) We appreciate the reviewer’s suggestion to evaluate the standalone discriminative power of individual features through ROC curves and AUC metrics. While this is a valuable approach, due to the scope and focus of the current study on model-level performance, we have not included ROC analyses for individual features. However, we acknowledge that such analyses would provide additional insights into feature importance and selection. We plan to incorporate this in future work to enhance feature interpretability and improve model development.
Competing Interests: Nil

COMMENTS ON THIS REPORT

Author Response 16 Jun 2025

Dola Saha, Department of Health Information Management, Manipal College of Health Professions, Manipal Academy of Higher Education, Manipal, 576104, India

16 Jun 2025

Author Response

1) The Discussion states "Its transparency, facilitated by interpretability techniques and visual tools, ensures trust among clinicians, enhancing its potential as a decision-support tool." This is misleading. SVC-RBF in itself is a 'black box' ML approach. It is not easily 'explainable' or interpretable. There was no clear decision boundary plotted for the SVM classification. Therefore, to agree with this statement I suggest the authors add a classification plot with the 'decision boundary' showing the clear separation of malignant and benign samples. Alternatively, some other interpretability technique should be offered (e.g., Gini entropy, feature importance, salience maps, etc.).
Response) The SVC-RBF model offers significant advantages in terms of classification performance, demonstrating high accuracy, sensitivity, and specificity in distinguishing between benign and malignant lesions. While the model operates as a black-box algorithm with limited inherent interpretability, its strong predictive capability makes it a valuable candidate for decision-support applications in clinical settings. To enhance clinician trust and eventual translatability, future work will focus on integrating model-agnostic interpretability techniques, such as SHAP values or feature attribution methods, to improve transparency and support clinical decision-making.

2) The results over-emphasize accuracy as the primary metric. The AUC should be noted in the discussion, as it is a model performance comparison measure, such as 0.96 for the RBF-SVC. Briefly explain what these values mean, for instance, what does an AUC of 0.96 indicate about that algorithm on this dataset.
Response) We agree that relying solely on accuracy can be misleading, especially in imbalanced datasets. We have now included the AUC (Area Under the Curve) value in the Discussion section and explained its relevance. Specifically, for the SVC-RBF model, an AUC of 0.96 indicates excellent discriminatory ability: a randomly chosen malignant case receives a higher model score than a randomly chosen benign case about 96% of the time, which summarizes performance across all possible classification thresholds. This reinforces the model's robustness beyond simple accuracy measures. The revised discussion reflects this clarification.
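As a hedged illustration of what an AUC value measures: the AUC is conventionally read as the probability that a randomly chosen malignant case receives a higher model score than a randomly chosen benign case, so it describes ranking quality rather than the fraction of cases classified correctly. The sketch below computes it by direct pairwise comparison. The labels and scores are invented toy values, not data from the study, and `pairwise_auc` is a hypothetical helper name.

```python
from itertools import product


def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (invented values): 1 = malignant, 0 = benign.
toy_labels = [1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(f"AUC = {toy_auc:.3f}")
```

In this toy case 11 of the 12 malignant/benign pairs are ordered correctly, so the AUC is 11/12; no classification threshold is involved at any point.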

3) The Methods can be made more transparent. For instance, what are the hyperparameters of the algorithms? How was the cross-validation performed? etc. Sections 2.3-2.5 can use some more details to ensure replicability. Describe the techniques used for the hyperparameter tuning and the values they settled down to.
Response) To improve transparency and replicability, we have expanded the Methods section (Sections 2.3–2.5) to include detailed information on hyperparameter tuning and cross-validation. Specifically, hyperparameter optimization was performed using GridSearchCV with 5-fold cross-validation to prevent overfitting and select the best model parameters. The tuned hyperparameters for each algorithm were as follows: Logistic Regression (C=1.0), Support Vector Classifiers (C=1.0 for linear kernel, and C=1.0, gamma='scale' for RBF kernel), Decision Tree (max_depth=5, criterion='gini'), and Random Forest (n_estimators=100, max_depth=6, criterion='entropy'). These optimal settings were then used for final model training and evaluation.
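For readers who want to see the shape of such a search, the following is a minimal sketch of GridSearchCV with 5-fold cross-validation for the RBF kernel. It is not the authors' code: it assumes scikit-learn's bundled copy of the Wisconsin diagnostic data (`load_breast_cancer`) in place of the authors' CSV file, the parameter grid is purely illustrative because the searched values are not stated above, and the stratified split and the pipeline are choices made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

# Assumption: scikit-learn's bundled Wisconsin diagnostic data stands in for
# the authors' CSV file. In this copy, 0 = malignant and 1 = benign.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)

# Scaling lives inside the pipeline so each CV fold fits its own scaler.
pipeline = Pipeline([("scale", RobustScaler()), ("svc", SVC(kernel="rbf"))])

# Illustrative grid only; the values searched by the authors are not stated.
param_grid = {"svc__C": [0.1, 1.0, 10.0], "svc__gamma": ["scale", 0.01, 0.1]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="roc_auc")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Mean cross-validated AUC: {search.best_score_:.3f}")
print(f"Held-out test AUC: {search.score(X_test, y_test):.3f}")
```

Keeping the scaler inside the pipeline means the held-out fold in each split is never used to fit the scaling statistics, which is one common way to avoid leakage during tuning.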

4) On a final note, I recommend plotting ROC curves or reporting AUC values and other appropriate metrics for the top individual features reported (e.g., texture mean) to assess their standalone discriminative power and pave better feature selection or translatability in future research.
Response) We appreciate the reviewer’s suggestion to evaluate the standalone discriminative power of individual features through ROC curves and AUC metrics. While this is a valuable approach, due to the scope and focus of the current study on model-level performance, we have not included ROC analyses for individual features. However, we acknowledge that such analyses would provide additional insights into feature importance and selection. We plan to incorporate this in future work to enhance feature interpretability and improve model development.
Competing Interests: Nil

Reviewer Report 07 May 2025

Rolando Gonzales Martinez, University of Groningen, Groningen, The Netherlands

Not Approved

https://doi.org/10.5256/f1000research.180235.r377119

As in my previous review, I would like to highlight that, due to the data imbalance, accuracy is not the best measure for comparing models. While the F1-score is acceptable, for the early detection of breast cancer I again advise the authors to include the false negative rate (FNR) and the false omission rate (FOR) as performance metrics for comparing algorithms, as in Gonzales-Martinez and van Dongen (2023) [Ref 1]. The reasons are described below:

1) FNR measures the proportion of actual positive cases of breast cancer that are incorrectly classified as negative, FN / (FN + TP). It quantifies the rate of missed positives (Type II errors), and hence a high FNR implies late detection of anomalies.

2) FOR measures the proportion of negative predictions that are incorrect omissions, FN / (FN + TN), that is, cases classified as negative that are in fact positive. As FOR captures failures to detect breast cancer, it is also relevant in the comparison of machine learning and deep learning algorithms, because missing the detection of a positive condition of breast cancer can have significant health consequences for cancer patients.

The validity of the findings and the conclusions linked to them should be evaluated on the basis of the lowest FNR and FOR, and not only on the accuracy of the ML and DL algorithms, since the high level of accuracy found in the paper may be indicative of imbalance problems.
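The two metrics can be read directly off a confusion matrix. The sketch below is a minimal illustration using the standard definitions, FNR = FN / (FN + TP) and FOR = FN / (FN + TN). The labels are invented toy values rather than results from the paper, and `fnr_and_for` is a hypothetical helper name.

```python
def fnr_and_for(y_true, y_pred, positive=1):
    """Return (FNR, FOR) for binary labels.

    FNR = FN / (FN + TP): share of actual positives that were missed.
    FOR = FN / (FN + TN): share of negative predictions that were wrong.
    """
    tp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred != positive:
            tn += 1
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    false_omission_rate = fn / (fn + tn) if (fn + tn) else float("nan")
    return fnr, false_omission_rate


# Toy example (invented values): 1 = malignant, 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
fnr, false_omission_rate = fnr_and_for(y_true, y_pred)
print(f"FNR = {fnr:.3f}, FOR = {false_omission_rate:.3f}")
```

Here one of four malignant cases is missed (FNR = 0.25) and one of six negative predictions is wrong (FOR ≈ 0.167), even though overall accuracy is 80%.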

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Machine learning and deep learning applied to health

VERSION 1

PUBLISHED 05 Feb 2025

Reviewer Report 11 Mar 2025

Rolando Gonzales Martinez, University of Groningen, Groningen, The Netherlands

Not Approved

https://doi.org/10.5256/f1000research.177056.r364943

Is the work clearly and accurately presented and does it cite the current literature? Partly
Is the study design appropriate and is the work technically sound? Partly
Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? Partly
Are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions drawn adequately supported by the results? Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Machine learning and deep learning applied to health

Reviewer Report 10 Mar 2025

Abicumaran Uthamacumaran, McGill University (Ringgold ID: 5620), Montréal, Québec, Canada

Approved with Reservations

https://doi.org/10.5256/f1000research.177056.r367118

The paper presents a clear and well-grounded objective, with adequate feature selection and methodology. The use of multiple ML metrics is appreciated and adds value to support the thesis. However, some central issues need to be revised:

1) Literature Gaps: A few reviews on machine learning approaches in cancer diagnostics as an emerging paradigm can greatly benefit readership and the rationale for your approach. Some examples are:

Hussain, S., et al., (2024). [Ref-1]

Uthamacumaran, A., et al., (2023). [Ref-2]

2) A Limitations section is needed and should address the following:

- The study relies solely on the Wisconsin dataset. Does this generalize to other datasets? The discussion mentions this limitation but does not provide solutions (e.g., external validation with larger datasets, multimodal imaging data). Either present a validation or argue for why this validation holds.

- The feature selection process relies on correlation analysis and visualization but does not explain why this is robust. For instance, what would a PCA analysis show? See Uthamacumaran et al. above for such techniques in dimensionality reduction.

- Interpretability of the SVC kernel choice. You said RBF is accurate, but why? How about a simpler model like logistic regression? Some explanation of the accuracy or explainability of the RBF's optimal performance is needed.

3) In extension to point #2, the results are lacking validation ROC curves. There are no confusion matrices or ROC curves present. You did not use validation with the training-testing split; for instance, what would happen with a five-fold cross-validation? This table or these plots should be presented in the revision for repeatability and validity.

4) Some graphs (e.g., violin plots) are useful but lack quantitative annotations (e.g., exact p-values, confidence intervals for feature separability). The same applies to the first bar plot, which has no confidence intervals.

5) The code is not clear. The link to the data source does not provide the final code used for all the analyses. I suggest presenting the code and the exact datasets for repeatability.
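On the confidence intervals requested in point 4, one common option is a percentile bootstrap over the test-set predictions. The sketch below is only an illustration of that technique under stated assumptions: the labels are invented toy values, `bootstrap_accuracy_ci` is a hypothetical helper name, and the number of resamples and the seed are arbitrary choices.

```python
import random


def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Return (accuracy, lower, upper) using a percentile bootstrap."""
    rng = random.Random(seed)
    # 1 where the prediction matches the true label, 0 otherwise.
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    n = len(correct)
    stats = []
    for _ in range(n_resamples):
        # Resample test cases with replacement and recompute accuracy.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lower, upper


# Toy example (invented values): 50 test cases, 46 predicted correctly.
y_true = [1] * 20 + [0] * 30
y_pred = [1] * 18 + [0] * 2 + [0] * 28 + [1] * 2
acc, lower, upper = bootstrap_accuracy_ci(y_true, y_pred)
print(f"Accuracy = {acc:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

The same resampling loop can wrap any other metric, such as sensitivity or AUC, to annotate bar plots with interval estimates.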

Overall, this paper presents great value and could be indexed upon addressing these ML practices concerns. Best of luck.

Is the work clearly and accurately presented and does it cite the current literature? Partly
Is the study design appropriate and is the work technically sound? Yes
Are sufficient details of methods and analysis provided to allow replication by others? No
If applicable, is the statistical analysis and its interpretation appropriate? Partly
Are all the source data underlying the results available to ensure full reproducibility? Partly
Are the conclusions drawn adequately supported by the results? Yes

References

1. Hussain S, Ali M, Naseem U, Nezhadmoghadam F, et al.: Breast cancer risk prediction using machine learning: a systematic review. Front Oncol. 2024; 14: 1343627.
2. Uthamacumaran A, Abdouh M, Sengupta K, Gao Z, et al.: Machine intelligence-driven classification of cancer patients-derived extracellular vesicles using fluorescence correlation spectroscopy: results from a pilot study. Neural Computing and Applications. 2023; 35 (11): 8407-8422.

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: AI, machine learning, bioinformatics, and precision oncology

Author Response 24 Jun 2025

Dola Saha, Department of Health Information Management, Manipal College of Health Professions, Manipal Academy of Higher Education, Manipal, 576104, India


Sl. No. | Reviewer's Comments | Authors' Response

1. Reviewer's comment: Literature Gaps: A few reviews on machine learning approaches in cancer diagnostics as an emerging paradigm can greatly benefit readership and the rationale for your approach. Some examples are:
Hussain, S., et al., (2024). [Ref-1]
Uthamacumaran, A., et al., (2023). [Ref-2]
1. Hussain S, Ali M, Naseem U, Nezhadmoghadam F, et al.: Breast cancer risk prediction using machine learning: a systematic review. Front Oncol. 2024; 14: 1343627.
2. Uthamacumaran A, Abdouh M, Sengupta K, Gao Z, et al.: Machine intelligence-driven classification of cancer patients-derived extracellular vesicles using fluorescence correlation spectroscopy: results from a pilot study. Neural Computing and Applications. 2023; 35 (11): 8407-8422.
Authors' response: Thank you for your valuable comments. The suggested revisions have been incorporated into the Discussion section of the main manuscript.

2. Reviewer's comment: A Limitations section is needed and should address the following:
- The study relies solely on the Wisconsin dataset. Does this generalize to other datasets? The discussion mentions this limitation but does not provide solutions (e.g., external validation with larger datasets, multimodal imaging data). Either present a validation or argue for why this validation holds.
- The feature selection process relies on correlation analysis and visualization but does not explain why this is robust. For instance, what would a PCA analysis show? See Uthamacumaran et al. above for such techniques in dimensionality reduction.
- Interpretability of the SVC kernel choice. You said RBF is accurate, but why? How about a simpler model like logistic regression? Some explanation of the accuracy or explainability of the RBF's optimal performance is needed.
Authors' response: Thank you for your valuable comments. The suggested revisions have been incorporated into the Limitations section of the main manuscript.

3. Reviewer's comment: In extension to point #2, the results are lacking validation ROC curves. There are no confusion matrices or ROC curves present. You did not use validation with the training-testing split; for instance, what would happen with a five-fold cross-validation? This table or these plots should be presented in the revision for repeatability and validity.
Authors' response: Thank you for your valuable comments. The suggested revisions have been incorporated into the Results section of the main manuscript.

4. Reviewer's comment: Some graphs (e.g., violin plots) are useful but lack quantitative annotations (e.g., exact p-values, confidence intervals for feature separability). The same applies to the first bar plot, which has no confidence intervals.
Authors' response: Thank you for your valuable comments. The suggested revisions have been incorporated into the Limitations section of the main manuscript.

5. Reviewer's comment: The code is not clear. The link to the data source does not provide the final code used for all the analyses. I suggest presenting the code and the exact datasets for repeatability.
Authors' response: Thank you for your valuable comments. The code is given below in this document.

The code used in this study is presented below:
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import seaborn as sns
import matplotlib.pyplot as plt
# Loading the dataset
data = pd.read_csv('cancer_data.csv')
print(data.info())

# Preprocessing the dataset
# Dropping redundant columns
data = data.drop(['id', 'Unnamed: 32'], axis=1)

# Label encoding the target variable (diagnosis)
le = LabelEncoder()
data['diagnosis'] = le.fit_transform(data['diagnosis']) # Malignant (M): 1, Benign (B): 0

# Splitting features (X) and target variable (y)
X = data.drop('diagnosis', axis=1)
y = data['diagnosis']

# Train-test split (60:40 ratio)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
# Feature scaling using RobustScaler
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Visualizing key features (Violin plot for texture mean)
sns.violinplot(x='diagnosis', y='texture_mean', data=data)
plt.title('Violin Plot for Texture Mean vs Diagnosis')
plt.show()

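# Visualizing correlation (Joint plot for concavity worst vs concave points worst)
# Column names must match the CSV header; some distributions of this dataset
# name the second column 'concave points_worst' (with a space).
# jointplot creates its own figure, so the title is set on that figure
# rather than with plt.title, which would label only a marginal axis.
g = sns.jointplot(x='concavity_worst', y='concave_points_worst', data=data, kind='scatter')
g.fig.suptitle('Joint Plot for Concavity Worst vs Concave Points Worst', y=1.02)
plt.show()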

# Model development and training
# Logistic Regression
log_reg = LogisticRegression()
log_reg.fit(X_train_scaled, y_train)
log_reg_pred = log_reg.predict(X_test_scaled)

# SVC - Linear Kernel
svc_linear = SVC(kernel='linear')
svc_linear.fit(X_train_scaled, y_train)
svc_linear_pred = svc_linear.predict(X_test_scaled)

# SVC - RBF Kernel
svc_rbf = SVC(kernel='rbf')
svc_rbf.fit(X_train_scaled, y_train)
svc_rbf_pred = svc_rbf.predict(X_test_scaled)

# Decision Tree Classifier
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train_scaled, y_train)
decision_tree_pred = decision_tree.predict(X_test_scaled)

# Random Forest Classifier
random_forest = RandomForestClassifier()
random_forest.fit(X_train_scaled, y_train)
random_forest_pred = random_forest.predict(X_test_scaled)

# Evaluating models
models = {
    "Logistic Regression": log_reg_pred,
    "SVC Linear": svc_linear_pred,
    "SVC RBF": svc_rbf_pred,
    "Decision Tree": decision_tree_pred,
    "Random Forest": random_forest_pred
}

for name, predictions in models.items():
    print(f"{name} Performance Metrics:")
    print(classification_report(y_test, predictions))
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, predictions))
    print("-" * 50)
Competing Interests: Nil

COMMENTS ON THIS REPORT

Author Response 24 Jun 2025

Dola Saha, Department of Health Information Management, Manipal College of Health Professions, Manipal Academy of Higher Education, Manipal, 576104, India

24 Jun 2025

Author Response

Sl. No.
Reviewers Comments
Authors Response

1 Literature Gaps: A few reviews on machine learning approaches in cancer diagnostics as emerging paradigm can greatly benefit readership and ... Continue reading Sl. No.
Reviewers Comments
Authors Response

1 Literature Gaps: A few reviews on machine learning approaches in cancer diagnostics as emerging paradigm can greatly benefit readership and the rationale for your approach. Some examples are:
Hussain, S., et al., (2024). [Ref-1]
Uthamacumaran, A., et al., (2023). [Ref-2]
1. Hussain S, Ali M, Naseem U, Nezhadmoghadam F, et al.: Breast cancer risk prediction using machine learning: a systematic review.Front Oncol. 2024; 14: 1343627 PubMed Abstract | Publisher Full Text
2. Uthamacumaran A, Abdouh M, Sengupta K, Gao Z, et al.: Machine intelligence-driven classification of cancer patients-derived extracellular vesicles using fluorescence correlation spectroscopy: results from a pilot study. Neural Computing and Applications. 2023; 35 (11): 8407-8422 Publisher Full Text
Thank you for your valuable comments. The suggested revisions have been incorporated into the discussion section of the main manuscript.

2. Limitations section is needed and should address the following:
The study relies solely on the Wisconsin dataset. Does this generalize to other datasets? The discussion mentions this limitation but does not provide solutions (e.g., external validation with larger datasets, multimodal imaging data). Either present a validation or argue for why this validation holds.
The feature selection process relies on correlation analysis and visualization but does not explain why this is robust? For instance, what would a PCA analysis do. See Uthamacumaran et al. above for such techniques in dimensionality reduction.
Interpretability of the SVC kernel choice. You said RBF is accurate but why? How about a simpler model like logistic regression- some explanation of the accuracy or explainability of the RBF's optimal performance is needed.
Thank you for your valuable comments. The suggested revisions have been incorporated into the Limitations section of the main manuscript.

3. In extension to point #2, the results are lacking validation ROC curves. There are no confusion matrices or ROC curves present. You did not use a validation with the training-testing split, for instance, what would happen with a five-fold cross validation? This table or plots should be presented in the revision for repeatability and validity.
Thank you for your valuable comments. The suggested revisions have been incorporated into the Results section of the main manuscript.

4. Some graphs (e.g., violin plots) are useful but lack quantitative annotations (e.g., exact p-values, confidence intervals for feature separability). Same for the first bar plot, there are no confidence intervals.
Thank you for your valuable comments. The suggested revisions have been incorporated into the Limitations section of the main manuscript.

5 The codes are not clear. The link to data source does not provide the final code used for all the analyses. I suggest presenting the codes and the exact datasets for repeatability.
Thank you for your valuable comments. The codes are given below in this document.

The codes used in this study are presented below
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import seaborn as sns
import matplotlib.pyplot as plt
# Loading the dataset
data = pd.read_csv('cancer_data.csv')
print(data.info())

# Preprocessing the dataset
# Dropping redundant columns
data = data.drop(['id', 'Unnamed: 32'], axis=1)

# Label encoding the target variable (diagnosis)
le = LabelEncoder()
data['diagnosis'] = le.fit_transform(data['diagnosis']) # Malignant (M): 1, Benign (B): 0

# Splitting features (X) and target variable (y)
X = data.drop('diagnosis', axis=1)
y = data['diagnosis']

# Train-test split (60:40 ratio)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
# Feature scaling using RobustScaler
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Visualizing key features (Violin plot for texture mean)
sns.violinplot(x='diagnosis', y='texture_mean', data=data)
plt.title('Violin Plot for Texture Mean vs Diagnosis')
plt.show()

# Visualizing correlation (Joint plot for concavity worst vs concave points worst)
sns.jointplot(x='concavity_worst', y='concave_points_worst', data=data, kind='scatter')
plt.title('Joint Plot for Concavity Worst vs Concave Points Worst')
plt.show()

# Model development and training
# Logistic Regression
log_reg = LogisticRegression()
log_reg.fit(X_train_scaled, y_train)
log_reg_pred = log_reg.predict(X_test_scaled)

# SVC - Linear Kernel
svc_linear = SVC(kernel='linear')
svc_linear.fit(X_train_scaled, y_train)
svc_linear_pred = svc_linear.predict(X_test_scaled)

# SVC - RBF Kernel
svc_rbf = SVC(kernel='rbf')
svc_rbf.fit(X_train_scaled, y_train)
svc_rbf_pred = svc_rbf.predict(X_test_scaled)

# Decision Tree Classifier
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train_scaled, y_train)
decision_tree_pred = decision_tree.predict(X_test_scaled)

# Random Forest Classifier
random_forest = RandomForestClassifier()
random_forest.fit(X_train_scaled, y_train)
random_forest_pred = random_forest.predict(X_test_scaled)

# Evaluating models
models = {
    "Logistic Regression": log_reg_pred,
    "SVC Linear": svc_linear_pred,
    "SVC RBF": svc_rbf_pred,
    "Decision Tree": decision_tree_pred,
    "Random Forest": random_forest_pred
}

for name, predictions in models.items():
    print(f"{name} Performance Metrics:")
    print(classification_report(y_test, predictions))
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, predictions))
    print("-" * 50)
Sl. No. | Reviewer's Comments | Authors' Response

1. Literature gaps: Citing a few reviews on machine learning approaches in cancer diagnostics as an emerging paradigm would benefit readers and strengthen the rationale for your approach. Some examples are:
Hussain, S., et al. (2024) [Ref-1]
Uthamacumaran, A., et al. (2023) [Ref-2]
[Ref-1] Hussain S, Ali M, Naseem U, Nezhadmoghadam F, et al.: Breast cancer risk prediction using machine learning: a systematic review. Front Oncol. 2024; 14: 1343627.
[Ref-2] Uthamacumaran A, Abdouh M, Sengupta K, Gao Z, et al.: Machine intelligence-driven classification of cancer patients-derived extracellular vesicles using fluorescence correlation spectroscopy: results from a pilot study. Neural Computing and Applications. 2023; 35 (11): 8407-8422.
Thank you for your valuable comments. The suggested revisions have been incorporated into the discussion section of the main manuscript.

2. A Limitations section is needed and should address the following:
The study relies solely on the Wisconsin dataset. Does this generalize to other datasets? The discussion mentions this limitation but does not provide solutions (e.g., external validation with larger datasets, multimodal imaging data). Either present a validation or argue for why this validation holds.
The feature selection process relies on correlation analysis and visualization, but the manuscript does not explain why this is robust. For instance, what would a PCA analysis show? See Uthamacumaran et al. above for such dimensionality reduction techniques.
Interpretability of the SVC kernel choice: you state that RBF is accurate, but why? How does it compare with a simpler model such as logistic regression? Some explanation of the RBF kernel's optimal performance, in terms of accuracy or explainability, is needed.
Thank you for your valuable comments. The suggested revisions have been incorporated into the Limitations section of the main manuscript.

3. Extending point 2, the results lack validation: no confusion matrices or ROC curves are presented. You did not validate beyond the single training-testing split; for instance, what would happen with five-fold cross-validation? These tables or plots should be presented in the revision for repeatability and validity.
Thank you for your valuable comments. The suggested revisions have been incorporated into the Results section of the main manuscript.

4. Some graphs (e.g., violin plots) are useful but lack quantitative annotations (e.g., exact p-values, confidence intervals for feature separability). The first bar plot likewise has no confidence intervals.
Thank you for your valuable comments. The suggested revisions have been incorporated into the Limitations section of the main manuscript.

5. The code is not clear. The link to the data source does not provide the final code used for all the analyses. I suggest presenting the code and the exact datasets for repeatability.
Thank you for your valuable comments. The codes are given below in this document.

Competing Interests: Nil

Version 6

VERSION 6 PUBLISHED 05 Feb 2025

Open Peer Review

Reviewer Status

Reviewer Reports

	Invited Reviewers
	1	2	3	4	5
Version 6 (revision) 23 Jun 26
Version 5 (revision) 09 May 26	read			read	read
Version 4 (revision) 05 Sep 25	read		read
Version 3 (revision) 16 May 25		read	read
Version 2 (revision) 10 Apr 25	read	read
Version 1 05 Feb 25	read	read

Abicumaran Uthamacumaran, McGill University (Ringgold ID: 5620), Montréal, Canada
Rolando Gonzales Martinez, University of Groningen, Groningen, The Netherlands
Chandrakanta Mahanty, GITAM Deemed to Be University, Visakhapatnam, India
Musatafa Abbas Abbood Albadr, Basrah University for Oil and Gas, Al Basrah, Iraq
Manna Debnath, Charotar University of Science and Technology, Changa, India

Reviewer Report

15 Jun 2026 | for Version 5

Abicumaran Uthamacumaran, McGill University (Ringgold ID: 5620), Montréal, Québec, Canada

Not Approved

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Cancer Research; Artificial Intelligence; Systems medicine

Responses (1)

Author Response

23 Jun 2026

Dola Saha, Department of Health Information Management, Manipal College of Health Professions, Manipal Academy of Higher Education, Manipal, 576104, India

1) Reviewer: Figure 1 shows more benign (B) than malignant (M) cases, but the manuscript states "59% malignant and 41% benign". Why is that? The x-axis could also simply be labelled Malignant and Benign for better readability.
Author: Thank you for this observation. We have carefully reviewed Figure 1 and the corresponding text. The discrepancy has been corrected in the revised manuscript, and the figure has been updated accordingly. These changes will be reflected in the updated version of the manuscript.

2) Reviewer: Table 2 still does not report specificity correctly. The authors mentioned using the correct equation, TN / (TN + FP). Applying this to Table 1, taking logistic regression as an example, gives specificity = 83 / (83 + 2) = 0.98. This is not what is reported in Table 2. There may also be inconsistencies in the counts: the confusion matrix gives 229, whereas Table 2 shows 228 as the sum. Please check.
Author: Thank you for identifying this issue. We have re-examined the specificity calculations and verified the confusion matrix values.

3) Reviewer: Without confidence intervals (CIs), Figure 4 is not a forest plot. I suggest renaming it a performance-comparison dot plot; the most robust option would be to include cross-validation means with 95% CIs or error bars.
Author: Thank you for identifying this issue. The discrepancy has been corrected in the revised manuscript. These changes will be reflected in the updated version of the manuscript.

4) Reviewer: For the hyperparameter section, please confirm that scaling and feature selection were fitted only on the training folds and that no test-set bias occurred. A single statement is sufficient to confirm that there was no data leakage in training.
Author: Thank you for highlighting this important methodological consideration. We confirm that feature selection and hyperparameter optimization were performed exclusively within the training folds during cross-validation.

5) Reviewer: ROC curves should be generated from probability scores, not predicted class labels. Please ensure this. The curves look great, but please check, as we could not find the decision functions for the ROC curves in the GitHub code. Similarly, best_model is never used in the code, and GridSearchCV is applied only to the SVM-RBF. The manuscript’s claim that all classifiers were similarly hyperparameter-tuned remains unverified in the code. Please verify again.
Author: Thank you for this important observation. We have carefully reviewed the implementation and verified that the ROC curves were generated using predicted probability scores (or decision function outputs where applicable), rather than predicted class labels. We have clarified this in the revised manuscript.
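
The distinction raised in point 5 can be illustrated with a minimal sketch. It is not the authors' repository code: it assumes scikit-learn's bundled copy of the Wisconsin diagnostic data as a stand-in for the Kaggle CSV, and an untuned RBF SVC. It compares a ROC curve built from `decision_function` scores with one built from hard class labels.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for 'cancer_data.csv'.
# Here 0 = malignant and 1 = benign, the reverse of the LabelEncoder mapping.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42
)

model = make_pipeline(RobustScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)

# Continuous scores: signed distance of each sample from the decision surface.
scores = model.decision_function(X_test)
# Hard labels: only 0 or 1, so the "curve" collapses to one operating point.
labels = model.predict(X_test)

fpr_scores, tpr_scores, _ = roc_curve(y_test, scores)
fpr_labels, tpr_labels, _ = roc_curve(y_test, labels)

auc_from_scores = roc_auc_score(y_test, scores)
auc_from_labels = roc_auc_score(y_test, labels)

print(f"ROC points from scores: {len(fpr_scores)}, AUC = {auc_from_scores:.3f}")
print(f"ROC points from labels: {len(fpr_labels)}, AUC = {auc_from_labels:.3f}")
```

For classifiers without a `decision_function`, such as Random Forest, the positive-class column of `predict_proba` plays the same role.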

Competing Interests

No competing interests were disclosed.

Reviewer Report

03 Jun 2026 | for Version 5

Manna Debnath, Department of Medical Imaging Technology, Bapubhai Desaibhai Patel Institute of Paramedical Sciences, Charotar University of Science and Technology, Changa, Gujarat, India

Approved With Reservations

In the Abstract under the Methods section, the authors mentioned that the dataset comprised 569 samples with 32 features, whereas in the Methods section (Page 4, Section 2.2: Data source and inclusion criteria), it is stated that the dataset consisted of 569 records and 33 features. Please verify the dataset and maintain consistency in reporting.
In the introduction, add one or two points about the limitations of the previous study, which will show the gap that the present study addresses.
In Section 2.2 (Data source and inclusion criteria), the authors utilized the Wisconsin Breast Cancer Diagnostic dataset available on Kaggle; however, the dataset does not contain patient age information. Please verify the statement.
On Page 5, the authors reported that the dataset consisted of 59% malignant (M) cases and 41% benign (B) cases. However, Figure 1 does not appear to reflect this distribution accurately. Specifically, the bar representing B (41%) appears visually higher than M (59%), which contradicts the textual description. Please verify the dataset distribution and revise the figure.
Add strengths and limitations of the study at the end of the discussion section.

Is the work clearly and accurately presented and does it cite the current literature? Yes
Is the study design appropriate and is the work technically sound? Yes
Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions drawn adequately supported by the results? Yes

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Medical Radiology and Imaging Technology

Responses (1)

Author Response

23 Jun 2026

Dola Saha, Department of Health Information Management, Manipal College of Health Professions, Manipal Academy of Higher Education, Manipal, 576104, India

1.
Reviewer Comments:- In the Abstract under the Methods section, the authors mentioned that the dataset comprised 569 samples with 32 features, whereas in the Methods section (Page 4, Section 2.2: Data source and inclusion criteria), it is stated that the dataset consisted of 569 records and 33 features. Please verify the dataset and maintain consistency in reporting.

Author Response: - We thank the reviewer for identifying this inconsistency. The dataset information was verified and corrected. The manuscript now consistently reports the dataset characteristics in both the Abstract and Methods sections.

2.
Reviewer Comments: - In the introduction, add one or two points about the limitations of the previous study, which will show the gap that the present study addresses.

Author Response: - We thank the reviewer for this valuable suggestion. Accordingly, we have revised the Introduction to include the limitations of previous machine learning studies.

3.
Reviewer Comments: - In Section 2.2 (Data source and inclusion criteria), the authors utilized the Wisconsin Breast Cancer Diagnostic dataset available on Kaggle; however, the dataset does not contain patient age information. Please verify the statement.

Author Response: - We thank the reviewer for identifying this issue. Upon verification, we confirm that the Wisconsin Breast Cancer Diagnostic dataset does not contain patient age information. The statement has been corrected in the revised manuscript, and any reference to age as a dataset variable has been removed to ensure accuracy and consistency.

4.
Reviewer Comments: - On Page 5, the authors reported that the dataset consisted of 59% malignant (M) cases and 41% benign (B) cases. However, Figure 1 does not appear to reflect this distribution accurately. Specifically, the bar representing B (41%) appears visually higher than M (59%), which contradicts the textual description. Please verify the dataset distribution and revise the figure.

Author Response: - We thank the reviewer for identifying this error. Upon verification of the dataset, we found that the percentages for malignant and benign cases had been incorrectly reported. The manuscript has been revised to reflect the correct distribution of the target variable, and the text has been aligned.

5.
Reviewer Comments: - Add strengths and limitations of the study at the end of the discussion section.

Author Response: - We thank the reviewer for this valuable suggestion. Accordingly, a dedicated paragraph outlining the strengths and limitations of the study has been added at the end of the Discussion section to provide a balanced interpretation of the findings and highlight areas for future research.

Competing Interests

Nil

Reviewer Report

02 Jun 2026 | for Version 5

Musatafa Abbas Abbood Albadr, Basrah University for Oil and Gas, Al Basrah, Iraq

Approved With Reservations

Is the work clearly and accurately presented and does it cite the current literature? No
Is the study design appropriate and is the work technically sound? Partly
Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? No
Are all the source data underlying the results available to ensure full reproducibility? Partly
Are the conclusions drawn adequately supported by the results? No

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Machine learning, artificial neural networks, deep learning, optimization, speech processing, healthcare technologies, image processing, and steganography techniques.

Responses (1)

Author Response

23 Jun 2026

Dola Saha, Department of Health Information Management, Manipal College of Health Professions, Manipal Academy of Higher Education, Manipal, 576104, India

Reviewer Comments-1

1.
Reviewer Comments: - Numerous facts mentioned in the manuscript need to be supported with recent references; please check the whole manuscript.

Author Response: - Thank you for the suggestion. Changes have been made accordingly in the manuscript.

2.
Reviewer Comments: I advise the authors to structure “1. Introduction” as follows: start with a brief introduction to breast cancer, then discuss the effectiveness of ML and DL techniques in health care generally and in breast cancer detection specifically, and examine in depth the issues that the present work aims to tackle (supported by recent references). Subsequently, highlight the strengths of the proposed techniques and justify why they are proposed for breast cancer detection. Conclude with the main goals of the current work presented as bullet points. Immediately following the bullet points, outline the organization of the paper.

Author Response: - Thank you for the valuable suggestion. The Introduction has been revised to improve its structure, incorporate recent literature, highlight the rationale for the proposed classifiers, and clearly state the study objectives and manuscript organisation.

3.
Reviewer Comments: The authors need to add a related work section and discuss each related work in depth. In addition, they should summarize the related works, together with their weaknesses, in a table. The table can be structured as follows:
First column: Reference; second column: Dataset; third column: Feature Extraction; fourth column: Model; fifth column: Results; sixth column: Weaknesses.

Author Response: Thank you for the suggestion. A comparative table summarizing related studies and their limitations has been added, along with additional discussion to position the present work within the existing literature.

4.
Reviewer Comments: The authors need to evaluate their proposed work using additional evaluation measures such as G-Mean and MCC.

Author Response: Thank you for this valuable suggestion. We have included two additional performance measures, namely Geometric Mean (G-Mean) and Matthews Correlation Coefficient (MCC), to provide a more comprehensive assessment of classifier performance. These metrics are particularly useful in evaluating classification models where balanced performance across classes is important.

5.
Reviewer Comments: The authors need to add the equations for the evaluation measures together with their references.

Author Response: Thank you for the suggestion. The mathematical equations for accuracy, precision, sensitivity, specificity, and F1-score have been added to the Performance Evaluation section along with appropriate references.

6.
Reviewer Comments: In “3. Results” section, the authors need to provide the description of the environment that they have conducted their experiments on.

Author Response: Thank you for the suggestion. Details of the experimental environment, including software tools, hardware specifications, and computing platform, have been added to the manuscript to improve the reproducibility of the study.

7.
Reviewer Comments: The authors are required to statistically evaluate their proposed work based on Mean, RMSE, and STD of the evaluation measurements in order to make sure that the achieved results were not by accident or chance.

Author Response: Thank you for the suggestion. Mean Accuracy, Standard Deviation (STD), and RMSE obtained through five-fold cross-validation have been added to the Results section to further validate the robustness and reliability of the proposed classifiers.

8.
Reviewer Comments: The authors must add an explanation of the statistical evaluation measures together with their equations.
For guidance on conducting, reporting, plotting, and discussing the statistical outcomes, the authors can consult the following paper: “Albadr, M. A. A., Ayob, M., Tiun, S., AL-Dhief, F. T., Arram, A., & Khalaf, S. (2023). Breast cancer diagnosis using the fast-learning network algorithm. Frontiers in Oncology, 13, 1150840”.

Author Response: Thank you for the suggestion. Explanations and equations for the evaluation metrics, including accuracy, precision, sensitivity, specificity, F1-score, and RMSE, have been added. The statistical outcomes from five-fold cross-validation have also been reported and discussed.

9.
Reviewer Comments: The authors are required to compare the performance of the proposed work against some recent studies that have used the same dataset in terms of accuracy.

Author Response: Thank you for the suggestion. A direct comparison with studies that utilized the Wisconsin Breast Cancer Dataset has been added to the Discussion section. The classification accuracy achieved by the proposed model is compared with previously reported results to highlight its relative performance.

10.
Reviewer Comments: The authors must write the conclusion in a more effective way, together with the limitations of the current work and realistic future work.

Author Response: Thank you for the valuable suggestion. The Conclusion section has been revised to provide a clearer summary of the main findings of the study. In addition, the key limitations of the current work and realistic future research directions have been explicitly discussed.
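
For the two additional measures named in response 4, a minimal sketch of the standard definitions is given here. The confusion-matrix counts are hypothetical placeholders, not values from the manuscript's tables: G-Mean is the square root of sensitivity times specificity, and MCC is (TP·TN − FP·FN) divided by the square root of (TP+FP)(TP+FN)(TN+FP)(TN+FN).

```python
import math


def g_mean(tn, fp, fn, tp):
    """Geometric mean of sensitivity (recall) and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


def mcc(tn, fp, fn, tp):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for illustration only.
tn, fp, fn, tp = 140, 4, 6, 78
print(f"G-Mean = {g_mean(tn, fp, fn, tp):.3f}")
print(f"MCC    = {mcc(tn, fp, fn, tp):.3f}")
```

Both measures use all four cells of the confusion matrix, which is why they are less sensitive to class imbalance than accuracy alone.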

Competing Interests

Nil

Reviewer Report

24 Sep 2025 | for Version 4

Abicumaran Uthamacumaran, McGill University (Ringgold ID: 5620), Montréal, Québec, Canada

Not Approved

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Cancer Research; Artificial Intelligence; Systems medicine

Responses (1)

Author Response

09 May 2026

Dola Saha, Department of Health Information Management, Manipal College of Health Professions, Manipal Academy of Higher Education, Manipal, 576104, India

We thank the reviewer for this valuable observation. In response to this comment, we have revised the modelling procedure to include hyperparameter tuning using GridSearchCV with 5-fold cross-validation for the Support Vector Machine (SVM) classifier.

The following hyperparameters were explored during tuning:

C: [0.1, 1, 10, 100]

gamma: ['scale', 0.01, 0.001]

kernel: ['rbf']

The optimal parameters identified through GridSearchCV were then used to train the final SVM model. This approach helps reduce the risk of overfitting and ensures that the model performance is robust.
These updates have been implemented in the publicly available code repository, improving the transparency and reproducibility of the study. Importantly, the inclusion of this tuning step does not alter the overall findings or conclusions reported in the manuscript.
The revised code is available at: https://github.com/rinsyrahman/breast-cancer-ml-analysis
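
A minimal sketch of the tuning step described above follows. It is not the repository code: it assumes scikit-learn's bundled copy of the dataset in place of the Kaggle CSV and wraps the scaler and classifier in a `Pipeline` (an assumption, chosen so that the scaler is refitted on the training portion of every fold). The parameter grid is the one listed in the response.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for 'cancer_data.csv'.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold fits it on training data only.
pipeline = Pipeline([("scaler", RobustScaler()), ("svc", SVC())])

# Grid taken from the response; the "svc__" prefix routes values to the SVC step.
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": ["scale", 0.01, 0.001],
    "svc__kernel": ["rbf"],
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

# The refitted best pipeline is the model that should be evaluated on the test set.
best_model = search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)

print("Best parameters:", search.best_params_)
print(f"Mean CV accuracy: {search.best_score_:.3f}")
print(f"Held-out test accuracy: {test_accuracy:.3f}")
```

Evaluating `best_model`, rather than a separately constructed classifier, keeps the reported test metrics tied to the tuned parameters.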

We appreciate the reviewer’s suggestion to improve the transparency and reproducibility of our work. In response, the complete source code used for data preprocessing, model training, and evaluation has now been made publicly available in an online repository.

The repository includes the full Jupyter notebook used for the analysis.

Repository link: https://github.com/rinsyrahman/breast-cancer-ml-analysis

We thank the reviewer for this helpful clarification. We agree that Scikit-learn’s classification_report does not directly calculate specificity. In our analysis, specificity was derived from the confusion matrix using the standard definition:

Specificity = TN / (TN + FP)

where TN represents true negatives and FP represents false positives.
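
As a worked check of this definition, the short sketch below reads TN and FP from a 2×2 confusion matrix in scikit-learn's layout, [[TN, FP], [FN, TP]]. The TN and FP counts are the logistic regression figures quoted in the reviewer's report on Version 5; the FN and TP counts are hypothetical fillers and do not affect specificity.

```python
def specificity_from_confusion(cm):
    """Return TN / (TN + FP) for cm = [[TN, FP], [FN, TP]] (labels ordered 0, 1)."""
    (tn, fp), _ = cm
    return tn / (tn + fp)


# TN = 83 and FP = 2 are quoted in the review; FN and TP are hypothetical.
cm = [[83, 2], [3, 55]]
spec = specificity_from_confusion(cm)
print(f"specificity = {spec:.2f}")  # prints "specificity = 0.98"
```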

Competing Interests

Nil

Reviewer Report

19 Sep 2025 | for Version 4

Chandrakanta Mahanty, GITAM Deemed to Be University, Visakhapatnam, India

Approved

Authors have modified the article as per review comments. Index the article.

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Deep Learning, machine learning

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer Report

22 Aug 2025 | for Version 3

Chandrakanta Mahanty, GITAM Deemed to Be University, Visakhapatnam, India

Approved With Reservations

Has the author performed any statistical tests, such as confidence intervals or p-values, to validate the feature separability beyond visualizations like violin and box plots, and if not, why were such statistical measures omitted?
Given the dataset's class imbalance, why did the author primarily rely on accuracy as a performance metric instead of placing more emphasis on metrics like FNR (False Negative Rate), FOR (False Omission Rate), and AUC, which are critical in medical diagnostics?
What is the mathematical rationale behind selecting Robust Scaler for feature normalization over other techniques like Min-Max or Standard Scaler, especially in the context of preserving inter-feature relationships in high-dimensional medical data?
While hyperparameters were tuned using GridSearchCV with 5-fold cross-validation, can the author mathematically justify the choice of these specific folds and parameter ranges, and were any metrics like mean and standard deviation of folds reported for robustness?
Can the author clarify whether any dimensionality reduction techniques such as PCA or t-SNE were considered to improve feature interpretability and computational efficiency, and if not, what were the limitations in incorporating them?
How did the author ensure that multicollinearity among features did not bias the predictive performance of the models, especially given the observed strong correlations (e.g., 0.86 between ‘concavity worst’ and ‘concave points worst’)?
Could the author mathematically explain why the SVC-RBF model, despite being a black-box algorithm, consistently outperformed simpler interpretable models like Logistic Regression, and how does the RBF kernel mathematically handle non-linear separability in the given dataset?
What quantitative measures support the claim that texture mean and area (worst) are the most discriminative features, and were feature importance techniques like SHAP or mutual information scores used to validate their significance?
Given that decision trees showed perfect training accuracy but the lowest testing accuracy, can the author provide further statistical evidence or overfitting diagnostics, such as learning curves or variance-bias analysis, to support this conclusion?
Can the author elaborate on the clinical implications of the model’s AUC score of 0.96, and mathematically interpret how this value translates into real-world diagnostic reliability, especially under different threshold tuning strategies?

Is the work clearly and accurately presented and does it cite the current literature? Yes
Is the study design appropriate and is the work technically sound? Yes
Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions drawn adequately supported by the results? Yes

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Deep Learning, machine learning

Responses (1)

Author Response

10 Sep 2025

Dola Saha, Department of Health Information Management, Manipal College of Health Professions, Manipal Academy of Higher Education, Manipal, 576104, India

Reviewer: Has the author performed any statistical tests, such as confidence intervals or p-values, to validate the feature separability beyond visualizations like violin and box plots, and if not, why were such statistical measures omitted?
Author: We thank the reviewer for the comment. Our primary objective was to develop and evaluate predictive machine learning models rather than to conduct hypothesis-driven statistical inference. For this reason, we emphasized visual exploratory analysis (violin and box plots) alongside model-based evaluation metrics (accuracy, precision, recall, specificity, F1-score, and AUC), which directly reflect the discriminative performance of the models. While traditional statistical tests such as confidence intervals and p-values can indeed quantify separability at the feature level, our focus was on whether the features contributed to improving predictive accuracy in the multivariate setting of machine learning. We acknowledge this as a limitation and have added it to the manuscript.

Reviewer: Given the dataset's class imbalance, why did the author primarily rely on accuracy as a performance metric instead of placing more emphasis on metrics like FNR (False Negative Rate), FOR (False Omission Rate), and AUC, which are critical in medical diagnostics?
Author: We thank the reviewer for the observation. Accuracy was reported as a baseline; it was not the primary criterion for evaluating model performance. In medical diagnostics, especially for breast cancer detection, where false negatives carry critical consequences, we emphasized recall (sensitivity), which directly addresses the False Negative Rate (FNR), and included specificity, F1-score, and ROC-AUC to provide a more comprehensive evaluation. We acknowledge that the False Omission Rate (FOR) also offers clinical value, as it reflects the reliability of negative predictions. Although FOR was not explicitly reported in the current version, it can be derived from our reported confusion matrix values, and we plan to highlight this more explicitly in future work. We have clarified this in the limitations section of the manuscript.
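
The derivation mentioned in this response can be sketched in a few lines. The counts are hypothetical placeholders rather than values from the manuscript: FNR = FN / (FN + TP), which equals 1 minus recall, and FOR = FN / (FN + TN), the share of negative predictions that are wrong.

```python
def false_negative_rate(fn, tp):
    """Share of actual positives that the model missed: FN / (FN + TP)."""
    return fn / (fn + tp)


def false_omission_rate(fn, tn):
    """Share of negative predictions that are actually positive: FN / (FN + TN)."""
    return fn / (fn + tn)


# Hypothetical counts for illustration only.
tn, fp, fn, tp = 140, 4, 6, 78
fnr = false_negative_rate(fn, tp)
f_or = false_omission_rate(fn, tn)
recall = tp / (tp + fn)

print(f"FNR = {fnr:.3f} (1 - recall = {1 - recall:.3f})")
print(f"FOR = {f_or:.3f}")
```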

Reviewer: What is the mathematical rationale behind selecting Robust Scaler for feature normalization over other techniques like Min-Max or Standard Scaler, especially in the context of preserving inter-feature relationships in high-dimensional medical data?
Author: We selected the Robust Scaler because medical datasets frequently contain outliers and non-Gaussian feature distributions, which can distort scaling when using Min-Max or Standard Scalers. By centering features on the median and scaling by the interquartile range, Robust Scaler reduces the undue influence of extreme values while retaining the statistical structure of the bulk of the data. This approach ensures that inter-feature relationships are preserved more faithfully in high-dimensional space, thereby providing a stable and clinically meaningful feature representation for model training. Changes made in the Methods section.
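
The effect described here can be seen on a toy column. The sketch below uses hypothetical values with one extreme outlier and reimplements the two transformations in NumPy: (x − median) / IQR for robust scaling and (x − mean) / standard deviation for standard scaling. The helper names are illustrative and are not part of the authors' code.

```python
import numpy as np


def robust_scale(column):
    """Center on the median and divide by the interquartile range (25th to 75th percentile)."""
    median = np.median(column)
    q1, q3 = np.percentile(column, [25, 75])
    return (column - median) / (q3 - q1)


def standard_scale(column):
    """Center on the mean and divide by the standard deviation."""
    return (column - column.mean()) / column.std()


# Hypothetical feature values; the last entry is an extreme outlier.
values = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 100.0])

robust = robust_scale(values)
standard = standard_scale(values)

# Spread of the five typical values after each transformation.
robust_spread = robust[:5].max() - robust[:5].min()
standard_spread = standard[:5].max() - standard[:5].min()

print("Robust scaled:  ", np.round(robust, 2))
print("Standard scaled:", np.round(standard, 2))
print(f"Bulk spread: robust = {robust_spread:.2f}, standard = {standard_spread:.2f}")
```

Because the outlier inflates the mean and standard deviation, standard scaling compresses the typical values into a narrow band, whereas the median and IQR are largely unaffected by it.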

Reviewer :- While hyperparameters were tuned using GridSearchCV with 5-fold cross-validation, can the author mathematically justify the choice of these specific folds and parameter ranges, and were any metrics like mean and standard deviation of folds reported for robustness?
Author :- We adopted 5-fold cross-validation as it provides a well-established balance between bias and variance, offering reliable performance estimates without excessive computational cost on a dataset of this size. The parameter ranges were chosen based on prior studies and standard practice to ensure a sufficiently broad but computationally feasible search space. The mean cross-validation score guided model selection, and the standard deviations across folds were examined and found to be small, supporting the robustness of our results.
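
A minimal sketch of how the mean and standard deviation across folds can be read from a grid search is given below. It assumes scikit-learn's bundled copy of the Wisconsin Diagnostic dataset, a hypothetical parameter grid, and recall for the malignant class as the selection metric; none of these choices is confirmed as the study's exact configuration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
# In scikit-learn's copy, 0 = malignant and 1 = benign; recode so malignant is the positive class
y_malignant = (y == 0).astype(int)

# Scaling inside the pipeline is refit on each training fold, avoiding leakage into validation folds
pipe = make_pipeline(RobustScaler(), SVC(kernel="rbf"))

# Hypothetical search space, for illustration only
param_grid = {"svc__C": [0.1, 1.0, 10.0], "svc__gamma": ["scale", 0.01]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="recall")
search.fit(X, y_malignant)

# Report mean and standard deviation of the validation score for every parameter combination
results = search.cv_results_
for params, mean, std in zip(results["params"], results["mean_test_score"], results["std_test_score"]):
    print(f"{params}: recall = {mean:.3f} +/- {std:.3f}")
print("Best parameters:", search.best_params_)
```

Reporting the per-combination standard deviation alongside the mean makes the robustness claim checkable by readers.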

Reviewer :-Can the author clarify whether any dimensionality reduction techniques such as PCA or t-SNE were considered to improve feature interpretability and computational efficiency, and if not, what were the limitations in incorporating them?
Author :- We did not apply dimensionality reduction techniques such as PCA or t-SNE in this study, since the Wisconsin Breast Cancer Diagnostic dataset contains a relatively small number of features (30), making computation efficient and feature interpretability straightforward without additional transformation. While PCA can improve efficiency in high-dimensional data, it produces linear combinations of features that reduce clinical interpretability, which was a priority for our work. Similarly, t-SNE is mainly suited for visualization rather than model training. Changes made in the Method section.

Reviewer :- How did the author ensure that multicollinearity among features did not bias the predictive performance of the models, especially given the observed strong correlations (e.g., 0.86 between ‘concavity worst’ and ‘concave points worst’)?
Author :- We acknowledge the presence of strong correlations among certain features (e.g., 0.86 between concavity worst and concave points worst). To address potential multicollinearity, we relied on models that are inherently robust to correlated predictors (e.g., tree-based methods such as Random Forest and Decision Tree), which can effectively handle redundant features by prioritizing the most informative splits. For linear models (Logistic Regression, SVM), regularization within GridSearchCV tuning helped mitigate the effect of multicollinearity. Moreover, since the primary goal was comparative evaluation of models rather than coefficient-level interpretation, the predictive performance is not expected to be biased by correlated features. Changes made in the Method section.
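
One simple way to screen for strongly correlated feature pairs is sketched below. The helper `correlated_pairs`, the 0.8 threshold, and the synthetic demonstration data are assumptions for illustration; they do not reproduce the study's correlation analysis.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.8):
    """List feature pairs whose absolute Pearson correlation meets the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is reported once and self-correlations are skipped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs >= threshold].sort_values(ascending=False)


# Synthetic demonstration: 'b' is a noisy copy of 'a', while 'c' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
pairs = correlated_pairs(demo, threshold=0.8)
print(pairs)
```

Applied to the scaled feature table, such a screen would flag pairs like the one quoted by the reviewer so that their effect on linear models can be examined separately.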

Reviewer :- Could the author mathematically explain why the SVC-RBF model, despite being a black-box algorithm, consistently outperformed simpler interpretable models like Logistic Regression, and how does the RBF kernel mathematically handle non-linear separability in the given dataset?
Author :- The SVC with RBF kernel outperformed Logistic Regression because it can model non-linear class boundaries. While Logistic Regression fits a linear decision boundary in the input features, the RBF kernel

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$$

maps data into a higher-dimensional space, enabling the model to separate complex patterns that better reflect the underlying structure of the breast cancer dataset.
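
The kernel above can be evaluated directly, as in the following minimal sketch. The helper name `rbf`, the three sample points, and the value of gamma are assumptions chosen for illustration; the result is compared with scikit-learn's `rbf_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel


def rbf(X, Y, gamma):
    """Compute K[i, j] = exp(-gamma * ||X[i] - Y[j]||^2) for all pairs of rows."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)


# Three illustrative points in two dimensions
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
K = rbf(X, X, gamma=0.5)
K_reference = rbf_kernel(X, X, gamma=0.5)
print(np.round(K, 4))
```

Similarity equals one for identical points and decays towards zero with squared distance, so gamma controls how local the resulting decision boundary is.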

Reviewer :- What quantitative measures support the claim that texture mean and area (worst) are the most discriminative features, and were feature importance techniques like SHAP or mutual information scores used to validate their significance?
Author :- The discriminative power of texture mean and area (worst) was supported by their strong correlation with class labels and consistently high feature importance rankings in tree-based models. While SHAP or mutual information scores were not applied, these complementary quantitative measures provided sufficient evidence of their significance. Changes made in the Results section.

Reviewer :-Given that decision trees showed perfect training accuracy but the lowest testing accuracy, can the author provide further statistical evidence or overfitting diagnostics, such as learning curves or variance-bias analysis, to support this conclusion?
Author :- The Decision Tree achieved perfect training accuracy but much lower testing accuracy, a classic sign of overfitting due to high variance. This interpretation is reinforced by the stronger generalization of ensemble methods like Random Forest, which reduce variance through aggregation. Changes added in the Limitations section.
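
A basic overfitting diagnostic is the gap between training and testing accuracy. The sketch below compares an unconstrained tree, a depth-limited tree, and a random forest; it assumes scikit-learn's bundled copy of the dataset and a stratified 60:40 split, and the depth limit of 3 is an arbitrary illustrative choice, so the numbers will not match the manuscript exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
# 60:40 split as in the study; stratification is an added assumption
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)

models = {
    "Decision Tree (unconstrained)": DecisionTreeClassifier(random_state=42),
    "Decision Tree (max_depth=3)": DecisionTreeClassifier(max_depth=3, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Store (training accuracy, testing accuracy) per model and print the gap between them
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    results[name] = (train_acc, test_acc)
    print(f"{name}: train = {train_acc:.3f}, test = {test_acc:.3f}, gap = {train_acc - test_acc:.3f}")
```

A large gap for the unconstrained tree relative to the other two models would support the high-variance interpretation given in the response.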

Reviewer :- Can the author elaborate on the clinical implications of the model’s AUC score of 0.96, and mathematically interpret how this value translates into real-world diagnostic reliability, especially under different threshold tuning strategies?
Author :- We thank the reviewer for this valuable comment. We have clarified in the Discussion that an AUC of 0.96 indicates excellent discriminative ability of the model; mathematically, the model has a 96% probability of ranking a randomly chosen malignant case above a randomly chosen benign case. Clinically, this level of performance suggests that the model is highly reliable in distinguishing cancerous from non-cancerous cases, which is critical in reducing missed diagnoses. Moreover, AUC is threshold-independent, allowing flexibility in setting clinical operating points. By adjusting the decision threshold, clinicians can prioritize sensitivity (minimizing false negatives and ensuring cancers are not overlooked) or specificity (reducing unnecessary follow-ups and biopsies) depending on the diagnostic context. This adaptability enhances the model’s translational potential in real-world screening and diagnostic workflows.
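
The ranking interpretation of AUC and the effect of moving the decision threshold can be checked on a small example. In the sketch below, the helper names `pairwise_auc` and `sensitivity_specificity` and the six labelled scores are hypothetical; `roc_auc_score` from scikit-learn is used only as a cross-check.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def pairwise_auc(y_true, scores):
    """Fraction of (malignant, benign) pairs in which the malignant case scores higher."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    # Ties count as half a correctly ranked pair
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size


def sensitivity_specificity(y_true, scores, threshold):
    """Sensitivity and specificity when scores >= threshold are called malignant."""
    y_true = np.asarray(y_true)
    predicted = np.asarray(scores) >= threshold
    tp = np.sum(predicted & (y_true == 1))
    fn = np.sum(~predicted & (y_true == 1))
    tn = np.sum(~predicted & (y_true == 0))
    fp = np.sum(predicted & (y_true == 0))
    return tp / (tp + fn), tn / (tn + fp)


# Illustrative labels (1 = malignant) and model scores, not taken from the study
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

print("Pairwise AUC:", pairwise_auc(y_true, scores))
print("roc_auc_score:", roc_auc_score(y_true, scores))
for threshold in (0.5, 0.3):
    sens, spec = sensitivity_specificity(y_true, scores, threshold)
    print(f"threshold = {threshold}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Lowering the threshold in this toy example raises sensitivity at the cost of specificity, while the AUC itself is unchanged because it depends only on the ordering of the scores.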

Reviewer :-

Mahanty C, Rajesh T, Govil N, Venkateswarulu N, et al.: Effective Alzheimer’s disease detection using enhanced Xception blending with snapshot ensemble. Scientific Reports. 2024; 14 (1).
Malik S, Patro S, Mahanty C, Lasisi A, et al.: Hybrid raven roosting intelligence framework for enhancing efficiency in data clustering. Scientific Reports. 2024; 14 (1).
Altameem A, Mahanty C, Poonia R, Saudagar A, et al.: Breast Cancer Detection in Mammography Images Using Deep Convolutional Neural Networks and Fuzzy Ensemble Modeling Techniques. Diagnostics. 2022; 12 (8).

Author :- We thank the reviewer for the valuable suggestion. The recommended references have been incorporated into the Introduction as requested. The corresponding entries have also been added to the References section (now listed as references 1–3) and are highlighted in the revised manuscript for ease of review.

Competing Interests

No competing interests were disclosed.

Reviewer Report

19 Views

03 Jun 2025 | for Version 3

Rolando Gonzales Martinez, University of Groningen, Groningen, The Netherlands

Not Approved

References

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Machine learning and deep learning applied to health

Responses (1)

Author Response

10 Sep 2025

Dola Saha, Department of Health Information Management, Manipal College of Health Professions, Manipal Academy of Higher Education, Manipal, 576104, India

Response to Reviewer Comments:

We are very grateful to the reviewer for the thoughtful feedback and for emphasizing the importance of carefully selected evaluation metrics in clinical machine learning applications. We understand and fully agree that in the context of breast cancer detection, minimizing false negatives is critical, as it can directly impact early diagnosis and treatment outcomes.

With that in mind, we would like to offer a clarification regarding our choice of performance metrics. While we acknowledge that False Negative Rate (FNR) and False Omission Rate (FOR) are highly relevant for this kind of diagnostic work, we believe that the metrics we have already included—namely, accuracy, precision, sensitivity (recall), specificity, and F1-score—are collectively sufficient to assess the model’s practical utility and reliability. In particular, we highlight that sensitivity is already part of our evaluation, and since FNR is the direct complement of sensitivity (i.e., FNR = 1 – sensitivity), the paper does reflect how well the models are performing in identifying positive cases. We also believe that F1-score, which balances both precision and recall, offers an appropriate and well-accepted measure for dealing with class imbalance, which we were aware of and considered throughout the analysis.

As for the suggestion to include FOR, we appreciate its relevance. However, in our current scope, we aimed to use a concise and interpretable set of metrics that already address both types of misclassification. Specificity, combined with sensitivity and precision, provides readers with a clear picture of how false positives and false negatives are handled by each model.

Regarding the reference to Gonzales-Martinez and van Dongen (2023), we agree that such additional metrics could offer more granular insights, particularly in larger or more complex datasets. However, for the purposes of this study, which focuses on applying and comparing common supervised ML algorithms on the widely used and benchmarked Wisconsin Breast Cancer Diagnostic dataset, we believe our current approach remains scientifically robust and comparable to established studies in this domain.

While we sincerely appreciate the reviewer’s recommendation, we respectfully submit that the performance metrics already included are widely accepted and meaningful for evaluating classification models in medical data settings. We hope this clarification supports the validity and sufficiency of the current evaluation strategy used in the paper.

Competing Interests

No competing interests whatsoever to declare.

Reviewer Report

15 Views

12 May 2025 | for Version 2

Abicumaran Uthamacumaran, McGill University (Ringgold ID: 5620), Montréal, Québec, Canada

Approved With Reservations

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

AI, Machine learning, Bioinformatics, and Systems Oncology

Responses (1)

Author Response

16 Jun 2025

Dola Saha, Department of Health Information Management, Manipal College of Health Professions, Manipal Academy of Higher Education, Manipal, 576104, India

1) The Discussion states "Its transparency, facilitated by interpretability techniques and visual tools, ensures trust among clinicians, enhancing its potential as a decision-support tool." This is misleading. SVC-RBF in itself is a 'black box' ML approach. It is not easily 'explainable' or interpretable. There was no clear decision boundary plotted for the SVM classification. Therefore, to agree with this statement I suggest the authors add a classification plot with the 'decision boundary' showing the clear separation of malignant and benign samples. Or some other 'interpretability technique' should be offered (e.g., Gini entropy, feature importance, salience maps, etc.).
Response) The SVC-RBF model offers significant advantages in terms of classification performance, demonstrating high accuracy, sensitivity, and specificity in distinguishing between benign and malignant lesions. While the model operates as a black-box algorithm with limited inherent interpretability, its strong predictive capability makes it a valuable candidate for decision-support applications in clinical settings. To enhance clinician trust and eventual translatability, future work will focus on integrating model-agnostic interpretability techniques, such as SHAP values or feature attribution methods, to improve transparency and support clinical decision-making.

2) The results over-emphasize accuracy as the primary metric. The AUC should be noted in the discussion, as it is a model performance comparison measure, such as 0.96 for the RBF-SVC. Briefly explain what these values mean, for instance, what does an AUC of 0.96 indicate about that algorithm on this dataset.
Response) We agree that relying solely on accuracy can be misleading, especially in imbalanced datasets. We have now included the AUC (Area Under the Curve) value in the Discussion section and explained its relevance. Specifically, for the SVC-RBF model, an AUC of 0.96 indicates excellent discriminatory ability: a randomly chosen malignant case receives a higher model score than a randomly chosen benign case about 96% of the time, a summary that aggregates performance across all possible classification thresholds. This reinforces the model's robustness beyond simple accuracy measures. The revised discussion reflects this clarification.

3) The Methods can be made more transparent. For instance, what are the hyperparameters of the algorithms? How was the cross-validation performed? etc. Sections 2.3-2.5 can use some more details to ensure replicability. Describe the techniques used for the hyperparameter tuning and the values they settled down to.
Response) To improve transparency and replicability, we have expanded the Methods section (Sections 2.3–2.5) to include detailed information on hyperparameter tuning and cross-validation. Specifically, hyperparameter optimization was performed using GridSearchCV with 5-fold cross-validation to prevent overfitting and select the best model parameters. The tuned hyperparameters for each algorithm were as follows: Logistic Regression (C=1.0), Support Vector Classifiers (C=1.0 for linear kernel, and C=1.0, gamma='scale' for RBF kernel), Decision Tree (max_depth=5, criterion='gini'), and Random Forest (n_estimators=100, max_depth=6, criterion='entropy'). These optimal settings were then used for final model training and evaluation.
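
For replicability, the tuned settings listed in the response above can be written as scikit-learn estimators, as in this minimal sketch. The dictionary name `tuned_models`, the `random_state` values, and `max_iter=1000` for Logistic Regression are assumptions added for reproducibility and convergence; they are not stated in the response.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hyperparameters as reported in the response; random_state and max_iter are assumed additions
tuned_models = {
    "Logistic Regression": LogisticRegression(C=1.0, max_iter=1000),
    "SVC Linear": SVC(kernel="linear", C=1.0),
    "SVC RBF": SVC(kernel="rbf", C=1.0, gamma="scale"),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, criterion="gini", random_state=42),
    "Random Forest": RandomForestClassifier(
        n_estimators=100, max_depth=6, criterion="entropy", random_state=42
    ),
}

# Print the reported settings for each estimator
for name, model in tuned_models.items():
    params = model.get_params()
    shown = {k: params[k] for k in ("C", "gamma", "max_depth", "criterion", "n_estimators") if k in params}
    print(name, shown)
```

Each estimator can then be fitted on the scaled training data in the same way as the default models in the authors' listing.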

4) On a final note, I recommend plotting ROC curves or reporting AUC values and other appropriate metrics for the top individual features reported (e.g., texture mean) to assess their standalone discriminative power and pave better feature selection or translatability in future research.
Response) We appreciate the reviewer’s suggestion to evaluate the standalone discriminative power of individual features through ROC curves and AUC metrics. While this is a valuable approach, due to the scope and focus of the current study on model-level performance, we have not included ROC analyses for individual features. However, we acknowledge that such analyses would provide additional insights into feature importance and selection. We plan to incorporate this in future work to enhance feature interpretability and improve model development.
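
The per-feature analysis suggested by the reviewer could look like the following sketch, which scores each feature on its own by ROC-AUC. It assumes scikit-learn's bundled copy of the dataset, whose feature names (for example `mean texture`) differ from the CSV column names used in the authors' code; this is not an analysis reported in the manuscript.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score

data = load_breast_cancer()
# In scikit-learn's copy, 0 = malignant and 1 = benign; recode so malignant is the positive class
is_malignant = (data.target == 0).astype(int)

feature_auc = {}
for idx, name in enumerate(data.feature_names):
    auc = roc_auc_score(is_malignant, data.data[:, idx])
    # A feature that separates classes in the opposite direction is equally informative
    feature_auc[str(name)] = max(auc, 1 - auc)

# Show the five features with the highest standalone AUC
ranked = sorted(feature_auc.items(), key=lambda item: item[1], reverse=True)
for name, auc in ranked[:5]:
    print(f"{name}: AUC = {auc:.3f}")
```

A ranking of this kind gives a model-free view of discriminative power that can be compared with the tree-based importance scores cited by the authors.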

Competing Interests

Nil

Reviewer Report

9 Views

07 May 2025 | for Version 2

Rolando Gonzales Martinez, University of Groningen, Groningen, The Netherlands

Not Approved

References

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Machine learning and deep learning applied to health

Responses (0)

Reviewer Report

22 Views

11 Mar 2025 | for Version 1

Rolando Gonzales Martinez, University of Groningen, Groningen, The Netherlands

Not Approved

Is the work clearly and accurately presented and does it cite the current literature?

Partly
Is the study design appropriate and is the work technically sound?

Partly
Are sufficient details of methods and analysis provided to allow replication by others?

Yes
If applicable, is the statistical analysis and its interpretation appropriate?

Partly
Are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions drawn adequately supported by the results?

Yes

References

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Machine learning and deep learning applied to health

Responses (0)

Reviewer Report

20 Views

10 Mar 2025 | for Version 1

Abicumaran Uthamacumaran, McGill University (Ringgold ID: 5620), Montréal, Québec, Canada

Approved With Reservations

Is the work clearly and accurately presented and does it cite the current literature?

Partly
Is the study design appropriate and is the work technically sound?

Yes
Are sufficient details of methods and analysis provided to allow replication by others?

No
If applicable, is the statistical analysis and its interpretation appropriate?

Partly
Are all the source data underlying the results available to ensure full reproducibility?

Partly
Are the conclusions drawn adequately supported by the results?

Yes

References

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

AI, machine learning, bioinformatics, and precision oncology

Responses (1)

Author Response

24 Jun 2025

Dola Saha, Department of Health Information Management, Manipal College of Health Professions, Manipal Academy of Higher Education, Manipal, 576104, India

Sl. No. 1
Reviewer's comment: Literature Gaps: A few reviews on machine learning approaches in cancer diagnostics as emerging paradigm can greatly benefit readership and the rationale for your approach. Some examples are:
Hussain, S., et al., (2024). [Ref-1]
Uthamacumaran, A., et al., (2023). [Ref-2]
[Ref-1] Hussain S, Ali M, Naseem U, Nezhadmoghadam F, et al.: Breast cancer risk prediction using machine learning: a systematic review. Front Oncol. 2024; 14: 1343627.
[Ref-2] Uthamacumaran A, Abdouh M, Sengupta K, Gao Z, et al.: Machine intelligence-driven classification of cancer patients-derived extracellular vesicles using fluorescence correlation spectroscopy: results from a pilot study. Neural Computing and Applications. 2023; 35 (11): 8407-8422.
Authors' response: Thank you for your valuable comments. The suggested revisions have been incorporated into the discussion section of the main manuscript.

Sl. No. 2
Reviewer's comment: Limitations section is needed and should address the following:
The study relies solely on the Wisconsin dataset. Does this generalize to other datasets? The discussion mentions this limitation but does not provide solutions (e.g., external validation with larger datasets, multimodal imaging data). Either present a validation or argue for why this validation holds.
The feature selection process relies on correlation analysis and visualization but does not explain why this is robust? For instance, what would a PCA analysis do. See Uthamacumaran et al. above for such techniques in dimensionality reduction.
Interpretability of the SVC kernel choice. You said RBF is accurate but why? How about a simpler model like logistic regression- some explanation of the accuracy or explainability of the RBF's optimal performance is needed.
Authors' response: Thank you for your valuable comments. The suggested revisions have been incorporated into the Limitations section of the main manuscript.

Sl. No. 3
Reviewer's comment: In extension to point #2, the results are lacking validation ROC curves. There are no confusion matrices or ROC curves present. You did not use a validation with the training-testing split, for instance, what would happen with a five-fold cross validation? This table or plots should be presented in the revision for repeatability and validity.
Authors' response: Thank you for your valuable comments. The suggested revisions have been incorporated into the Results section of the main manuscript.

Sl. No. 4
Reviewer's comment: Some graphs (e.g., violin plots) are useful but lack quantitative annotations (e.g., exact p-values, confidence intervals for feature separability). Same for the first bar plot, there are no confidence intervals.
Authors' response: Thank you for your valuable comments. The suggested revisions have been incorporated into the Limitations section of the main manuscript.

Sl. No. 5
Reviewer's comment: The codes are not clear. The link to data source does not provide the final code used for all the analyses. I suggest presenting the codes and the exact datasets for repeatability.
Authors' response: Thank you for your valuable comments. The codes are given below in this document.

The code used in this study is presented below.
# Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import seaborn as sns
import matplotlib.pyplot as plt
# Loading the dataset
data = pd.read_csv('cancer_data.csv')
print(data.info())

# Preprocessing the dataset
# Dropping redundant columns ('Unnamed: 32' is an empty column present in some CSV exports)
data = data.drop(columns=['id', 'Unnamed: 32'], errors='ignore')

# Label encoding the target variable (diagnosis)
le = LabelEncoder()
data['diagnosis'] = le.fit_transform(data['diagnosis']) # Malignant (M): 1, Benign (B): 0

# Splitting features (X) and target variable (y)
X = data.drop('diagnosis', axis=1)
y = data['diagnosis']

# Train-test split (60:40 ratio)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
# Feature scaling using RobustScaler
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Visualizing key features (Violin plot for texture mean)
sns.violinplot(x='diagnosis', y='texture_mean', data=data)
plt.title('Violin Plot for Texture Mean vs Diagnosis')
plt.show()

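# Visualizing correlation (Joint plot for concavity worst vs concave points worst)
# Some CSV exports name this column 'concave points_worst' (with a space), so use whichever is present
cp_col = 'concave points_worst' if 'concave points_worst' in data.columns else 'concave_points_worst'
g = sns.jointplot(x='concavity_worst', y=cp_col, data=data, kind='scatter')
# jointplot creates its own figure, so the title is set on that figure rather than with plt.title
g.figure.suptitle('Joint Plot for Concavity Worst vs Concave Points Worst', y=1.02)
plt.show()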

# Model development and training
# Logistic Regression
log_reg = LogisticRegression()
log_reg.fit(X_train_scaled, y_train)
log_reg_pred = log_reg.predict(X_test_scaled)

# SVC - Linear Kernel
svc_linear = SVC(kernel='linear')
svc_linear.fit(X_train_scaled, y_train)
svc_linear_pred = svc_linear.predict(X_test_scaled)

# SVC - RBF Kernel
svc_rbf = SVC(kernel='rbf')
svc_rbf.fit(X_train_scaled, y_train)
svc_rbf_pred = svc_rbf.predict(X_test_scaled)

# Decision Tree Classifier
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train_scaled, y_train)
decision_tree_pred = decision_tree.predict(X_test_scaled)

# Random Forest Classifier
random_forest = RandomForestClassifier()
random_forest.fit(X_train_scaled, y_train)
random_forest_pred = random_forest.predict(X_test_scaled)

# Evaluating models
models = {
    "Logistic Regression": log_reg_pred,
    "SVC Linear": svc_linear_pred,
    "SVC RBF": svc_rbf_pred,
    "Decision Tree": decision_tree_pred,
    "Random Forest": random_forest_pred
}

for name, predictions in models.items():
    print(f"{name} Performance Metrics:")
    print(classification_report(y_test, predictions))
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, predictions))
    print("-" * 50)

Competing Interests

Nil

Alongside their report, reviewers assign a status to the article:

Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested

Approved with reservations - a number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.

Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions

[1] Cancer site ranking. n.d.

[2] Report of National Cancer Registry Programme, 2020. A scientific way to understand about Cancer. n.d.

[3] Global Breast Cancer Initiative Implementation Framework: Assessing, strengthening and scaling up services for the early detection and management of breast cancer. n.d.

[4] Harbeck N, Penault-Llorca F, Cortes J, et al.: Breast cancer. Nat. Rev. Dis. Primers. 2019; 5.

[5] Chen ZH, Lin L, Wu CF, et al.: Artificial intelligence for assisting cancer diagnosis and treatment in the era of precision medicine. Cancer Commun. 2021; 41: 1100–1115.

[6] Bhise S, Bepari S, Gadekar S, et al.: Breast Cancer Detection using Machine Learning Techniques. n.d.

[7] Luchini C, Pea A, Scarpa A: Artificial intelligence in oncology: current applications and future perspectives. Br. J. Cancer. 2022; 126: 4–9.

[8] Liu J, Lei J, Ou Y, et al.: Mammography diagnosis of breast cancer screening through machine learning: a systematic review and meta-analysis. Clin. Exp. Med. 2023; 23: 2341–2356.

[9] Vaka AR, Soni B, Sudheer Reddy K: Breast cancer detection by leveraging Machine Learning. ICT Express. 2020; 6: 320–324.

[10] Gupta G, Sharma M, Choudhary S, et al.: Performance Analysis of Machine Learning Classification Algorithms for Breast Cancer Diagnosis. 2021 9th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions), ICRITO 2021. Institute of Electrical and Electronics Engineers Inc.; 2021.

[11] Ganesan J, Krishnan V, Rathinavel T, et al.: Enhancing breast cancer detection accuracy through machine learning, deep learning and transfer learning techniques for clinical practice. Discov. Artif. Intell. 2026; 6.

[12] Dhivya P, Bazilabanu A, Ponniah T: Machine Learning Model for Breast Cancer Data Analysis Using Triplet Feature Selection Algorithm. IETE J. Res. 2023; 69: 1789–1799.

[13] Garba AT, Hamza HS: Interpretable Machine Learning Approach for Breast Cancer Classification. Hum. Centric Intell. Syst. 2025; 5: 308–322.

[14] Beghriche T, Brik Y, Djerioui M, et al.: A Multi-stage Optimization Architecture for Effective Breast Cancer Diagnosis Based on Deep Neural Networks. Arab. J. Sci. Eng. 2025; 50: 17943–17968.

[15] Handa S, Chatterjee M, Jan N, Mir MA: Chapter 17 - Applications of AI and machine learning in breast cancer diagnosis and treatment. In: Genetic Testing in Breast Cancer. Academic Press; 2026; pp. 337–355.

[16] Zeid MA-E, AbdElminaam DS, Albeshri MY, et al.: From Diagnosis to Prognosis: Enhancing Breast Cancer Survival Predictions Through the Application of Machine Learning and Feature Selection Techniques. In: 2025 International Mobile, Intelligent, and Ubiquitous Computing Conference (MIUCC). 2025; pp. 420–427.

[17] Ahmed KA, Humaira I, Khan AR, et al.: Advancing breast cancer prediction: Comparative analysis of ML models and deep learning-based multi-model ensembles on original and synthetic datasets. PLoS One. 2025; 20: e0326221.

[18] Wolberg W, Mangasarian O, Street N, Street W: Breast Cancer Wisconsin (Diagnostic) [Dataset]. UCI Machine Learning Repository. 1993.

[19] Van Rossum G, Drake FL Jr: Python reference manual. Centrum voor Wiskunde en Informatica, Amsterdam. 1995.

[20] The pandas development team: pandas-dev/pandas: Pandas. Zenodo. 2020.

[21] Harris CR, Millman KJ, van der Walt SJ, et al.: Array programming with NumPy. Nature. 2020; 585: 357–362.

[22] Hunter JD: Matplotlib: A 2D graphics environment. Comput. Sci. Eng. 2007; 9: 90–95.

[23] Waskom ML: seaborn: statistical data visualization. J. Open Source Softw. 2021; 6: 3021.

[24] Pedregosa F, Varoquaux G, Gramfort A, et al.: Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011; 12: 2825–2830.

[25] Tahmooresi M, Afshar A, Rad BB, et al.: Early Detection of Breast Cancer Using Machine Learning Techniques. 2019.

[26] Kayode AA, Akande NO, Adegun AA, et al.: An automated mammogram classification system using modified support vector machine. Med. Devices (Auckl). 2019; 12: 275–284.

[27] Shen L, Margolies LR, Rothstein JH, et al.: Deep Learning to Improve Breast Cancer Detection on Screening Mammography. Sci. Rep. 2019; 9: 12495.

[28] Suh YJ, Jung J, Cho BJ: Automated breast cancer detection in digital mammograms of various densities via deep learning. J. Pers. Med. 2020; 10: 1–11.

[29] Viswanath H, Guachi-Guachi L, Thirumuruganandham SP: Breast Cancer Detection Using Image Processing Techniques and Classification Algorithms. EasyChair Preprint. 2019.

[30] Hussain S, Ali M, Naseem U, et al.: Breast cancer risk prediction using machine learning: a systematic review. Front. Oncol. 2024; 14: 1343627.

[31] Debelee TG, Gebreselasie A, Schwenker F, et al.: Classification of mammograms using texture and CNN based extracted features. J. Biomim. Biomater. Biomed. Eng. 2019; 42: 79–97.

[32] Uthamacumaran A, Abdouh M, Sengupta K, et al.: Machine intelligence-driven classification of cancer patients-derived extracellular vesicles using fluorescence correlation spectroscopy: results from a pilot study. Neural Comput. Appl. 2023; 35(11): 8407–8422.

[33] Sharma R, Madan P, Hariharan S, et al.: Hybrid Radial Basis Function and Support Vector Machine Model for Precise Breast Cancer Diagnosis. In: 2024 International Conference on Computational Intelligence and Computing Applications (ICCICA). 2024; pp. 35–38.

[34] Mahmudah KR, Surono S, Rusmining IF: Impact of Different Kernels on Breast Cancer Severity Prediction Using Support Vector Machine. J. Electron. Electromed. Eng. Med. Inform. 2026; 8: 257–269.

[35] Satria A, Sitompul OS, Mawengkang H: 5-Fold Cross Validation on Supporting K-Nearest Neighbour Accuration of Making Consimilar Symptoms Disease Classification. In: 2021 International Conference on Computer Science and Engineering (IC2SE). 2021; pp. 1–5.

[36] Abedin T, Xu H, Uddin S: The impact of K selection in K-fold cross-validation on bias and variance in supervised learning models. Sci. Rep. 2026; 16: 6084.

[37] Gorriz JM, Martin-Clemente R, Segovia F, et al.: Is K-fold cross validation the best model selection method for machine learning? Inf. Fusion. 2026; 135: 104404.

Development of a machine learning predictive model for early detection of breast cancer

Abstract

Background

Objective

Methods

Results

Conclusions

Keywords

Revised Amendments from Version 5

1. Introduction

2. Methods

2.1 Study design and setting

2.2 Data source and inclusion criteria

2.3 Data preprocessing

2.4 Model development

2.5 Performance evaluation

2.6 Statistical evaluation metrics

2.7 Statistical tools and software

2.8 Ethical considerations

3. Results

3.1 Exploratory Data Analysis (EDA) and data preprocessing

Figure 1. Bar graph showing the frequency of diagnosis column.

3.2 Feature extraction and visualization

Figure 2. (A) Violin plot for the first ten features, (B) for the second set of features, (C) for the last set of features, (D) joint plot showing the correlation between concave points worst and concavity worst.

Figure 3. (A) Box plot of texture mean vs diagnosis of tumor, (B) texture worst vs diagnosis of tumor, (C) area mean vs diagnosis of tumor, (D) area worst vs diagnosis of tumor.

3.3 Label encoding

3.4 Dataset splitting and feature scaling

3.5 Model development

Figure 4. Forest plot comparing performance metrics across different classifiers.

3.6 Performance evaluation

Table 1. Confusion matrix values of the classification algorithms.

Figure 5. Receiver Operating Characteristic (ROC) curve for the classification model.

3.7 Additional performance metrics

Table 2. Performance evaluation of the classification algorithm.

3.8 Statistical validation of classifier performance

Table 3. Statistical validation of classifier performance using five-fold cross-validation.

4. Discussion

Table 4. Summary of related studies comparing datasets, machine learning models, performance outcomes, and limitations in breast cancer classification.

4.1 Strength and limitations of the study

5. Conclusion

Ethics and consent

Data availability

References
