Keywords
Cross-Validation, classification, Electronic Medical Records (EMR), K-nearest neighbor (KNN), Regression
Background: For years, cancer treatments have relied on tried-and-true methods: oncologists and clinicians typically recommend a series of surgeries, chemotherapy, and radiation therapy. Yet, even amidst these treatments, the number of deaths due to cancer increases at an alarming rate. The prognosis of cancer patients is influenced by mutations, age, and cancer stage; however, the association between these variables is unclear.
Methods: The present work adopts a machine learning technique, k-nearest neighbor (KNN), for both regression and classification tasks: regression for predicting the survival time of oral cancer patients, and classification for assigning patients to one of the predefined oral cancer stages. Two cross-validation approaches, the hold-out and k-fold methods, were used to examine the prediction results.
Results: The experimental results show that the k-fold method performs better than the hold-out method, yielding the lowest mean absolute error score of 0.015. Additionally, the model classifies patients into valid groups: of the 429 records, 97 of 106, 99 of 119, 95 of 113, and 77 of 91 were assigned their correct labels for stages 1, 2, 3, and 4, respectively. The accuracy, recall, precision, and F-measure obtained for the classification are 0.84, 0.85, 0.85, and 0.84, respectively.
Conclusions: The study showed that older patients with a higher number of mutations face a greater risk of short survival than young patients. Senior patients with a larger number of mutations also have an increased risk of progressing to the last cancer stage.
In response to the reviewer's comments, we have made several significant improvements to the paper:
To enhance the paper's credibility, we have included data from the World Health Organization to provide statistics regarding global oral cancer deaths in 2020. The full form of MAE (Mean Absolute Error) has been included at the beginning of the paper to ensure clarity when introducing the term. The data cleaning section underwent a complete rewrite to cover ETL processes, handling missing data, and data normalization comprehensively. We conducted a comprehensive comparison study that involved comparing KNN classifiers with other traditional ML classifiers such as SVM and RF. The results of this analysis are now presented in Table 1 of the paper. Visual representations, such as Figure 5, have been incorporated to enhance the clarity of the discussion on performance metrics.
These changes and clarifications collectively address the reviewer's concerns and enhance the overall quality and comprehensibility of the paper.
In India, nearly 1,300 people die every day due to cancer, as per Indian Council of Medical Research (ICMR) reports. The cancer rate has doubled and is likely to increase further in the upcoming years (Takiar Ramnath, 2010). On the other hand, cancer treatments have improved considerably over the past two decades. Structurally, cancer is caused by several mutations and the associated genes (Sanchez-Vega et al., 2018). The human body undergoes several mutations, but not all lead to cancer. When a gene is mutated, the activities it performs in the cell are disrupted, altering the cell's behavior; such cells may further turn cancerous. Information regarding cancer stages (de-novo or metastatic), the diagnostic process, and survival time is obtained by studying and understanding cancer genes. Fascinatingly, different fields are coming together to develop strategies and techniques extensively applied to treat cancer patients. Nonetheless, the death rate due to cancer is still increasing. Earlier studies indicate that a cancer patient's survival time is closely associated with age, mutated genes, and the number of mutations (Smith, Joan. C, 2018). Thus, survival prediction at an early stage is helpful in many ways: 1) surgical intervention could be reduced; 2) treatment could be altered based on body mass index and nutritional screening (Cavagnari, 2019) to avoid rapid cell proliferation caused by nutrient deprivation; 3) medication and therapies targeting proteins and the related mutations could be introduced; 4) the survival-predicting biomarkers responsible for oral cancer could be learned; and 5) mutation-driven drug discoveries could be made. Further, these predictions and classifications could help clinicians and oncologists determine patients' mortality rates and alter the prognosis procedure to better deal with cancer patients.
However, the task of predicting survival time is difficult, as cleaned and adequately curated data are not available in medical repositories, which also pose data-handling challenges such as data format, dimension scalability, data security, and privacy. Machine learning (ML) could be a one-stop solution to these limitations. Machine learning techniques require little or no human intervention for processing, decision making, and building models based on system inputs. Several correlations exist between machine learning and medical fields, based on which tremendous results have been achieved (Huang et al., 2020a). Research programs are being implemented in multinational companies such as Google (Liu, 2020), IBM (Matteo Manica, 2019), and Microsoft (Oktay, 2020) to expand the horizon for new ideas in both ML and medical diagnosis. The key processes in the medical world are information retrieval, data analysis, mining patterns from these data, and eventually extracting features that could help clinicians during treatment. Combining the fundamentals of machine learning with the advances achieved in cancer treatment could be immensely beneficial. To this end, in the present study, a machine learning technique, the k-nearest neighbor (KNN) algorithm, is applied to provide significant insights into the relationship between clinical factors such as age, mutated genes, and mutations and their impact on the survival time of cancer patients. To the best of the authors' knowledge, no other research studies have considered these factors to analyze their effect on survival time. Further, the patients' medical records are classified into one of the various oral cancer stages.
The research study aims to: i) identify the clinical factors contributing to the survival time of oral cancer patients; ii) model an ML-based survival time predictor that can generate a clinical report on the go, accessible anywhere and anytime; iii) classify patients into the different stages of oral cancer based on the prediction results; iv) understand the relationship between the prognostic markers for survival time and the stages of oral cancer; and v) validate the results using the Mean Absolute Error (MAE) and F-scores.
The remainder of this paper is organized as follows: Section 2 discusses the research background in connection with clinical factors and survival time. Section 3 discusses the materials and methods adopted in the present research study and the KNN algorithm used for classification and regression tasks. Section 4 demonstrates the experimental analysis and results for the proposed approach. Section 5 illustrates the validation of the results used to evaluate the system's performance. Section 6 elaborates on the shortcomings of the proposed work along with the scope for future study. Section 7 concludes the research paper.
Several studies on oral cancer indicate that it is one of the most prevalent cancers worldwide (Cervino Gabriele, 2019). Tackling the mortality rate is crucial, as the percentage of people diagnosed with oral cancer is growing considerably. According to the WHO, there were nearly 10 million cancer deaths worldwide in 2020 (Ferlay et al., 2020). In India, approximately 20 per 100,000 people are diagnosed with oral cancer, of whom 64.8% are males and 35.2% females. In the US, as per the American Cancer Society's (ACS) Cancer Facts and Figures 2020 (Siegel, Rebecca L, 2020), at least one person dies every hour because of oral cancer, and approximately 38,330 men and 14,880 women are diagnosed with oral cancer every year. The oral cancer death rate in China is approximately 0.89/100,000, and hospitals encounter roughly 52,500 new cases every year (Zhang, Shao Kai, 2015). In Saudi Arabia, the annual incidence is evaluated at >3.29% of cases (Al-Jaber, 2016). These statistics demonstrate the necessity of improving oral cancer treatments. Han-Jun Cho et al. used machine learning techniques to draw the association between specific gene mutations and the survival factor in lung squamous cell carcinoma (Cho, Han Jun 2018); the RapidMiner tool was adopted for the implementation, and feature extraction was realized using the Chi-squared test and correlation algorithms. Further, they used various classification algorithms such as Naïve Bayes, KNN, support vector machine (SVM), and decision trees to identify specific gene mutations efficiently; Fisher's exact test and Kaplan–Meier analysis were adopted for the data analysis. The work was implemented on The Cancer Genome Atlas (TCGA) lung adenocarcinoma (LUAD) cohort with the clinical information of 471 patients. Matlak and Szczurek (2017) showed the types of mutations that influence the aggressiveness of cancer and their consequence on patients' survival time.
The interactions between mutated genes are accelerated by epistatic communication. A statistical likelihood-ratio test was proposed to recognize the biomarker tumor suppressor p53-binding protein (TP53BP1), which targets poly-ADP (adenosine diphosphate) ribose polymerase in breast cancer gene-impaired tumors. Zhang et al. studied the association between racial disparities in cancer survival and mutation (Zhang, Wensheng, 2017). They conducted a consolidative analysis of TCGA clinical samples and genomic data, establishing a relation between racial imbalance and patient survival time; the analyses were based on Kaplan–Meier standards and achieved an area under the curve (AUC) of 0.69. Chen et al. conducted a meta-analysis of the connection between TP53 mutations and survival time in osteosarcoma patients using published data (Chen, Zhe, 2016). Their study suggested that TP53 mutations had a significant impact on patients' survival time, indicating that TP53 mutations could be used as a prognostic marker to estimate the survival rate of patients suffering from osteosarcoma; it also demonstrated that altered TP53 had a poor response to chemotherapy and brought down the survival rate of cancer patients. Ling et al. constructed a cohort of metastatic breast cancer patients using natural language processing techniques such as semi-supervised classification (Ling, Albee Y., 2019). They classified the patients' EMRs into the de-novo stage or the recurrent metastatic breast cancer stage, proven with good sensitivity and specificity measures. The SVM technique was used to classify and categorize patients with diabetes based on their EMR progress notes (Wright, Adam, 2013); this method achieved an F-score of 0.93 and an AUC of 0.956.
The present study aims to predict the number of survival days to understand cancer patients’ mortality rates based on clinical factors and classify them into cancer stages. Figure 1 shows the flowchart of the proposed model.
Phase 1 illustrates how data are collected from the data source. Phase 2 shows the data analysis: data cleaning, data splitting, and applying the KNN algorithm as a regression task to predict the survival time of a cancer patient, with the predictions further validated using standard cross-validation methods. Phase 3 performs classification using the KNN classifier, which is evaluated using a confusion matrix.
A total of 1,505 oral cancer data records were downloaded from the International Cancer Genome Consortium (ICGC) data portal. The ICGC is a global genomic data-sharing platform that provides the international community with a broad set of genomic data and related cancer types. The dataset contains 50, 178, 243, and 1,034 patients’ records from China, India, Saudi Arabia, and the United States.
Data Extraction: Data are initially collected and stored in a .tsv file for all four countries. This file contains a wide range of information, including various entities and attributes.
Data Cleaning: This step focuses on preparing the data for analysis. In this case, not all of the fields in the dataset are relevant for the specific task, which is predicting survival days. Therefore, unnecessary fields are dropped as part of data cleaning. This helps streamline the dataset and remove any noise or irrelevant information that could negatively impact the model’s performance.
Data Splitting: To effectively train a machine learning (ML) model, the dataset needs to be divided into two subsets: a training set and a test set. This is a critical step for evaluating the model's performance accurately. Different split ratios (70:30, 80:20, and 90:10) were considered and tested, and the Mean Absolute Error (MAE) score, which measures prediction accuracy, was calculated for each ratio. The chosen split ratio is 80:20: 80% of the data is used for training the model and 20% for testing. This choice aims to strike a balance between overfitting (training too well but performing poorly on new data) and underfitting (not learning enough to make accurate predictions).
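The split described above can be sketched in plain Python. The 80:20 ratio and the 1,505-record count come from the study; the `holdout_split` helper name and the fixed seed are illustrative assumptions, not the study's actual code.

```python
import random

def holdout_split(records, train_ratio=0.8, seed=42):
    """Shuffle the records and split them into training and test subsets.

    train_ratio=0.8 reproduces the 80:20 split chosen in the study;
    0.7 or 0.9 would reproduce the other ratios that were compared.
    The fixed seed is only for repeatability of this sketch.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# With the 1,505 ICGC records, an 80:20 split leaves 1,204 records
# for training and 301 for testing.
train, test = holdout_split(range(1505))
```

In practice each record would be a full patient row rather than an index; the MAE of the resulting model can then be compared across the three ratios.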
Handling Missing Data: Missing data has been addressed in two ways: 1) Data Imputation: This method involves estimating missing values based on available data. Common techniques for data imputation include mean imputation, median imputation, mode imputation, and more advanced methods like regression imputation or predictive modeling. 2) Dropping Missing Values: In some cases, if the missing data is minimal and not significant for the analysis or prediction task, you may choose to simply remove rows or columns with missing values. However, this should be done with caution, as it can lead to loss of information. Examples of the Data Imputation process are shown below in Tables 1 and 2.
Let’s consider an example where we have a dataset of patients with missing values in the “Age” and “Survival Days” columns. We’ll use mean imputation to fill in the missing values in the “Age” column:
After Mean Imputation:
In this example, missing “Age” values are imputed with the mean age of the available data (45 + 32 + 50) /3 = 42.33. The “Survival Days” column, which has missing values, remains unchanged.
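The mean-imputation step in this worked example can be reproduced with a short, library-free sketch; the `impute_mean` helper is hypothetical, not part of the study's code.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

# The "Age" column from the worked example, with one missing entry.
ages = [45, 32, None, 50]
filled = impute_mean(ages)  # missing age becomes (45 + 32 + 50) / 3 ≈ 42.33
```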
Normalization (Standardization): The dataset contains fields with diverse value ranges and types (e.g., integers for age and alphanumeric values for tumor stage). To ensure that these differences in scale and data type do not adversely affect the model’s performance, standardization (z-score normalization) is applied. Standardization scales the data so that it has a mean of 0 and a standard deviation of 1, making it consistent and comparable across all fields. This step is crucial for many machine learning algorithms that are sensitive to the scale of input features.
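Z-score standardization as described can be sketched as follows; the small "Age" sample and the `standardize` helper are purely illustrative.

```python
def standardize(values):
    """Z-score normalization: (x - mean) / std, giving mean 0 and std 1."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

# After scaling, the values are directly comparable with other
# standardized fields regardless of their original units or ranges.
ages_z = standardize([45, 32, 42, 50])
```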
Implementation Platform: Finally, the implementation of the machine learning model and related methodologies is done using Python programming (specifically, Python 3.9.0) on the Spyder v5.1.1 platform. This choice of programming language and environment allows for the development, training, and testing of the predictive model.
i. K-nearest neighbors for regression and classification
Description: KNN is one of the most straightforward machine learning algorithms, used for both classification and regression tasks. It was initially developed in the early 1950s by Fix and Hodges (Fix, Evelyn, 1989) of the US Air Force School of Aviation Medicine. Classification involves grouping a given dataset into predefined classes, such as classifying records into one of the four stages of oral cancer (Chowdhury, Shovan, 2020). In contrast, regression is used for predicting continuous values such as age, weight, and survival time (Huang et al., 2020b). The KNN algorithm requires a feature space that contains the training data points; this 'feature similarity' is used to predict the label of a new data point based on its similarity to the existing points in the space. The algorithm determines the distances between an unknown data point and the nearest 'k' training data points and assigns the new point to the class most common among those neighbors. The value of 'k' is the number of data points selected from the training dataset. The algorithm starts by selecting a metric to calculate the distance between the test data point and the training points (Gusti Prahmana, 2020); the distances between the new/unknown data point and the 'k' points are calculated using a distance metric such as the Euclidean, Manhattan, or Minkowski distance (Ali, Najat, 2019).
ii. KNN algorithm for classification and regression task
Input: Features expressed by the ICGC dataset F = {f1, f2, f3, …, fn}, training dataset T = {t1, t2, t3, …, tn}, classes C = {c1, c2, c3, …, cn}, and the test dataset R
Output: The test data are classified into one of the classes in 'C' (classification) or assigned a predicted survival time (regression)
Step: i) Load the training and test data into the KNN algorithm
Step: ii) Set the value of k (the user will set the value of 'k')
Step: iii) For each test data point, implement the following:
a) Calculate the Euclidean distance between the test data point and every point in the training dataset 'T'
b) Store the computed distances in a list
c) Arrange the list of all the Euclidean distances in ascending order
d) Pick the first k points from the training dataset 'T'
e) Classification task: Classify the test data 'R' to the class in 'C' with the maximum number of feature points near the test data
(or)
Regression task: Calculate the mean of the survival times of these 'k' nearest neighbors; this mean is the predicted survival time for the test data point
End For
Step: iv) Repeat step iii for all the test data points
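The steps above can be sketched as a minimal, library-free KNN that handles both tasks. The helper names and the toy (age, mutation-count) points are illustrative assumptions, not the study's data or code.

```python
from collections import Counter

def euclidean(p, q):
    """Distance metric used in step iii-a."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def knn(train_points, train_targets, test_point, k, task="classification"):
    """Rank training points by distance, keep the k nearest (steps a-d),
    then take a majority vote or the mean of the neighbors (step e)."""
    order = sorted(range(len(train_points)),
                   key=lambda i: euclidean(train_points[i], test_point))
    nearest = [train_targets[i] for i in order[:k]]
    if task == "classification":
        return Counter(nearest).most_common(1)[0][0]  # most frequent class
    return sum(nearest) / k  # mean survival time of the k neighbors

# Toy (age, mutation-count) records: three stage-T1 and three stage-T4.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
stages = ["T1", "T1", "T1", "T4", "T4", "T4"]
days = [100, 110, 120, 2000, 2100, 2200]

stage = knn(points, stages, (1.5, 1.5), k=3)  # "T1" by majority vote
survival = knn(points, days, (9, 9), k=3, task="regression")  # mean of 3 nearest
```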
Statistical representation is required for estimating the predictive capacity of the model. For this purpose, MAE is used (Botchkarev, 2018). The MAE is the simplest error metric, built from the typical magnitude of the residual values, where a residual is the difference between the actual and predicted values. This value is computed for each input, and the absolute value of each residual is taken so that positive and negative values do not counteract and nullify each other. The MAE for each of the 'k' scores and the associated input values is calculated as follows:
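The calculation can be written out as a short sketch of the standard formula MAE = (1/n) * sum(|actual_i - predicted_i|); the sample values below are made up for illustration.

```python
def mean_absolute_error(actual, predicted):
    """MAE = (1/n) * sum(|actual_i - predicted_i|); taking absolute values
    prevents positive and negative residuals from cancelling out."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Three predictions with residuals 0.5, 0.0, and 1.5.
mae = mean_absolute_error([3.0, 5.0, 2.5], [2.5, 5.0, 4.0])  # (0.5 + 0 + 1.5) / 3
```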
The graphical representation of MAE against the value of 'k' is shown in Figure 2. Small MAE values indicate that the difference between the actual and predicted values is low, implying a good model; conversely, large MAE values suggest that the model is a poor estimator (Sarker, Iqbal H, 2019). In the scatter plot, the y-axis represents the predicted MAE and the x-axis shows the variation in 'k'; for each 'k', the MAE gives the average vertical distance between the actual and predicted values. The graph between MAE and 'k' instances (k = 2 to 20) indicates that the MAE values decrease as the number of neighbors 'k' increases. When k = 1, the MAE score is high, indicating a high error rate. The value of 'k' was varied from 2 to 19 to determine a suitable 'k' for our experiments; when k = 6, MAE = 0.2, the lowest in the entire validation curve, so 'k' was set to 6 for calculating the survival time.
The dataset, now containing the predicted survival time values along with the previously used parameters, was input into the KNN classifier. Again, to determine the best 'k', the value was varied from 2 to 20; experimentally, when k = 7, the MAE score attained its lowest value of 0.1. The cloud of seven neighboring points around the new point is observed in the 'feature similarity' space, and the test data point is assigned the class label with the highest number of instances among them. If just one example of each class label exists in the feature space, the input data point is assigned to the closest class. The tumor stages are divided into four variations: T1, T2, T3, and T4. A total of 429 records were classified into their respective classes using the remaining 1,076 data records as training data. Figures 3(a)–(f) show the graphical analysis for different data points relative to the feature space. Each scatterplot displays the variation of cancer stages and the related records with distinct values on the x- and y-axes; a linear relationship is exhibited between any two selected variables.
In this section, the efficiency is measured using accuracy estimation methods, and two main approaches are compared: Hold-out and k-fold cross-validation.
This is the most naïve and straightforward approach to cross-validation. The entire dataset is split only once into training data and test data. The training dataset is chosen randomly, and the fit function is used to train the KNN model. The test data, usually one-third of the original data, is later predicted and compared with the actual values (Xu, Yun, 2018). The errors obtained are stored and returned as the mean absolute test error, a risk metric for evaluating the regression model. The MAE score attained by this method is 9.3, which is relatively high and indicates that the model performs poorly on the dataset; this is potentially due to high variance leading to over-fitting. On the other hand, the hold-out method consumes minimal time to compute the model's efficiency, leading to minimal time complexity.
To improve over the hold-out method, the data must be split into equally sized folds (Wong, Tzu Tsung, 2020). The dataset comprised 1,505 entries; for the k-fold error estimation, only 1,500 were considered, and five samples were skipped based on selective sampling. A single fold was used for testing, and the remaining k-1 folds were used for training (Yadav, Sanjay, 2016). For k-fold cross-validation, the entire dataset Ds (s = 1, 2, 3, …, n) was positioned uniformly and divided into k folds of equal size, such that each division Di contains an equal number of data points to be evaluated. For experimentation, we used the 10-fold cross-validation method. The dataset contained 1,500 records; therefore, F = {f1, f2, f3, …, f1500}, and each feature fi itself contains attributes such as age, gender, tumor stage, survival time, mutations, and mutated genes. Eq. (4) explains how the 10-fold cross features are selected and provided for the experiment. Since k = 10, the dataset 'F' is divided into ten groups of 150 data points each.
The accuracy is calculated at each fold, and the mean of all the 10-fold accuracies is estimated. The test data are kept isolated at every iteration of each fold to obtain an unbiased approximation of model performance; the test data were never used to fine-tune the model. Once the data are divided into 10 folds of training and test data, the accuracy is calculated using the standard MAE score. The validation error rates obtained are as follows: -0.036, -0.030, 0.014, 0.019, 0.040, 0.035, 0.061, 0.020, 0.028, and -0.0001; the mean validation error rate is 0.015. This error rate is negligible and could be rounded to zero. Thus, the k-fold method outperformed the hold-out method in terms of the MAE score when practically applied.
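The fold construction and the reported fold scores can be checked with a small sketch; `k_fold_indices` is a hypothetical helper, while the ten validation error rates are the values reported above.

```python
def k_fold_indices(n_records, k):
    """Partition record indices into k equal folds; each fold serves once as
    the test set while the remaining k-1 folds form the training set."""
    fold_size = n_records // k
    folds = [list(range(i * fold_size, (i + 1) * fold_size)) for i in range(k)]
    splits = []
    for i in range(k):
        test_fold = folds[i]
        train_folds = [idx for j in range(k) if j != i for idx in folds[j]]
        splits.append((train_folds, test_fold))
    return splits

# 1,500 records and k = 10 give ten folds of 150 records each.
splits = k_fold_indices(1500, 10)

# Mean of the ten reported validation error rates.
scores = [-0.036, -0.030, 0.014, 0.019, 0.040,
          0.035, 0.061, 0.020, 0.028, -0.0001]
mean_error = sum(scores) / len(scores)  # ≈ 0.015, the reported mean
```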
Classification accuracy is the proportion of correct predictions among all the predictions made by the proposed model; with the accuracy measure, the robustness of the model is determined. In most cases, classifier results are presented as a confusion matrix, which is generally expressed in terms of four measures: true-positive (TP), true-negative (TN), false-positive (FP), and false-negative (FN) (Lever, Jake, 2016). TP is when the cancer is classified into its correct stage; TN is when a patient's record is correctly not assigned to a wrong stage; FP is when the patient's record is classified into the wrong stage, e.g., when a record is ranked as cancer stage-4 but should have been recognized as cancer stage-3. Finally, FN is when a patient's record is wrongly excluded from a specific stage or sometimes not recognized as a cancer stage at all. FN leads to disastrous results for cancer datasets, as a patient would be detected as not having cancer; no measures would be taken, the cancer would aggravate, and the patient would remain undiagnosed until severe symptoms appear. The subsequent metrics related to accuracy are precision and recall. Precision measures the number of correct positive predictions over all the positive predictions made. Recall is the true-positive rate, indicating the ratio of all cancer samples that the model accurately predicted. The F-score represents the balancing weight between recall and precision. Eqs. 5a, 5b, and 5c define the recall, precision, and F-measure, respectively.
The accuracy achieved for the classification of cancer stages is shown in Figure 4. The model efficiently classified 429 data records into one of the four oral cancer stages.
The figure illustrates the following: i) the confusion matrix (top left panel), built from the 429 records used for the classification task, whose row-wise and column-wise totals agree; ii) the calculation of precision (column-wise) and recall (row-wise); and iii) the f1-score, accuracy, and macro and weighted average scores.
Performance Metric Calculation: Confusion Matrix: Initially, the top left panel of Figure 4 illustrates the confusion matrix of the four stages of oral cancer obtained with the classifier. The confusion matrix summarizes the classifier’s performance in categorizing each cancer stage. As observed, both row-wise and column-wise additions of the confusion matrix result in a total of 429, as indicated by support records (106 + 119 + 113 + 91 = 429). This value represents the sum of all samples involved in the classification task.
Individual Stage Metrics: The key performance metrics, including precision, recall, and F-Score, are calculated for each of the cancer stages using specific formulae. Let’s take the example of T1:
• Precision (Column-Wise Representation): Precision is calculated as TP/(TP + FP), where TP is the number of true positives, and FP is the number of false positives. For T1, precision is computed as 97/(97 + 15 + 3 + 8) = 0.78.
• Recall (Row-Wise Representation): Recall is determined as TP/(TP + FN), where TP is the number of true positives, and FN is the number of false negatives. For T1, recall is derived as 97/(97 + 3 + 1 + 5) = 0.91.
• F-Score: The F-Score, a harmonic mean of precision and recall, is calculated using the formula (2 * Precision * Recall)/(Precision + Recall). For T1, the F-Score is computed as (2 * 0.78 * 0.91)/(0.78 + 0.91) = 0.84.
Similar calculations are performed for the other cancer stages (T2, T3, T4).
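The T1 worked example can be verified with a small sketch; `stage_metrics` is a hypothetical helper using the counts read off the confusion matrix above.

```python
def stage_metrics(tp, fp, fn):
    """Precision, recall, and F-score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Stage T1: 97 true positives, 15 + 3 + 8 false positives (rest of the
# column), and 3 + 1 + 5 false negatives (rest of the row).
p, r, f = stage_metrics(tp=97, fp=15 + 3 + 8, fn=3 + 1 + 5)
```

Computed without intermediate rounding, precision is 97/123 ≈ 0.789 and recall is 97/106 ≈ 0.915, matching the 0.78 and 0.91 quoted above; the F-score then comes out slightly above the quoted 0.84, since the text rounds precision and recall before combining them.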
Overall Accuracy Calculation: The overall accuracy is calculated by considering all samples together. It is determined as the number of correctly classified samples (TP for all stages) divided by the total number of samples (429). This calculation results in an overall accuracy of 85.7%. Figure 5 shows a plot to visually describe the performance metrics considered in the classification process.
Macro and Weighted Averages: To assess the classifier’s overall performance, macro and weighted averages are considered. Macro-averaged scores represent the arithmetic mean of precision, recall, and F1-Score across all stages. For instance:
• Macro-Average Precision = (0.78 + 0.93 + 0.93 + 0.78)/4 = 0.85.
• Macro-Average Recall = (0.91 + 0.83 + 0.84 + 0.84)/4 = 0.85.
• Macro-Average F1-Score = (0.84 + 0.87 + 0.88 + 0.80)/4 = 0.84.
Weighted averages are calculated by weighting each stage's metric by its number of support records and dividing the sum by the total number of support records (429). For example:
• Weighted Average Precision = (0.78 * 106 + 0.93 * 119 + 0.93 * 113 + 0.78 * 91)/429 = 0.86.
• Weighted Average Recall = (0.91 * 106 + 0.83 * 119 + 0.84 * 113 + 0.84 * 91)/429 = 0.85.
• Weighted Average F1-Score = (0.84 * 106 + 0.87 * 119 + 0.88 * 113 + 0.80 * 91)/429 = 0.85.
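The macro and weighted averages can be reproduced directly from the per-stage scores and support counts listed above; this is a small sketch, shown for precision only.

```python
# Per-stage precision and support (records per stage) reported above.
precision = [0.78, 0.93, 0.93, 0.78]
support = [106, 119, 113, 91]  # sums to the 429 classified records

# Macro average: plain arithmetic mean over the four stages.
macro_precision = sum(precision) / len(precision)  # ≈ 0.85

# Weighted average: each stage's score weighted by its support count.
weighted_precision = (sum(p * s for p, s in zip(precision, support))
                      / sum(support))  # ≈ 0.86
```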
Overall Accuracy Verification: The overall accuracy is verified by examining all samples simultaneously. It involves identifying the correctly predicted cancer stages (TP) located along the mid-diagonal of the confusion matrix. The accuracy is then calculated as follows:
Overall Accuracy = (97 + 99 + 95 + 77) / [(97 + 99 + 95 + 77) + (3 + 1 + 5 + 15 + 2 + 3 + 3 + 2 + 13 + 8 + 2 + 4)] = 0.8578.
Thus, the proposed classifier demonstrates an 85.7% accuracy in effectively segregating oral cancer data records into various stages in a multi-class classification, as supported by the experimental results.
In order to evaluate the classification performance of our proposed methodologies, we conducted experiments using three distinct classifiers: k-Nearest Neighbors (kNN), Support Vector Machine (SVM), and Random Forest (RF). The obtained results for these classifiers are summarized in Table 3 below. This table provides an overview of key performance measures, including accuracy, recall, precision, and F-Score, for each classifier in classifying patients into different stages.
The present study was conducted to review the association between clinical factors and patients' survival time. The research could considerably influence the treatment cycle and subsequent intervention in the prognosis procedure. The correlation between cancer patients' mutations and mortality rate indicates that older patients with several mutations had a short survival time; in contrast, even with an average number of mutations, say 2,000-4,000, the survival time was long for a young patient. The predicted survival time of patients at an early cancer stage displayed a longer survival of 5-7 years, followed by metastatic cancers. Also, patients with a short survival time were classified into cancer stage-4. Thus, the study highlights the following: i) more males are diagnosed with oral cancer than females; ii) the number of mutations increases with the patient's age; iii) when the number of mutations is high, the number of mutated genes is also high; and iv) tumor stage-4 has more mutations and mutated genes and comprises more older patients than younger ones. This indicates that treatment selection and diagnosis are influenced by these fundamental factors, which most clinical studies overlook; with the regression technique, the present study models the effect of the multivariate data points used as input variables. The dataset may be scaled up to include many datasets with specific tumor stage classifications to better understand efficacy. Factors such as age and mutations directly impact the survival rate of the patient.
The present study has some limitations. As pointed out earlier, the corpus size in ML-based research contributes to the system's robustness and accuracy. When the corpus size is relatively small, the model suffers from overfitting, leading to misclassification of the cancer stage and incorrect survival time prediction. Feature selection is likewise a crucial ingredient for integrity: a single entry recorded one row off by an operator can produce inconsistent data, eventually compounding into many errors at the later evaluation stage. While selecting essential features from the dataset, factors such as measurability, pathological assessments, and relevance must be considered, which the present study could not fully address. The future directions of this work are: i) apply ML algorithms to analyze both clinical and genomic factors to study which mutations predominantly contribute to causing cancer; ii) build an ML-based pathological model such that, when data are input from home or the clinic, it predicts cancer progression and reports the survival time, the treatments likely to follow, and drugs based on the genomic make-up of the patient; iii) develop a pre-clinical ML model to control cancer development through early diagnosis and deliver clinical trials with higher accuracy; and iv) investigate gene-specific reactions, their role in inducing mutations in protein structure, and the corresponding relevance to survival time. This would enable clinicians to adopt effective, targeted therapy options.
Over the past few decades, the advent of ML-based algorithms for cancer treatment has paved the way for better prognosis procedures. However, predicting survivability by analyzing the clinical factors present in EMRs is often neglected. In this study, KNN, a supervised machine learning approach, was applied to estimate the survival time of oral cancer patients precisely; the study also classifies the dataset into specific cancer stages. With the increasing number of deaths due to oral cancer, such a study is necessary. The fundamentals of survival-time analysis were explored using the available data, and the limitations owing to poor diagnosis were discussed. A whole new treatment procedure could be implemented if the survival time is known, and in precision medicine, identifying the key proteins and mutated genes is essential for devising treatment strategies. Most importantly, the proposed approach illustrates the use of digital data that the healthcare system collects consistently but clinicians underexploit. The study demonstrates the accuracy measures using cross-validation and F-scores.
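The stage-classification evaluation summarized above (accuracy, precision, recall, F-measure) can be sketched as follows. This is a minimal illustration assuming scikit-learn: the four-class data comes from `make_classification` as a synthetic stand-in for the EMR features, so the metrics printed are not the paper's reported scores.

```python
# Sketch: KNN classification into four stages with a hold-out split,
# reporting the same metrics the study uses. Data is synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Four classes standing in for cancer stages 1-4.
X, y = make_classification(n_samples=429, n_features=6, n_informative=4,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
pred = clf.predict(X_te)

acc = accuracy_score(y_te, pred)
# Macro averaging weights each stage equally, as in per-class reporting.
prec, rec, f1, _ = precision_recall_fscore_support(y_te, pred, average="macro")
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

A per-stage breakdown, analogous to the counts of correctly labeled records per stage reported in the results, is available by passing `average=None` instead of `"macro"`.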
For ease of use of the proposed methodologies, the entire code and all datasets with relevant results have been deposited in the GitHub repository (https://github.com/RashmiSKarthik/Machine). The data has also been deposited on Zenodo (DOI: 10.5281/zenodo.5819317). The code can be scaled up to other cancer datasets, and the model runs efficiently on any Python platform. Any copyrighted material of this research may be reproduced with appropriate citation.
The repository is publicly available. Any queries and concerns related to code and implementation may be directed to the corresponding author of this manuscript.
One of the authors (RS) acknowledges the Department of Science and Technology – Science and Engineering Research Board (DST-SERB), New Delhi, India, for providing a research grant and postdoctoral fellowship (NPDF, sanction order no PDF/2019/000254). The authors would like to thank the Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, India, for providing complete support to execute this work.