Research Article

Identification of Biomarkers for Severity in COVID-19 Through Comparative Analysis of Five Machine Learning Algorithms

[version 1; peer review: 1 approved with reservations, 1 not approved]
PUBLISHED 25 Jun 2024

This article is included in the Artificial Intelligence and Machine Learning gateway.

Abstract

Background

COVID-19 is a global public health problem.

Aim

The main objective of this research is to evaluate and compare the performance of the algorithms: Random Forest, Support Vector Machine, Logistic Regression, Decision Tree, and Neural Network, using metrics such as precision, recall, F1-score and accuracy.

Methods

A dataset (n=138) with numerical and categorical variables was used. The algorithms Random Forest, Support Vector Machine, Logistic Regression, Decision Tree, and Neural Network were considered. These were trained using an 80-20 split. The following metrics were evaluated: precision, recall, and F1-score, together with 5-fold stratified cross-validation.

Results

The Random Forest algorithm was superior, achieving a maximum cross-validation score of 0.9727. The correlation analysis identified ferritin (0.8277) and oxygen saturation (-0.6444) as the variables most strongly correlated with severity. The heuristic model was compared with models obtained through metaheuristic search, which maintained the metrics with only 3 variables and a stable weight distribution. A perplexity analysis made it possible to differentiate between the best models; creatinine and ALT stand out as features of the model with the best CV score and the lowest perplexity.

Conclusion

A comparative analysis of different classification models was carried out to predict the severity of COVID-19 cases from biological markers.

Keywords

Biological markers, Cross-validation, Ferritin, Machine learning, Metaheuristics, Oxygen saturation, Random forest.

1. Introduction

The coronavirus disease COVID-19 is a significant global public health problem. Given the ease with which new strains can emerge, it is essential to investigate and understand their pathophysiology using precise techniques.1 Machine learning (ML) emerges as a promising tool, offering the possibility of improving precision and reducing the time of variable analysis to deeply understand the pathophysiology induced by COVID-19,2 and, consequently, improve patient treatment.

When using machine learning in the study of this clinical condition, it is necessary to choose between supervised or unsupervised learning and determine the appropriate algorithm, such as Random Forest, Support Vector Machine, Logistic Regression, Decision Tree, Neural Network (MLP), among others. Various studies3,4,5 have demonstrated the effectiveness of these algorithms in biomedical problems. In relation to COVID-19, there is also evidence6 that precise biological markers can be decisive in the patient's prognosis.

Regarding the prediction of COVID-19 severity from biomarkers, Gharib et al.7 evaluated inflammatory biomarkers and associated risk factors in 150 Egyptian patients. The study found a significant negative correlation between percent oxygen saturation and serum levels of inflammatory markers, including ferritin.

In another study of 50 patients infected with COVID-19, conducted in Peshawar by Khan et al.,8 an increase in the levels of CRP, ferritin, and IL-6 was detected, as well as changes in neutrophil and lymphocyte counts. The authors concluded that elevations in CRP and ferritin are linked to secondary bacterial infections and adverse clinical outcomes. Findings from a meta-analysis suggest that serum ferritin levels correlate with the severity of COVID-19. Specifically, COVID-19 patients showed markedly higher ferritin levels compared to controls, with a standardized mean difference (SMD) of -0.889 and a 95% CI of (-1.201, -0.577).

Furthermore, patients with severe to critical COVID-19 symptoms showed elevated ferritin levels compared to those with mild to moderate symptoms, with an SMD of 0.882 and a 95% CI of (0.738, 1.026). Also significant was the finding that nonsurvivors had a pronounced increase in ferritin levels compared to survivors, with an SMD of 0.992 and a 95% CI of (0.672, 1.172). These observations emphasize the potential usefulness of serum ferritin as a biomarker in the management of COVID-19, although the presence of other comorbidities and confounding factors requires cautious interpretation of the results.9

A study conducted in Alexandria, Egypt by Abdelhalim et al.10 analyzed 210 non-hospitalized patients with confirmed COVID-19, aged between 14 and 75 years (mean age 44.5 ± 30.5). It was observed that 71.4% of these patients had high levels of serum ferritin, identifying ferritin as a significant biomarker for the diagnosis of COVID-19 in this population, with a p-value of 0.014738.

A comprehensive review of databases, including MEDLINE, EMBASE and others, identified relevant studies related to laboratory parameters in COVID-19 cases of different severities. Of 9,620 records, 40 studies with a total of 9,542 patients were included in the final analysis. The results showed that lymphopenia, thrombocytopenia and elevated levels of interleukin-6, ferritin, and other biomarkers were associated with severe and fatal cases of COVID-19. In particular, elevated interleukin-6 and hyperferritinemia were identified as indicators of systemic inflammation and a poor prognosis in patients with COVID-19.11

In a study conducted at the Department of Pathology and Laboratory Medicine, Aga Khan University (AKU), Karachi by Sibtain et al.,12 medical records of patients hospitalized with confirmed COVID-19 from March 1 to August 10, 2020 were reviewed. A total of 157 patients were included in the final analysis, of which 108 were men and 49 were women. The analysis revealed a significant difference in ferritin levels between categories based on COVID-19 severity and mortality. Through binary logistic regression, ferritin was identified as an independent predictor of all-cause mortality in patients with COVID-19, with an AUC of 0.69 in the ROC analysis. The study concludes that serum ferritin concentration is a promising predictor of mortality in cases of COVID-19.

In the work of Samprathi et al.,13 an extensive investigation protocol was designed for patients with COVID-19 that varies depending on the severity of symptoms and the presence of comorbidities.
For those asymptomatic or with mild symptoms without comorbidities, no further investigations were requested. Patients with mild or moderate comorbidities, upon admission, needed tests such as CBC, CRP, serum creatinine, and liver function tests. In the presence of abnormalities, additional investigations were requested.

For severe cases, tests such as PT, APTT, INR and specific biomarkers were added, while for critical cases serial IL-6 and lactate levels were requested. Monitoring for hospitalized patients was performed using CBC and CRP every 48 to 72 hours. However, serum ferritin is not recommended to monitor response to treatment. For children with suspected MIS-C, a stepwise testing strategy is suggested, including cytokine testing and SARS-CoV-2 serology.

To predict the outcome in patients with COVID-19, AI techniques were successfully used, specifically the Random Forest and AdaBoost algorithms. These algorithms, using varied patient data, achieved a precision of 0.94 and an F1 score of 0.86. A correlation between gender and mortality was also highlighted, with the majority of patients between 20 and 70 years old.14 In the work of Ahmed et al.,15 machine learning methods (Random Forest, Support Vector Machine, Logistic Regression, Decision Tree and k Nearest Neighbors) were compared; balancing of the dataset is not mentioned in this study. In the study by Wang et al.,16 the use of Random Forest to predict the severity of COVID-19 is reported, reaching an accuracy of 0.9905. For this purpose, they applied balancing using the SMOTE technique, achieving an AUC of 0.846.

Similarly, Elif et al.,17 used the SMOTE technique to balance their data. The Random Forest model they developed in their research returned a precision of 0.9796 and an AUC of 0.959. It is highlighted that some classes in their study had predictions lower than 90%, while others achieved perfect metrics. Among the biomarkers they identified as significant were LDH, high leukocyte count and C-reactive protein. On the other hand, Patterson et al.,18 also made use of the SMOTE balancing technique and developed a model that presented perfect metrics in all evaluated categories. Although they do not explicitly detail the cross-validation process in their study, they mention having used 10-fold CV (without specifying whether it was stratified) to perform an exhaustive search (grid-search) and determine the best hyperparameters. Additionally, it is noted that this validation was used exclusively for feature selection and not for dimensionality reduction.

In the work of Cui et al.,19 it is highlighted that its Random Forest model was able to predict the severity of the condition with a precision of 84.2% and an AUC of 0.874 for the first group and 0.842 for the second, both with an interval of confidence of 95%. LDH, D-Dimer and Fibrinogen stand out as key biomarkers. The main focus of their study, which included 437 patients, was to analyze the differences in these biomarkers between young (≤ 70 years) and older (≥ 70 years) people.

In the scope of the feature selection problem, metaheuristic algorithms have proven effective, even in their most primitive versions, compared with local and/or heuristic search mechanisms. Kabir et al.20 used genetic algorithms for variable selection; in particular, their work uses a variant of the algorithm that reduces redundancy during the search for variables to improve performance.

The work of Guha et al.21 mentions the effectiveness that simulated annealing has shown in multidimensional variable selection problems. Like the previous work, it proposes an improvement to the algorithm through hybridization with another search mechanism that improves its overall performance. In the work of Chen et al.,22 the crow search algorithm was applied to variable selection problems; the authors faced late-stage diversity challenges, which they addressed effectively with a hierarchical adaptive approach. In a study by Bandyopadhyay et al.23 on the detection of COVID-19 in radiological images, a two-stage pipeline consisting of feature extraction and selection was proposed. A CNN model based on the DenseNet architecture was used for feature extraction. To filter out non-informative and redundant features, the Harris Hawks Optimization (HHO) algorithm with Simulated Annealing (SA) and chaotic initialization was used. When evaluating on the SARS-COV-2 CT-Scan dataset (2,482 CT scans), an accuracy of 98.85% was achieved with the inclusion of chaotic initialization and SA. The study reports a 75% reduction in the number of selected features.

The main objective of this research is to evaluate and compare the performance of the algorithms: Random Forest, Support Vector Machine, Logistic Regression, Decision Tree, and Neural Network, using metrics such as precision, recall, F1-score and accuracy. With the models developed from these algorithms and metrics, we seek to identify which biological markers are affected in the pathophysiology of COVID-19 in order to monitor the clinical evolution and make medical decisions that contribute to the patient's recovery.

The structure of this document is as follows: section 2 details the methodology used in this research and the experimental design, section 3 presents the results. Section 4 focuses on discussing these results. Section 5 concludes the study. Section 6 shows some recommendations for future work.

1.1 Research question

What are the laboratory biomarkers to predict the severity of SARS-CoV-2 infection in patients from southeastern Mexico?

2. Methods

In this study, a comparative analysis of different classification models was carried out to predict the severity of COVID-19 cases.

2.1 Programming language details

The Python programming language was used for all calculations. For the classification models, the scikit-learn library was used in all cases.

Auxiliary libraries:

  • DEAP: Allows the rapid construction and prototyping of genetic optimization algorithms.

  • Simanneal: Tools for building simulated annealing algorithms.

  • Pandas: Data manipulation and management.

  • Seaborn: Data visualization.

  • Matplotlib: Data visualization.

  • Tabulate: Tabulation of data.

  • Scipy: Calculation of statistics. It builds on NumPy.

  • Numpy: Numerical computing. NumPy is a library for the Python programming language, adding support for large, multidimensional arrays and matrices, along with a large collection of high-level mathematical functions for operating on these arrays.

  • Imblearn: Data balancing for the MLP with SMOTE. Imbalanced-learn offers resampling techniques.

2.2 Dataset details

This study includes patients diagnosed with COVID-19 according to the World Health Organization (WHO) guidelines for the clinical management of severe acute respiratory infections due to SARS-CoV-2,24 who were admitted to the Intensive Care Unit (ICU) at the “Dr. Desiderio G. Rosado Carbajal” general hospital in the time period between April 1st and July 31st, 2021 in Comalcalco Tabasco, Mexico. The ethical and legal guidelines considered for sampling are described in De la Cruz-Cano et al.25

The dataset has 138 entries that correspond to patients and 60 columns that represent different features. The features encompass a wide range of data, including demographic information, laboratory test results, vital signs, symptoms, pre-existing conditions, and the severity of the COVID-19 illness divided into three classes that are considered mild, moderate and severe.

2.3 Dataset characterization

Next, the study population contained in the dataset used is characterized. In the dataset, there is a greater number of male patients (84) compared to female patients (54). It is also observed that the average age of the patients is approximately 59.9 years, with an age range that goes from 41 to 82 years, and a median of 58.

The severity of COVID-19 shows a significant proportion of patients (44.93%) with moderate disease (count of 62, proportion 0.4493), while mild and severe cases constitute 20.29% (count of 28, proportion 0.2029) and 34.78% (count of 48, proportion 0.3478) of the total, respectively. Based on these data, the dataset is moderately unbalanced. Regarding comorbidities, presented in Table 1, the most prevalent is heart disease with 97.10%, followed by Chronic Kidney Disease (CKD) with 94.20%. On the other hand, the least prevalent is obesity, present in only 34.06% of patients.

Table 1. Statistics of comorbidities.

Comorbidity   | Mean   | Min | Max
Hypertension  | 0.6087 | 0.0 | 1.0
T2DM          | 0.3913 | 0.0 | 1.0
Dyslipidemia  | 0.5507 | 0.0 | 1.0
CKD           | 0.9420 | 0.0 | 1.0
Heart_disease | 0.9710 | 0.0 | 1.0
COPD          | 0.9130 | 0.0 | 1.0
Obesity       | 0.3406 | 0.0 | 1.0
Malignancy    | 0.9928 | 0.0 | 1.0

Table 2 shows that the most common symptom among patients is Diarrhea, present in 94.93% of cases, while the least common symptom is Fever, reported only in 8.70% of patients.

Table 2. Statistics of symptoms.

Symptom     | Mean   | Min | Max
Fever       | 0.0870 | 0.0 | 1.0
Cough       | 0.1594 | 0.0 | 1.0
Sore_throat | 0.3986 | 0.0 | 1.0
Myalgia     | 0.7101 | 0.0 | 1.0
Headache    | 0.4783 | 0.0 | 1.0
Diarrhea    | 0.9493 | 0.0 | 1.0

2.4.1 Experimental design

The research strategy focused on two primary objectives. The first objective was to identify the most effective algorithm for analyzing the dataset and to conduct a correlation analysis of the variables to provide a suitable context; this process is illustrated in Figure 1. The second objective was to conduct a thorough analysis of various optimization methods to identify the key variables for a simplified model. This model should suffer at most a minimal reduction in its predictive capabilities while providing insight into the importance of each feature. The methodology used to achieve these objectives is explained in detail below:


Figure 1. Experimental process and algorithms used to predict the severity of COVID-19.

2.4.2 Data processing

The dataset was processed to contain a subset of 40 feature columns. The subset used in this work was generated from the first set by removing several features not directly related to the biomarkers. All variables were normalized, putting them on the same scale to facilitate analysis. The exclusion of certain variables is justified due to the analysis's focus on biomarkers and their importance in detecting the severity of COVID-19, which are features that can be objectively measured and that provide an assessment of a health condition.
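The preprocessing step described above can be sketched as follows. The column names and values are illustrative placeholders, not the study's actual schema, and since the paper does not state which scaler was used, min-max normalization is assumed here.

```python
# Sketch of the preprocessing step: keep biomarker columns and normalize them
# to a common scale. Column names and values are illustrative placeholders.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "ferritine": [529.0, 1166.0, 1810.0],
    "sat_O2": [65.0, 80.4, 91.0],
    "COVID19_Severity": [0, 1, 2],   # target: mild / moderate / severe
})

features = df.drop(columns=["COVID19_Severity"])
scaler = MinMaxScaler()              # assumed scaler: puts each variable on [0, 1]
X = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)
```

After this step every retained biomarker lies on the same [0, 1] scale, which is what the correlation and weight analyses later in the paper rely on.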

2.4.3 Evaluation of algorithms and metrics

Five classification algorithms were evaluated and compared: Random Forest, Support Vector Machine, Logistic Regression, Decision Tree and Neural Network (Multi Layer Perceptron or MLP). To compare the performance of the models, evaluation metrics such as precision, recall, F1 score, and accuracy were used. The test used to discern the robustness of the methods was stratified five-fold cross-validation. During the subsequent variable reduction experiments, an average of these metrics was employed as the objective function for the metaheuristic search algorithms.
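A minimal sketch of this comparison protocol, using scikit-learn with synthetic stand-in data (the clinical dataset is not reproduced here) and the same 80-20 split, stratified 5-fold CV, and random_state=42 convention described in the paper:

```python
# Sketch: five classifiers, an 80-20 split, and stratified 5-fold CV.
# The data is synthetic; hyperparameters are scikit-learn defaults.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

X, y = make_classification(n_samples=138, n_features=40, n_informative=8,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(class_weight="balanced", random_state=42),
    "SVM": SVC(class_weight="balanced", random_state=42),
    "Logistic Regression": LogisticRegression(class_weight="balanced",
                                              max_iter=1000, random_state=42),
    "Decision Tree": DecisionTreeClassifier(class_weight="balanced", random_state=42),
    # MLPClassifier has no class_weight option; the paper balances it with SMOTE instead.
    "Neural Network (MLP)": MLPClassifier(max_iter=1000, random_state=42),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    results[name] = {
        "precision": precision_score(y_te, y_pred, average="weighted", zero_division=0),
        "recall": recall_score(y_te, y_pred, average="weighted", zero_division=0),
        "f1": f1_score(y_te, y_pred, average="weighted", zero_division=0),
        "accuracy": accuracy_score(y_te, y_pred),
        "cv_score": cross_val_score(model, X, y, cv=cv).mean(),
    }
```

On the real dataset these per-model dictionaries correspond to the rows of Table 3.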

2.4.4 Model selection

The selection of the optimal model depends on the analysis of variables and the variable reduction analysis, which provide indications of the importance of the biomarkers. Given the imbalance in the dataset, the models selected for these analyses will be those that are balanced by weights (or by SMOTE in the case of the MLP). This ensures that the uneven distribution of the data is taken into account and that more reliable and representative results are obtained.
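The weight-balancing choice can be illustrated with scikit-learn's compute_class_weight, using the class counts reported in Section 2.3 (mild=28, moderate=62, severe=48); the SMOTE alternative for the MLP is only noted in a comment to keep the sketch dependency-light.

```python
# Sketch of weight balancing: "balanced" weights are n_samples / (n_classes * count_c),
# so rarer classes weigh more. The MLP would instead be balanced by resampling
# with SMOTE from imbalanced-learn, not shown here.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts from the dataset characterization: mild=28, moderate=62, severe=48.
y = np.array([0] * 28 + [1] * 62 + [2] * 48)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1, 2]), y=y)
```

With these counts, the mild class (the rarest) receives the largest weight and the moderate class (the most common) the smallest, which is exactly what class_weight="balanced" does inside each classifier.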

2.4.5 Correlation analysis

Through correlation analysis, the variables (biological markers) associated with the severity of COVID-19 (COVID19_Severity) were identified. The correlated variables or biological markers are represented in a Table and a heat map of correlations.
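A sketch of this analysis, assuming Pearson correlation via scipy.stats.pearsonr; the significance labels use cutoffs inferred from the labels in Table 5 (p < 0.001, < 0.01, < 0.05), which are an assumption, and the data is illustrative.

```python
# Sketch: Pearson r and p-value of each biomarker against the severity label,
# labelled by significance. Values are illustrative, not the study's data.
import pandas as pd
from scipy.stats import pearsonr

df = pd.DataFrame({
    "ferritine": [300, 900, 1500, 400, 1200, 1700],
    "sat_O2": [92, 80, 66, 90, 75, 65],
    "COVID19_Severity": [0, 1, 2, 0, 1, 2],
})

rows = []
for col in df.columns.drop("COVID19_Severity"):
    r, p = pearsonr(df[col], df["COVID19_Severity"])
    label = ("Very significant" if p < 0.001 else
             "Significant" if p < 0.01 else
             "Slightly significant" if p < 0.05 else
             "Not significant")
    rows.append({"feature": col, "r": round(r, 4), "p": p, "significance": label})

# Order features by absolute correlation, as in Table 5.
corr_table = pd.DataFrame(rows).sort_values("r", key=abs, ascending=False)
```

The same table, computed on the full dataset, is what is rendered as Table 5 and as the heat maps in Figures 2 and 3.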

2.4.6 Variable reduction

Heuristic experiments were carried out to progressively reduce the variables, and the model performance was evaluated for each set of selected variables to obtain a model with a reduced number of variables without significantly compromising accuracy. In addition, variable selection experiments were performed using the following algorithms: genetic, simulated annealing, and crow search. These methods introduced variety into the models; their parameters were chosen so that the algorithms ran for comparable durations.
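A minimal, dependency-free sketch of the simulated annealing search over feature subsets. The study's actual objective function averaged the model's metrics; it is replaced here by a hypothetical toy score (rewarding a known "useful" subset and penalizing extra features) so the example stays self-contained and fast.

```python
# Sketch of simulated annealing over binary feature masks. The toy score below
# is a hypothetical stand-in for the averaged cross-validated model metrics.
import math
import random

random.seed(42)
N_FEATURES = 10
TARGET = {0, 3, 7}                       # hypothetical "useful" feature indices

def score(mask):
    chosen = {i for i, bit in enumerate(mask) if bit}
    # Reward overlap with the target set, penalize extra features (parsimony).
    return len(chosen & TARGET) - 0.2 * len(chosen - TARGET)

state = [random.randint(0, 1) for _ in range(N_FEATURES)]
best, best_score = state[:], score(state)
T = 1.0                                   # initial temperature
for step in range(2000):
    neighbor = state[:]
    neighbor[random.randrange(N_FEATURES)] ^= 1   # flip one random bit
    delta = score(neighbor) - score(state)
    # Always accept improvements; accept worsenings with probability exp(delta/T).
    if delta >= 0 or random.random() < math.exp(delta / T):
        state = neighbor
    if score(state) > best_score:
        best, best_score = state[:], score(state)
    T *= 0.995                            # geometric cooling schedule
```

As the temperature cools, the search shifts from broad exploration to local refinement, which is the property that makes annealing useful for these multidimensional subset searches.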

2.4.7 Overfitting analysis

To evaluate the possibility of overfitting in the generated models, difference tests between the training and validation sets were implemented through learning curves. The curves, produced with the learning_curve() function from the sklearn library, illustrate the performance of the model as the size of the training set varies. A threshold of 0.05 (or 5%) was set for the difference between training and validation scores; if the difference exceeded this threshold, the model was considered overfitted. This methodology facilitated the identification of feature sets that provided an optimal balance between accuracy and generalization.
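The overfitting check above can be sketched with sklearn's learning_curve() on synthetic stand-in data; the 0.05 gap threshold is the one stated in the text.

```python
# Sketch of the overfitting check: compare train vs validation scores from a
# learning curve and flag the model if the final gap exceeds 0.05 (5%).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve, StratifiedKFold

X, y = make_classification(n_samples=138, n_features=40, n_informative=8,
                           n_classes=3, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    X, y, cv=cv, train_sizes=np.linspace(0.2, 1.0, 5))

# Gap between mean train and mean validation score at the largest training size.
gap = train_scores.mean(axis=1)[-1] - val_scores.mean(axis=1)[-1]
overfitted = gap > 0.05   # the paper's 5% train/validation threshold
```

Plotting the two mean-score curves against `sizes` (e.g. with Matplotlib) reproduces the learning-curve figures the methodology describes.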

3. Results

In this section, the results obtained in the study of predicting the severity of COVID-19 cases using different classification algorithms are presented. The algorithms evaluated include Random Forest, Support Vector Machine, Logistic Regression, Decision Tree and Neural Network (MLP). To evaluate the performance of each algorithm, the following metrics were used: precision, recall, F1-score and CV-Score (cross-validation).

3.1 Algorithm performance

The results in Table 3 provide a comparative view of the performance of machine learning algorithms based on various metrics.

Table 3. Basic metrics of algorithms sorted by best cross-validation (best algorithm shown in bold).

Model                | CV Score | ROC AUC | Accuracy | Recall | Precision | F1 Score
Random Forest        | 0.9727   | 1.000   | 1.0000   | 1.0000 | 1.0       | 1.0000
Logistic Regression  | 0.9636   | 0.9834  | 0.9643   | 0.9722 | 1.0       | 0.9649
SVM                  | 0.9545   | 0.9978  | 0.9643   | 0.9722 | 1.0       | 0.9649
Neural Network (MLP) | 0.9467   | 0.9955  | 0.9286   | 0.9389 | 1.0       | 0.9290
Decision Tree        | 0.9364   | 0.9688  | 0.9643   | 0.9667 | 1.0       | 0.9641

This Table shows the basic metrics of the algorithms. It can be seen that all algorithms exhibit solid performance, with the precision, recall, F1-score, and ROC AUC of each algorithm all exceeding 0.9. Random Forest stands out as the best algorithm, reaching the maximum value of 1.0000 in all metrics, with the exception of the cross-validation score, which is 0.9727. It is worth noting that Gradient Boosting (including XGBoost) and k Nearest Neighbors (kNN) do not natively support class weighting; they were therefore excluded from the comparison so that all models could be evaluated under equal conditions. Likewise, all models were initialized with random_state=42.

Table 4 shows the accuracy, recall and F1 scores for each class of each algorithm. Random Forest achieves scores of 1.0000 in all classes, but other algorithms, such as SVM and Logistic Regression, also show high performance, reaching 1.0000 in several metrics. These results confirm Random Forest as the best algorithm, and it is subsequently used for the variable reduction experiments and the other comparisons.

Table 4. Accuracy, Recall and F1 scores for each Class (metrics for the best model are shown in bold).

Model                | Class    | Accuracy score | Recall score | F1 Score
Random Forest        | Mild     | 1.0000         | 1.0000       | 1.0000
Random Forest        | Moderate | 1.0000         | 1.0000       | 1.0000
Random Forest        | Severe   | 1.0000         | 1.0000       | 1.0000
Logistic Regression  | Mild     | 0.8571         | 1.0000       | 0.9231
Logistic Regression  | Moderate | 1.0000         | 0.9167       | 0.9565
Logistic Regression  | Severe   | 1.0000         | 1.0000       | 1.0000
SVM                  | Mild     | 0.8571         | 1.0000       | 0.9231
SVM                  | Moderate | 1.0000         | 0.9167       | 0.9565
SVM                  | Severe   | 1.0000         | 1.0000       | 1.0000
Neural Network (MLP) | Mild     | 0.8571         | 1.0000       | 0.9231
Neural Network (MLP) | Moderate | 0.9167         | 0.9167       | 0.9167
Neural Network (MLP) | Severe   | 1.0000         | 0.9000       | 0.9474
Decision Tree        | Mild     | 1.0000         | 1.0000       | 1.0000
Decision Tree        | Moderate | 0.9231         | 1.0000       | 0.9600
Decision Tree        | Severe   | 1.0000         | 0.9000       | 0.9474

3.2 Correlation analysis

The characterization analysis of the variables contained in the subset of data under study and the correlation analysis are presented below. The features are ordered by correlation. Table 5 contains the averages and ranges of all features, together with the correlation value and the Pearson p-value with an interpretation, to provide context for the dataset.

Table 5. Correlation, p-value and significance of biomarkers with the severity of COVID-19.

Feature           | ALL (n=138)        | Correlation | p-value   | Significance
ferritine         | 1166 (529-1810)    | 0.8277      | 6.148e-36 | Very significant
sat. O2           | 80.4 (65-91)       | -0.6444     | 1.490e-17 | Very significant
Fibrinogen        | 544 (222-891)      | 0.5356      | 1.301e-11 | Very significant
DDimer            | 1020 (166-2498)    | 0.4731      | 4.631e-09 | Very significant
Glucose           | 169 (82-502)       | 0.4632      | 1.063e-08 | Very significant
Respiratory_Rate  | 28.59 (26-33)      | 0.4402      | 6.582e-08 | Very significant
Procalcitonine    | 1.047 (0.04-14.52) | 0.3855      | 3.019e-06 | Very significant
IL6               | 183 (48-295)       | 0.3435      | 3.716e-05 | Very significant
Na                | 137.8 (127-166)    | -0.3020     | 0.0003178 | Very significant
Creatinine        | 0.9928 (0.3-6.4)   | -0.2815     | 0.0008228 | Very significant
Urea              | 39.53 (26.9-130.5) | -0.2720     | 0.001251  | Significant
Cl                | 101.7 (91-128)     | -0.2627     | 0.001853  | Significant
LDH               | 479 (215-948)      | 0.2387      | 0.004817  | Significant
ALT               | 51.1 (11-293)      | -0.2345     | -0.2345   | Significant
CRP               | 267 (69-426)       | 0.2166      | 0.01073   | Slightly significant
AST               | 69.27 (13-365)     | -0.2116     | 0.01274   | Slightly significant
Heart_Rate        | 87.53 (76-110)     | 0.2097      | 0.01359   | Slightly significant
Albumin           | 3.626 (2.4-4.9)    | -0.1807     | 0.03396   | Slightly significant
Hemoglobin        | 13.42 (6.1-16.9)   | 0.1685      | 0.04824   | Slightly significant
INR               | 1.077 (0.92-1.76)  | -0.1563     | 0.06709   | Not significant
PT                | 13.32 (11.5-21.3)  | -0.1549     | 0.06959   | Not significant
APTT              | 36.45 (24.1-67)    | 0.1465      | 0.08634   | Not significant
Uric_acid         | 5.302 (3-6.8)      | 0.1447      | 0.09038   | Not significant
K                 | 4.628 (3.1-7.9)    | 0.1180      | 0.1681    | Not significant
MgSerico          | 2.03 (1.5-3.6)     | 0.1123      | 0.1897    | Not significant
Fosforo           | 3.677 (2.8-6.9)    | 0.1090      | 0.2033    | Not significant
Eosinophils       | 0.02246 (0-0.5)    | -0.1071     | 0.2112    | Not significant
Age               | 59.9 (41-82)       | -0.09790    | 0.2533    | Not significant
Lymphocyte        | 0.992 (0.3-2.2)    | 0.09082     | 0.2894    | Not significant
Total_Cholesterol | 202 (100-269)      | 0.06639     | 0.4391    | Not significant
Triglycerides     | 156 (77-342)       | 0.04880     | 0.5697    | Not significant
Temp °C           | 38.98 (37-40.1)    | 0.04733     | 0.5814    | Not significant
Eosinophils       | 0.442 (0-8)        | -0.04220    | 0.6231    | Not significant
Neutrophils       | 80.95 (40.5-94.8)  | 0.03510     | 0.6827    | Not significant
Lymphocyte        | 11.88 (3-43)       | -0.03271    | 0.7033    | Not significant
CaTotal           | 8.53 (7.1-9.2)     | 0.01871     | 0.8276    | Not significant
WBC               | 10.53 (1.7-24.9)   | 0.01709     | 0.8423    | Not significant
Neutrophils       | 8.852 (0.91-20.04) | -0.01134    | 0.8950    | Not significant

Additionally, a bar graph of these correlations is shown in Figure 2.


Figure 2. Correlation Bar Chart (The first chart presents correlations and the second presents absolute correlations).

In the original heat map, correlation clusters characterized by variables with magnitudes greater than or close to 0.5 can be distinguished. These magnitudes suggest a potential impact on the subsequent feature selection phase. In particular, in the upper left corner of the map, a grouping stands out in which variables such as ferritin and oxygen saturation present an absolute correlation greater than 0.5 with each other; close by, fibrinogen shows a correlation of 0.47. Likewise, in the lower corner of the map, a strong correlation is observed between the variables Na, Creatinine, Urea and Cl. In relation to these, the variables ALT and AST also stand out.

Figure 3 presents a refined version of this heat map, in which significant correlations with the target variable (> 0.45) are evident. It can be deduced from this heat map that these correlated variables are, with high probability, the ones that will be selected in the variable reduction procedures.


Figure 3. Heat map of significant correlations (>0.45) between variables related to the target variable (correlations with magnitude greater than 0.5 are highlighted in red and bold).

3.3 Feature Selection

Four different approaches were used for feature selection: heuristic, genetic algorithm, simulated annealing algorithm and crow search algorithm. The two best models for each approach were identified (ranked based on cross-validation and number of variables). In the case of the heuristic approach, 3 and 4 variables were considered for the models. Once these models were found, it was decided that for metaheuristic approaches, the models would be encouraged to produce 3- and 4-feature solutions. The metaheuristic search algorithms were run ten times each and the two best models in each case were evaluated and compared at the end.

3.4 Heuristic Method

Several progressively generated models were trained by adding the features that were found to have statistical significance during the correlation analysis. Below is a brief description of them for context.

  • Ferritin: Protein that stores iron in the body.

  • Oxygen Saturation (Sat.O2): Percentage of hemoglobin that carries oxygen.

  • Fibrinogen: Protein that helps in blood clotting.

  • D-Dimer (DDimer): Fibrin degradation product in coagulation.

  • Glucose: Blood sugar level.

  • Respiratory Rate: Number of breaths per minute.

  • Procalcitonin: Marker of inflammation and immune response.

  • Interleukin 6 (IL6): Protein involved in the inflammatory response.

  • Sodium (Na): Essential electrolyte for fluid balance.

  • Creatinine: Muscle waste product eliminated by the kidneys.

  • Urea: Waste that is formed when proteins are metabolized.

  • Chloride (Cl): Chloride level in the blood.

  • LDH: Enzyme that indicates tissue damage.

  • ALT: Enzyme indicating liver damage.

  • CRP: Protein that indicates inflammation in the body.

  • AST: Enzyme indicating liver damage.

  • Heart Rate: Number of heartbeats per minute.

  • Albumin: Protein produced by the liver and found in blood plasma.

  • Hemoglobin: Protein in red blood cells that transports oxygen.

As mentioned before, a limited subset of the variables in this list (see the heatmap in Figure 3) is expected to appear during variable selection. The models were trained progressively and the graphical results can be seen in Figure 4.


Figure 4. Vertical lines mark the first appearance of a model with the best F1-Score (green) or the best cross-validation accuracy (magenta).

The values of the models represented in said graph are found in Table 6.

Table 6. Model metric results by adding statistically significant features and complete model (models that present some local improvement are shown in bold).

Added_feature    | Accuracy | Recall | F1     | ROC_AUC | CV_Accuracy
Ferritine        | 0.8929   | 0.8929 | 0.8922 | 0.9833  | 0.9000
Sat.O2           | 0.9643   | 0.9643 | 0.9649 | 1.0000  | 0.9636
Fibrinogen       | 0.9643   | 0.9643 | 0.9649 | 1.0000  | 0.9273
DDimer           | 0.9643   | 0.9643 | 0.9649 | 1.0000  | 0.9273
Glucose          | 0.9643   | 0.9643 | 0.9649 | 1.0000  | 0.9455
Respiratory_Rate | 0.9643   | 0.9643 | 0.9649 | 1.0000  | 0.9545
Procalcitonine   | 1.0000   | 1.0000 | 1.0000 | 1.0000  | 0.9545
IL6              | 0.9643   | 0.9643 | 0.9649 | 1.0000  | 0.9455
Na               | 0.9643   | 0.9643 | 0.9649 | 1.0000  | 0.9364
Creatinine       | 0.9643   | 0.9643 | 0.9649 | 1.0000  | 0.9636
urea             | 0.9643   | 0.9643 | 0.9649 | 1.0000  | 0.9636
Cl               | 1.0000   | 1.0000 | 1.0000 | 1.0000  | 0.9636
LDH              | 1.0000   | 1.0000 | 1.0000 | 1.0000  | 0.9636
ALT              | 1.0000   | 1.0000 | 1.0000 | 1.0000  | 0.9636
CRP              | 1.0000   | 1.0000 | 1.0000 | 1.0000  | 0.9545
AST              | 1.0000   | 1.0000 | 1.0000 | 1.0000  | 0.9727
Heart_Rate       | 1.0000   | 1.0000 | 1.0000 | 1.0000  | 0.9636
Albumin          | 1.0000   | 1.0000 | 1.0000 | 1.0000  | 0.9364
Hemoglobin       | 1.0000   | 1.0000 | 1.0000 | 1.0000  | 0.9727
ALL              | 1.0000   | 1.0000 | 1.0000 | 1.0000  | 0.9727

During the training experiment with progressive addition of variables, 39 models were trained. The intermediate models, 20 to 38, have been omitted from Table 6 since their metrics were identical or inferior to subsequent models and the added variables did not show statistical significance during the correlation analysis. From this Table, procalcitonin and AST are highlighted as variables that increase the performance of the models; models were generated and analyzed with the progressive addition of these features. These models are referred to in this research simply as “Model 1” and “Model 2”.

In Table 7, a redistribution of the weights of the variables between Model 1 and Model 2 is observed. The inclusion of the variable AST in Model 2 entails a redistribution of the weights, most notably a reduction in the weight of procalcitonin from 0.2235 to 0.1140. Despite the lower correlation of AST with severity (-0.2116), Model 2 obtains a cross-validation accuracy of 0.9546, compared to 0.9636 for Model 1.

Table 7. Weight of the variables in the outstanding models and their correlation.

Model                     | Ferritin | Sat.O2  | Procalcitonin | AST
Model 1 - Weight on model | 0.4465   | 0.3299  | 0.2235        | -
Model 2 - Weight on model | 0.4849   | 0.3208  | 0.1140        | 0.0825
Correlation with Severity | 0.8277   | -0.6444 | 0.3855        | -0.2116

It is notable that both models achieve perfect accuracy and perfect F1-score, indicating their effectiveness in classification. The main difference between them lies in their performance in cross-validation, which changes by approximately 0.007 when the variable AST is included in Model 2. This suggests that the inclusion of AST may affect the robustness of the model and its generalization ability.

Despite the marginal improvement, the decrease in cross-validation metrics in Model 2 compared to Model 1 suggests that the inclusion of the AST variable could be adding additional complexity without necessarily improving the generalizability of the model. The apparently unequal distribution of weights in the second model compared to the first could be a sign of overfitting.

A joint overview of Tables 7 and 8, as well as the behavior of the metrics between models in Figure 4, suggests that although Models 1 and 2 show excellent metrics, with a precision, F1-score and AUC of 1.0000, this perfect performance could indicate possible overfitting to the training data. For this reason, subsequent evaluations rely on cross-validation as the indicator of performance and generalization (see Figure 6).

Table 8. Comparison of the metrics of the two outstanding models.

Metric | Model 1 | Model 2
Precision | 1.0000 | 1.0000
F1-score | 1.0000 | 1.0000
AUC | 1.0000 | 1.0000
Cross-validation | 0.9636 | 0.9636

3.5 Metaheuristic method 1 (Genetic algorithm)

The algorithm used as its optimization function the average of all the scores, so that it found the model that best satisfied all the metrics simultaneously. The specifications of the genetic algorithm are as follows:

Individual Representation: Binary string of length equal to the number of features.

Fitness Function: Average of precision, recall, accuracy, and F1 score.

Crossover: Two-point crossover.

Mutation: Bit flip mutation with probability 0.05.

Selection: Tournament selection with size 3.

Population Size: 50.

Number of Generations: 20.

Hall of Fame: Best individual through all generations.
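A minimal, self-contained sketch of a genetic algorithm with these specifications (binary individuals, two-point crossover, bit-flip mutation at 0.05, tournament selection of size 3, population 50, 20 generations, and a hall of fame) is shown below. The fitness function here is a hypothetical stand-in: it scores an individual by its agreement with an arbitrary "ideal" feature mask, in place of the averaged model metrics used in the study.

```python
import random

random.seed(0)  # reproducible toy run

N_FEATURES = 10
TARGET = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]  # hypothetical 'ideal' feature subset

def fitness(ind):
    """Stand-in for the averaged precision/recall/accuracy/F1 of a model
    trained on the features selected by this binary string."""
    return sum(1 for a, b in zip(ind, TARGET) if a == b) / N_FEATURES

def tournament(pop, k=3):
    """Tournament selection with size 3."""
    return max(random.sample(pop, k), key=fitness)

def two_point_crossover(a, b):
    i, j = sorted(random.sample(range(N_FEATURES), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def mutate(ind, p=0.05):
    """Bit-flip mutation: each bit flips independently with probability p."""
    return [bit ^ 1 if random.random() < p else bit for bit in ind]

pop = [[random.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(50)]
hall_of_fame = max(pop, key=fitness)  # best individual seen so far

for _ in range(20):  # generations
    nxt = []
    while len(nxt) < len(pop):
        c1, c2 = two_point_crossover(tournament(pop), tournament(pop))
        nxt += [mutate(c1), mutate(c2)]
    pop = nxt
    champion = max(pop, key=fitness)
    if fitness(champion) > fitness(hall_of_fame):
        hall_of_fame = champion
```

The hall of fame retains the best subset even if later generations drift away from it, which is why it, rather than the final population, is reported.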

The performance of the best models found with the metaheuristic method is shown below in Table 9.

Table 9. Summary of the results of the models found by the genetic algorithm.

CV Score | Accuracy | Recall | F1 Score | Features
0.9727 | 1.000 | 1.000 | 1.000 | ferritine, urea, Cl
0.9727 | 1.000 | 1.000 | 1.000 | ferritine, sat. O2, creatinine, ALT
0.9727 | 1.000 | 1.000 | 1.000 | ferritine, sat. O2, Glucose, procalcitonine
0.9545 | 1.000 | 1.000 | 1.000 | ferritine, sat. O2, procalcitonine, urea
0.9545 | 1.000 | 1.000 | 1.000 | sat. O2, Respiratory_Rate, creatinine, Cl
0.9273 | 1.000 | 1.000 | 1.000 | ferritine, DDimer, procalcitonine, Cl
0.9273 | 0.9643 | 0.9643 | 0.9644 | Respiratory_Rate, Na, creatinine
0.9182 | 1.000 | 1.000 | 1.000 | ferritine, procalcitonine, CRP
0.8818 | 1.000 | 1.000 | 1.000 | ferritine, IL6, ALT
0.8273 | 1.000 | 1.000 | 1.000 | Respiratory_Rate, IL6, Cl

3.6 Metaheuristic method 2 (Simulated annealing algorithm)

This algorithm used an objective function similar to that of the previous algorithm. The specifications of the simulated annealing algorithm are as follows:

Individual Representation: Binary string of length equal to the number of features.

Energy Function: Negative of the average F1 score obtained through cross-validation.

Movement: Selects and deselects features randomly until the total of selected features equals 4.

State Initialization: Random binary string with exactly 4 selected features.

Initial temperature: 10.0.

Final Temperature: 0.01.

Number of Iterations: 150.
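These specifications can be illustrated with a compact simulated-annealing loop. The sketch below is an assumption-laden toy: `IDEAL` and the energy function stand in for the negative cross-validated F1 of a model on the selected 4-feature subset, and geometric cooling is assumed to take the temperature from 10.0 to 0.01 over 150 iterations.

```python
import math
import random

random.seed(1)  # reproducible toy run

N_FEATURES = 12
IDEAL = {0, 3, 5, 9}  # hypothetical best 4-feature subset

def energy(state):
    """Stand-in energy: negative overlap with IDEAL. In the study the
    energy was the negative mean F1 score from cross-validation."""
    return -len(IDEAL & state) / 4.0

def move(state):
    """Swap one selected feature for an unselected one, keeping |state| == 4."""
    new = set(state)
    new.remove(random.choice(sorted(new)))
    new.add(random.choice([f for f in range(N_FEATURES) if f not in new]))
    return new

state = set(random.sample(range(N_FEATURES), 4))  # random init, exactly 4 features
best = state
temp, temp_final, n_iter = 10.0, 0.01, 150
cooling = (temp_final / temp) ** (1.0 / n_iter)  # geometric schedule

for _ in range(n_iter):
    candidate = move(state)
    delta = energy(candidate) - energy(state)
    # Always accept improvements; accept worse moves with Boltzmann probability.
    if delta < 0 or random.random() < math.exp(-delta / temp):
        state = candidate
    if energy(state) < energy(best):
        best = state
    temp *= cooling
```

At high temperature the loop explores broadly; as the temperature decays it behaves increasingly like greedy local search around the best subsets found.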

The performance of the best models found with metaheuristic method 2 is shown below in Table 10.

Table 10. Summary of results from simulated annealing iterations.

CV Score | Accuracy | Recall | F1 Score | Features
0.9727 | 1.0000 | 1.0000 | 1.0000 | ferritine, sat. O2, LDH, ALT
0.9636 | 1.0000 | 1.0000 | 1.0000 | ferritine, sat. O2, Fibrinogen, ALT
0.9636 | 1.0000 | 1.0000 | 1.0000 | ferritine, sat. O2, Fibrinogen, ALT
0.9636 | 1.0000 | 1.0000 | 1.0000 | ferritine, sat. O2, Fibrinogen, ALT
0.9545 | 1.0000 | 1.0000 | 1.0000 | ferritine, sat. O2, Fibrinogen, urea
0.9545 | 1.0000 | 1.0000 | 1.0000 | ferritine, sat. O2, urea, LDH
0.9545 | 1.0000 | 1.0000 | 1.0000 | ferritine, sat. O2, procalcitonine, urea
0.9364 | 1.0000 | 1.0000 | 1.0000 | ferritine, sat. O2, creatinine, LDH
0.9364 | 0.9643 | 0.9643 | 0.9649 | ferritine, sat. O2, Fibrinogen, Cl
0.9273 | 0.9643 | 0.9643 | 0.9649 | ferritine, Respiratory_Rate, procalcitonine, LDH

3.7 Metaheuristic method 3 (Crow search algorithm)

This algorithm used an objective function similar to those of the previous algorithms. The specifications of the crow search algorithm are as follows:

Individual Representation: Binary string of length equal to the number of features.

Fitness Function: Average of the cross-validation score and the F1 score.

Crow Movement: If a random draw falls below the awareness probability (0.2), the crow moves towards the target; otherwise it moves randomly.

Mutation: Bit flip with probability 0.2 (the awareness probability).

Penalty: Penalizes solutions with less than 3 features or 5 or more features.

Population Size: 50.

Number of Iterations: 30.

Hall of Fame: Best individual across all iterations.
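A minimal sketch of a crow-search-style loop with these settings follows. As with the other sketches, the fitness is a hypothetical stand-in (overlap with an arbitrary `IDEAL` subset replaces the averaged CV and F1 scores), and the size penalty mirrors the one described above.

```python
import random

random.seed(2)  # reproducible toy run

N_FEATURES = 10
AWARENESS = 0.2
IDEAL = {1, 4, 7}  # hypothetical best subset

def fitness(ind):
    """Stand-in for the average of CV score and F1, penalizing solutions
    with fewer than 3 or 5+ selected features."""
    selected = {i for i, bit in enumerate(ind) if bit}
    score = len(IDEAL & selected) / 3.0
    if len(selected) < 3 or len(selected) >= 5:
        score -= 0.5
    return score

def follow(ind, target):
    """Move towards the target crow by copying a random segment of its bits."""
    i, j = sorted(random.sample(range(N_FEATURES), 2))
    return ind[:i] + target[i:j] + ind[j:]

def random_move(ind, p=AWARENESS):
    """Random move: flip each bit with the awareness probability."""
    return [bit ^ 1 if random.random() < p else bit for bit in ind]

pop = [[random.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(50)]
hall_of_fame = max(pop, key=fitness)

for _ in range(30):  # iterations
    target = max(pop, key=fitness)
    pop = [follow(crow, target) if random.random() < AWARENESS else random_move(crow)
           for crow in pop]
    champion = max(pop, key=fitness)
    if fitness(champion) > fitness(hall_of_fame):
        hall_of_fame = champion
```

The explicit size penalty is what steers the search towards subsets of 3 to 4 features, matching the reduced models reported in Table 11.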

The performance of the best models found with metaheuristic method 3 is shown below in Table 11.

Table 11. Summary of the results of the models found by the crow search algorithm (the best 2 models are highlighted in bold).

CV Score | Accuracy | Recall | F1 Score | Features
0.9545 | 1.0000 | 1.0000 | 1.0000 | ferritine, sat. O2, urea, LDH
0.9455 | 1.0000 | 1.0000 | 1.0000 | ferritine, sat. O2, Na, LDH
0.9364 | 0.9286 | 0.9286 | 0.9274 | ferritine, DDimer, Respiratory_Rate, procalcitonine
0.9364 | 0.9286 | 0.9286 | 0.9304 | ferritine, sat. O2, DDimer, Na
0.9273 | 1.0000 | 1.0000 | 1.0000 | ferritine, DDimer, creatinine, urea
0.9273 | 1.0000 | 1.0000 | 1.0000 | ferritine, DDimer, creatinine, urea
0.9273 | 1.0000 | 1.0000 | 1.0000 | ferritine, DDimer, procalcitonine, Cl
0.9182 | 0.9643 | 0.9643 | 0.9641 | ferritine, DDimer, Glucose, urea
0.9091 | 0.9643 | 0.9643 | 0.9649 | ferritine, DDimer, procalcitonine
0.9091 | 1.0000 | 1.0000 | 1.0000 | ferritine, DDimer, urea

3.8 Comparison of reduced models

Tables 12 and 13 show the effectiveness and weighting of the features of the models under study (AUC and Specificity values are omitted as they are identical in all cases and equal to 1.0000).

Table 12. Feature selection results and model performance.

Features | CV Score | F1 Score | Accuracy | Recall | Selection Method
ferritine, urea, Cl | 0.9727 | 1.0000 | 1.0000 | 1.0000 | Genetic algorithm
ferritine, sat. O2, creatinine, ALT | 0.9727 | 1.0000 | 1.0000 | 1.0000 | Genetic algorithm
ferritine, sat. O2, LDH, ALT | 0.9727 | 1.0000 | 1.0000 | 1.0000 | Simulated Annealing
ferritine, sat. O2, procalcitonine | 0.9636 | 1.0000 | 1.0000 | 1.0000 | Heuristic
ferritine, sat. O2, procalcitonine, AST | 0.9636 | 1.0000 | 1.0000 | 1.0000 | Heuristic
ferritine, sat. O2, Fibrinogen, ALT | 0.9636 | 1.0000 | 1.0000 | 1.0000 | Simulated Annealing
ferritine, sat. O2, urea, LDH | 0.9545 | 1.0000 | 1.0000 | 1.0000 | Crow Search
ferritine, sat. O2, Na, LDH | 0.9455 | 1.0000 | 1.0000 | 1.0000 | Crow Search

Table 13. Performance and feature weights of the models (F1-Score, Accuracy and Recall omitted due to perfect scoring). The best models are highlighted in bold.

Model | Characteristics | Weights | CV Score
M3 | ferritine, urea, Cl | 0.5650, 0.2660, 0.1690 | 0.9727
M4 | ferritine, sat. O2, creatinine, ALT | 0.4826, 0.3323, 0.07377, 0.1113 | 0.9727
M5 | ferritine, sat. O2, LDH, ALT | 0.4920, 0.3334, 0.06770, 0.1069 | 0.9727
M1 | ferritine, sat. O2, procalcitonine | 0.4465, 0.3300, 0.2235 | 0.9636
M2 | ferritine, sat. O2, procalcitonine, AST | 0.4840, 0.3069, 0.0901, 0.1191 | 0.9636
M6 | ferritine, sat. O2, Fibrinogen, ALT | 0.4668, 0.3250, 0.1175, 0.09072 | 0.9636
M7 | ferritine, sat. O2, urea, LDH | 0.4781, 0.3426, 0.1019, 0.07732 | 0.9545
M8 | ferritine, sat. O2, Na, LDH | 0.4780, 0.3598, 0.09156, 0.07058 | 0.9455

Certain clear patterns were observed in feature selection and its impact on model performance.

In Table 13, models M3, M4, and M5 share a cross-validation (CV) score of 0.9727, with ferritin weights of 0.5650 (M3), 0.4826 (M4) and 0.4920 (M5), respectively. This corroborates the importance of the variable in predicting the severity of COVID-19. The ferritin feature appears in every model, and its weight is considerable in those with the best CV scores (M3, M4, and M5), reinforcing its relevance in prediction. The variable sat. O2 is also selected consistently across almost all models, suggesting its predictive value, although its weight varies between models, which could unbalance the performance of each model.

Features such as urea, creatinine, LDH, ALT and procalcitonin are selected in certain models, with variations in their weights. This could suggest that their importance fluctuates depending on the context of the other features in the model. Although all models achieve high accuracy, the CV scores vary, suggesting differences in their ability to generalize to new data. Models M3, M4, and M5 stand out, tied with a CV score of 0.9727, an indicator of better generalization ability than the rest of the models in the list.

On the other hand, models M7 and M8 have the lowest CV scores (0.9545 and 0.9455, respectively), which could indicate lower robustness in generalization or even overfitting (see Figure 5 in the overfitting analysis). These results can be related to the weights assigned to their features and to the selection method used. In model M4, the additional features creatinine and ALT are included with weights of 0.07377 and 0.1113, respectively. In contrast, model M3 uses urea and Cl with weights of 0.2660 and 0.1690. Despite these differences, the CV score remains the same, at 0.9727, for both models.


Figure 5. Cross-validation is performed for each model, along with the methods employed for variable selection.

Any model suspected of overfitting is represented with a grid.

Model M5, which also has a CV score of 0.9727, includes LDH and ALT with weights of 0.06770 and 0.1069, respectively, with LDH taking the place of the creatinine found in model M4. Despite this change, the weights assigned to ferritin and O2 saturation are very similar to those of model M4; only a small weight difference of 0.00607 appears when creatinine is exchanged for LDH. Models M1, M2 and M6 have CV scores of 0.9636. Despite sharing ferritin and O2 saturation with weights comparable to those of models M3, M4 and M5, these models also include procalcitonin, AST and fibrinogen, with weights of 0.2235 (M1), 0.1191 (M2) and 0.1175 (M6), respectively.

Models M7 and M8 have the lowest CV scores (0.9545 and 0.9455) and include urea and Na with weights of 0.1019 and 0.09156, respectively. Although urea appears in the best model (M3), and both urea and Na show statistically significant correlations with the target variable and with other features, the performance of these two models is lower than that of the other models. One possible reason is the interaction with the oxygen saturation variable, which unbalances the weight distribution and could lead to overfitting.

A better qualitative perspective on the comparison and on the importance of the features can be gained from Table 14, which shows the distribution of features across the models. Ferritin is present in all prediction models, owing to its high Pearson correlation with the target variable. After ferritin, oxygen saturation is present in almost all models, except the third one (which, according to Table 13, has a reasonable weight distribution and a high CV score, so it may be possible to predict disease severity without necessarily using oxygen saturation). Both features (ferritin and oxygen saturation) are considered main features because they appear in all or almost all models, an observation that coincides with their calculated correlations with the target feature. The remaining features show a complementary behavior with respect to these most frequent features. It is believed that such "secondary features" may be implicitly selected under diversity and cross-correlation criteria. At first glance, there is no preferential distribution of secondary features, as their frequencies fluctuate between 1 and 3 appearances. The most frequently appearing secondary features are LDH and ALT, whose correlations with the target variable are not necessarily the highest, but whose p-values are deemed significant (see Table 5).

Table 14. Features used by each model and their correlation with severity.

Feature | M1 (Heuristic) | M2 (Heuristic) | M3 (Genetic algorithm) | M4 (Genetic algorithm) | M5 (Simulated Annealing) | M6 (Simulated Annealing) | M7 (Crow Search) | M8 (Crow Search) | Correlation
Ferritin | X | X | X | X | X | X | X | X | 0.8277
Oxygen saturation | X | X |  | X | X | X | X | X | -0.6444
Fibrinogen |  |  |  |  |  | X |  |  | 0.5356
Procalcitonin | X | X |  |  |  |  |  |  | 0.3855
Na |  |  |  |  |  |  |  | X | -0.3020
Creatinine |  |  |  | X |  |  |  |  | -0.2815
Urea |  |  | X |  |  |  | X |  | -0.2720
Chlorine |  |  | X |  |  |  |  |  | -0.2627
LDH |  |  |  |  | X |  | X | X | 0.2387
ALT |  |  |  | X | X | X |  |  | 0.2387
AST |  | X |  |  |  |  |  |  | -0.2116

3.9 Overfitting analysis

An overfitting analysis was performed to evaluate the ability of the Random Forest model to generalize to new data. The overfitting analysis was carried out using two complementary approaches.

Learning curve: Learning curves are generated using the following code:

from sklearn.model_selection import learning_curve

train_sizes, train_scores, test_scores = learning_curve(
    estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)

In this code snippet, the learning_curve() function produces training and validation scores for a range of training set sizes. An overfitted model will show a large discrepancy between training and validation scores, especially for the larger training sets. These curves can be seen in Figure 6 (red represents validation and green represents training).


Figure 6. The validation and training curves appear close and within the threshold in almost all cases except in model 8.

Difference between training and validation scores: The overfitting test is performed using the following code:

if (train_scores_mean[-1] - test_scores_mean[-1]) > 0.05:
    overfitting = "Yes"
else:
    overfitting = "No"

This code snippet calculates the difference between the mean training and validation scores for the largest training set size. If this difference is greater than 0.05, the model is considered to be overfitting.
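The same criterion can be wrapped in a small helper; this is a sketch, with the 0.05 threshold taken from the study and the example scores chosen to mirror rows of Table 15.

```python
def detect_overfitting(train_scores_mean, test_scores_mean, threshold=0.05):
    """Flag overfitting when the gap between mean training and validation
    scores at the largest training-set size exceeds the threshold."""
    gap = train_scores_mean[-1] - test_scores_mean[-1]
    return "Yes" if gap > threshold else "No"

# A model trained to a perfect score but cross-validating at 0.9455
# has a gap of 0.0545 and is flagged; a gap of ~0.018 is not.
flagged = detect_overfitting([0.99, 1.0], [0.97, 0.9455])
clean = detect_overfitting([0.98, 0.990909], [0.96, 0.9727])
```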

The training and cross-validation results that were used to determine if the model is overfitted are presented in Table 15.

Table 15. Results of the training scores and cross-validation of the reduced models.

Characteristics | CV Score | Train Score | Selection Method | Overfitting
ferritine, urea, Cl | 0.9727 | 0.990909 | Genetic algorithm | No
ferritine, sat. O2, creatinine, ALT | 0.9727 | 0.990909 | Genetic algorithm | No
ferritine, sat. O2, LDH, ALT | 0.9727 | 1 | Simulated Annealing | No
ferritine, sat. O2, procalcitonine | 0.9636 | 0.990909 | Heuristic | No
ferritine, sat. O2, Fibrinogen, ALT | 0.9636 | 1 | Simulated Annealing | No
ferritine, sat. O2, urea, LDH | 0.9545 | 1 | Crow Search | No
ferritine, sat. O2, procalcitonine, LDH | 0.9455 | 0.990909 | Heuristic | No
ferritine, sat. O2, Na, LDH | 0.9455 | 1 | Crow Search | Yes

3.10 Complementary analysis

The perplexity of the best models is presented below in Table 16. Models 9, 10 and 11 have been added; they were generated progressively from the biomarkers ferritin, oxygen saturation and fibrinogen, the three with the highest absolute correlation with the target variable. In addition, model 12, which contains all the variables, has been added. In this table, specificity is the average over the three classes.

Table 16. Models ordered from lowest to highest perplexity, with additional metrics.

Model | Characteristics | Accuracy | CV Score | Perplexity | Specificity
M1 | ferritine, sat. O2, procalcitonine | 1 | 0.9636 | 1.0478 | 1
M4 | ferritine, sat. O2, creatinine, ALT | 1 | 0.9727 | 1.0502 | 1
M10 | ferritine, sat. O2 | 0.9643 | 0.9636 | 1.0633 | 0.9849
M3 | ferritine, urea, Cl | 1 | 0.9727 | 1.0635 | 1
M7 | ferritine, sat. O2, urea, LDH | 1 | 0.9546 | 1.0671 | 1
M2 | ferritine, sat. O2, procalcitonine, AST | 1 | 0.9636 | 1.0683 | 1
M5 | ferritine, sat. O2, LDH, ALT | 1 | 0.9727 | 1.0743 | 1
M6 | ferritine, sat. O2, Fibrinogen, ALT | 1 | 0.9636 | 1.0850 | 1
M8 | ferritine, sat. O2, Na, LDH | 1 | 0.9455 | 1.0998 | 1
M11 | ferritine, sat. O2, Fibrinogen | 0.9643 | 0.9273 | 1.126 | 0.9849
M12 | ALL | 1 | 0.9727 | 1.191 | 1
M9 | ferritine | 0.8929 | 0.9 | 1.254 | 0.9444

A minimal difference between models can be noted in this table. One of the models from the heuristic method presents the lowest perplexity of the set, closely followed by one of the best models according to CV score. The model considered best in Table 15 does not have the lowest perplexity; even so, in second place is one of the best models found in said table, which is also the only one among those at the top of the table with a perfect train score. These results could indicate that reducing model variables carries the possibility of a slight degradation in performance, which could go unnoticed by classical accuracy tests.

This particular measure was adopted for a pragmatic reason: the predicted probability distribution could be used experimentally as an indicator of improvement or worsening of the condition in the absence of temporal data, since this distribution reflects the proximity and similarity of a sample across the different categories. It is suggested that perplexity can play a significant role as a metric for differential analysis between the most prominent models.
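The article does not state the exact formula used; a common definition, assumed here, takes perplexity as the exponential of the mean negative log-probability that the classifier assigns to each sample's true class, so a value of 1.0 corresponds to complete confidence and larger values to flatter, more "surprised" distributions.

```python
import math

def perplexity(true_class_probs):
    """Perplexity as exp of the mean negative log-probability assigned to
    the true class of each sample (assumed definition, see text)."""
    n = len(true_class_probs)
    return math.exp(-sum(math.log(p) for p in true_class_probs) / n)

confident = perplexity([0.99, 0.98, 0.97])  # sharp distributions -> near 1.0
hesitant = perplexity([0.70, 0.60, 0.80])   # flatter distributions -> higher
```

Under this reading, the values in Table 16 (all between roughly 1.05 and 1.25) indicate classifiers that are confident on most samples, with the differences between models lying in how sharply they separate the borderline cases.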

The model highlighted in the table in question (determined by its CV score and its low perplexity) stands out for its choice of biomarkers of a varied nature, which could suggest greater robustness of the model itself.

4. Discussion

4.1 Importance of variables

This study supports the conclusions presented by Gharib et al.,7 who emphasize the relevance of negative correlations with oxygen saturation and positive correlations with ferritin when investigating biomarkers of inflammation related to the severity of COVID-19, and who also highlight the importance of biomarkers such as IL-6, LDH, CRP and D-dimer. The correlations with the variables mentioned in the study by Khan et al.8 are also corroborated during the analysis phase of this study, especially regarding ferritin (highly significant), IL-6 (highly significant) and CRP (slightly significant). Similarly, the study conducted in Alexandria by Abdelhalim et al.10 found that 71.4% of non-hospitalized patients had elevated levels of serum ferritin, reaffirming its relevance as a biomarker in the diagnosis of COVID-19.

The results obtained could indicate that, beyond the individual impact each variable may have, the interaction between various biomarkers could be especially significant. Most of the models investigated integrate fundamental features, such as ferritin levels and oxygen saturation, along with other secondary features, such as fibrinogen, sodium (Na), creatinine, urea, chloride (Cl), and the levels of the ALT and AST enzymes. In the differential perplexity analysis, three outstanding models were identified, and the model with the best performance (determined by its highest CV score and its lowest perplexity) uses variables that reveal liver damage (ALT), kidney damage (creatinine), immunoinflammatory alterations (ferritin), and the balance in oxygenation (sat. O2). These observations could suggest that feature diversity plays an important role in decreasing model perplexity and in granularly increasing overall prediction performance. Within the scope of this study, no previous report of this role of feature diversity was found.

A point to highlight is the indication of Samprathi et al.13 on the non-recommendation of serum ferritin to monitor treatment response, despite its prominence in predictions. Additionally, a comprehensive review11 identified lymphopenia, thrombocytopenia, interleukin-6, ferritin, among other biomarkers, as associated with severe and fatal cases of COVID-19. In particular, hyperferritinemia was highlighted as an indicator of systemic inflammation and a poor prognosis.

These findings, together with the identification of ferritin as an independent predictor of COVID-19 mortality in the study by Sibtain et al.,12 highlight the need to continue investigating and evaluating the function and predictive value of these biomarkers in the disease using the interaction between variables as a framework.

4.2 Machine learning

This work is similar in approach to the work of Ahmed et al.15 However, there are fundamental differences in methodology. One of these is the deliberate exclusion of models that do not easily admit balancing, as is the case of kNN. This choice was made even though many studies, including the one just mentioned,15 include it among the compared methods, along with Random Forest, Support Vector Machine, Logistic Regression and Decision Tree, as does the study by Iwendi et al.,14 which used Random Forest and AdaBoost.

The current work prioritized a preselection of biomarkers based on the Pearson p-value, reducing the search space for the metaheuristic algorithms used for subsequent feature selection. This preselection is consistent with the study by Elif et al.,17 which identified significant biomarkers with a similar method and agrees with this study on biomarkers such as LDH, high leukocyte count and C-reactive protein. The models presented in that study achieved an accuracy of 0.9796 and an AUC value of 0.959; it can be suggested that such high prediction values may indeed be related to the selection of statistically significant features, given that this study also reports exceptional metrics and AUC. A relevant aspect is that in the current work, unlike in Xiong et al.,26 the data used during the correlation analysis were reused for the subsequent variable-reduction experiments. Other similar works15,26 did not implement the 5-fold stratified cross-validation considered fundamental in this work for evaluating the performance of the models. In the work of Patterson et al.,18 a 10-fold CV approach is used, which appears to be quite effective and is feasible given the magnitude of their dataset (n=225).

It is suggested that data balancing could yield a significant improvement in model construction. Xiong et al.26 carried out an analysis similar to the approach of this study. Despite working with a more extensive dataset (n=287) and a notable diversity of features, they do not mention a specific balancing process. In their results, the Random Forest model stood out with an AUC of 0.970, while the SVM model recorded the highest precision, at 88.5%. Our study, with a smaller sample (n=138), delves deeper into and expands on the features of the data, in addition to using balancing techniques. For their part, Gok et al.17 worked with a larger dataset (n=863), which was balanced using the SMOTE technique. Of the 8 trained models, Random Forest had the best performance, achieving an accuracy of 0.98, similar to the results of our study. A key difference between the work of Gok et al.17 and ours lies in the nature of the variables used for prediction. While Gok et al.17 did not use laboratory results containing biomarkers such as IL-6, LDH or ferritin, in this study those biomarkers have been essential to the analysis. Their absence can make prediction difficult because the remaining variables are less specific, which highlights the high precision of the models generated in that study. It is also worth highlighting that the study by Gok et al.17 emphasized how data preprocessing and balancing positively influence the accuracy and robustness of the model.

It should be noted that although Xiong et al.26 do not detail a balancing process, other studies, such as that of Wang et al.,27 have adopted the SMOTE technique successfully; specifically, Wang et al.27 achieved an exceptional precision of 0.9905 and an AUC value of 0.846. Given the performance results of the models, this study has recognized Random Forest as one of the best classifiers, aligning with previous observations.15,26 It is relevant to mention that this work obtained an AUC value slightly exceeding that reported by Gok et al.,17 but as the dataset is smaller, more tests must be performed before reaching definitive conclusions. On other metrics, such as accuracy, this work achieved results comparable to studies such as that of Iwendi et al.,14 which reported an accuracy of 0.94 and an F1 score of 0.86, and that of Cui et al.,19 which reported an accuracy of 84.2% and AUC values of 0.874 for the first group (age ≤ 70) and 0.842 for the second (age ≥ 70).

Finally, while studies such as that of Iwendi et al.,14 which highlighted a correlation between gender and mortality, and that of Cui et al.,19 focused on the differences in biomarkers according to age, have provided unique perspectives, the present work has focused on validating and comparing techniques based on clinical analyzes with the aim of identifying the most appropriate approach to predict outcomes in patients with COVID-19. Future work could include the participation of these variables in the models.

4.3 Variable reduction

Feature selection has notably benefited from the use of metaheuristic algorithms, which have shown consistent improvements over traditional heuristic techniques, especially in high-complexity areas such as COVID-19 severity prediction. In this work, we sought to use diverse algorithms inspired by evolution, physical mechanisms and collective intelligence, as described in the work of Agrawa et al.28 It should be noted that the algorithms applied were quite simple, preserving the minimal version of each one to encourage equality of conditions, so they are not directly comparable with those of other works that present cumulative improvements to said algorithms.20-23,29 Relevant differences with those works are discussed below.

The genetic algorithm has proven effective in several contexts, including the work of Kabir et al.,20 which introduced a redundancy-reduction approach. In our study and in that of Hayet et al.,29 genetic algorithms provided remarkable results, although the choice of more appropriate objective functions could have improved performance. Focusing on the detection of COVID-19, the study by Hayet et al.29 highlights the importance of specific variables, such as CRP, respiratory rate, oxygen saturation, and LDH. The consistency in the selection of these variables across different investigations highlights their relevance, although differences have also been observed, such as the elimination of certain variables due to insufficient data quality.

Simulated annealing, analyzed by Guha et al.,21 has proven useful for multidimensional problems. That work describes a hybridization with a recently developed optimization algorithm, the Equilibrium Optimizer, which is able to act during the exploitation phase and obtain better results than simple simulated annealing alone; in that scheme, simulated annealing works as a local pre-searcher. Additional mention is made of the work of Bandyopadhyay et al.,23 where a two-stage pipeline was proposed for the detection of COVID-19 in radiological images. They used a DenseNet-based CNN for feature extraction and combined the Harris Hawks Optimization (HHO) algorithm with Simulated Annealing (SA) and chaotic initialization for selection. The study achieved an accuracy of 98.85% and reduced the number of features by 75%.

Despite efforts to use crow search, highlighted by Chen et al.,22 this algorithm showed limitations in our work, including a potential for overfitting. In the study by Chen et al.,22 the search is improved using a hierarchical approach that reports favorable results. In all the previous works, an improvement in results can be seen when the algorithms are hybridized with pre-optimization during the search phase, which enables a better exploitation phase.

5. Conclusions

In this study, a comparative analysis of different classification models was carried out to predict the severity of COVID-19 cases with biological markers. The results obtained provide relevant information for the understanding and management of the disease. Below are the main conclusions of the study:

  • Model evaluation: Five classification models (Random Forest, Support Vector Machine, Logistic Regression, Decision Tree and Neural Network) were compared using metrics such as precision, recall, F1-score and accuracy. Most models showed similar performance across all metrics evaluated. The Random Forest model showed exceptional performance, with a cross-validation score of 0.9727 and perfect scores on the remaining metrics.

  • Correlation of variables: A correlation analysis was carried out between the variables and the severity of COVID-19. It was observed that the biological markers ferritin and oxygen saturation (sat. O2) showed the highest correlations (0.8277, -0.6444) in absolute value with the target variable (COVID19_Severity). These findings may be useful in understanding factors (biological markers) related to disease severity.

  • Feature selection: Experiments were conducted to evaluate the efficiency of the model when using different sets of predictor variables (biological markers). The experiments were carried out with genetic algorithms, simulated annealing and crow search. It was found that even with only two biological markers, ferritin and oxygen saturation, the model maintained a high level of accuracy (0.9636). This suggests that it is possible to reduce the number of variables without a significant loss in the predictive ability of the model. The best model (fewest variables and highest precision) was the one that included the features ferritine, urea and Cl, without loss in any of the metrics.

  • Overfitting evaluation: Cross-validation was performed and the importance of features in the Random Forest model was examined. The results indicated a stable performance of the model, with a small standard deviation and a prominent importance of the variables ferritin and oxygen saturation.

6. Recommendations

  • Expansion of the dataset: It is considered that the dataset is relatively small and a larger sample would be preferred for better generalization of the model's classification patterns.

  • Validation with external datasets: Although cross-validation allows us to better perceive the generalization of the models, it would be good to be able to validate said model with public datasets. However, many of these datasets do not have the diversity of features necessary to fully use the models in this work. It should be noted that public datasets require a lot of preprocessing and exhaustive research would be necessary for such an experiment. If it is not possible to completely make these data sources compatible, the number of features to be used could be limited in order to homogenize the data.

  • Feature selection: Although the experiments used fairly minimalist versions of the algorithms, some of the variants mentioned in the discussion could be applied. Their implementation would require substantial expertise but could bring significant improvements in the convergence of the selections.

  • Overfitting evaluation: With a homogenized external dataset or more cases, one could better assess the overfitting of the generated models.

Ethical considerations

The ethical and legal guidelines considered for sampling are described in De la Cruz-Cano et al.25

Author contributions

Eduardo de la Cruz-Cano - collected the dataset studied in this research and performed data curation.

Freddy de la Cruz-Ruiz - project administration and experimental study design.

Juan Pablo Olán-Ramón - software, wrote code, conducted experiments and wrote the manuscript draft.

Sarai Aguilar-Barojas - reviewed and approved the manuscript for submission.

Erasmo Zamarron-Licona - reviewed and approved the manuscript for submission.

How to cite this article:
Olán-Ramón JP, De la Cruz-Ruiz F, De la Cruz-Cano E et al. Identification of Biomarkers for Severity in COVID-19 Through Comparative Analysis of Five Machine Learning Algoritms [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2024, 13:688 (https://doi.org/10.12688/f1000research.150128.1)

Open Peer Review

Key to reviewer statuses:

Approved - The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.

Approved with reservations - A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.

Not approved - Fundamental flaws in the paper seriously undermine the findings and conclusions.
Reviewer Report, 09 Oct 2024
Gustavo Sganzerla Martinez, Dalhousie University, Halifax, Nova Scotia, Canada
Status: Not Approved
I read with interest the work from Olan-Ramon et al. In their work, the authors compared different machine learning algorithms to predict severity in COVID-19 patients based on laboratory biomarkers.

Overall, the introduction is extensively long and ...
How to cite this report:
Martinez GS. Reviewer Report For: Identification of Biomarkers for Severity in COVID-19 Through Comparative Analysis of Five Machine Learning Algoritms [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2024, 13:688 (https://doi.org/10.5256/f1000research.164667.r322245)
Reviewer Report, 28 Jun 2024
Maitham Ghaly Yousif, College of Science, University of Al-Qadisiyah, Baghdad, Iraq
Status: Approved with Reservations
Assessment and Recommendations for the Study:
Strengths:
  1. Algorithm Selection and Evaluation: The study effectively utilised a variety of machine learning algorithms, evaluating them comprehensively using multiple accuracy metrics, which enhances the credibility of the results.
...
How to cite this report:
Yousif MG. Reviewer Report For: Identification of Biomarkers for Severity in COVID-19 Through Comparative Analysis of Five Machine Learning Algoritms [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2024, 13:688 (https://doi.org/10.5256/f1000research.164667.r296235)