Background

F1000Research

2046-1402

F1000 Research Limited

London, UK

10.12688/f1000research.178061.1

Research Article

Articles

Evaluation of Feature Attributes for the Prediction of Heart Disease Using Machine Learning

[version 1; peer review: 1 approved with reservations]

Chavan, Sneh S (1): Writing – Original Draft Preparation; Writing – Review & Editing; https://orcid.org/0009-0009-0101-2939

Ajitha (2): Resources; Supervision

Tailor (3, a): Conceptualization; Formal Analysis; Supervision; Validation; https://orcid.org/0000-0001-6096-458X

Chakravarty, Shilpi (4, b): Investigation; Project Administration; Visualization; https://orcid.org/0000-0002-7138-0181

1 Assistant Professor, School of Computer Science and Applications, Chhatrapati Shahu Institute of Business Education and Research (CSIBER), Kolhapur, MH, India

2 Department of Computer Applications, M.S. Ramaiah Institute of Technology, Visvesveraya Technological University, Belagavi, Karnataka, 590018, India

3 Director, Chhatrapati Shahu Institute of Business Education and Research (CSIBER), Kolhapur, MH, India

4 Associate Professor, Centre for Distance and Online Education, Manipal University, Jaipur, Rajasthan, India

a drrktailor@siberindia.edu.in

b shilpi.chakravarty@jaipur.manipal.edu

No competing interests were disclosed.

16 5 2026

2026

743

25 3 2026

2026

This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Every year, approximately 20.5 million people die of cardiovascular diseases (CVDs), according to World Health Organization (WHO) reports. Early detection of CVD allows patients to be treated in time: they can alter their daily routines and, if required, take medication. By 2030, these deaths are expected to reach 24 million, accounting for 31.5% of all deaths worldwide. According to a WHO study, medication therapy and patient counseling are also necessary to lower the risk of heart attack and stroke by 2025. ^{1,2}

Methods

For the early prediction of CVD, six machine learning methods were employed: a regression model, naïve Bayes, random forest, logistic regression, XGBoost, and LightGBM. Thirteen features were chosen for training. The models were trained in three ways: with the full set of thirteen features, with features selected by the chi-square test, and with features selected at correlation values of 0.75 and 0.5. The performance metrics used to evaluate the models were accuracy, F1-score, recall, and precision.

Results

Random forest provided the highest accuracy, 99%, when all features were considered. The models were also trained on the correlation-reduced feature sets, and their accuracy was evaluated. The proposed model was implemented in Python.

Keywords: Heart disease, Preprocess, Feature selection, Machine learning.

The author(s) declared that no grants were involved in supporting this work.

1. Introduction

CVD is currently of great concern in the medical field. It is the most common chronic and fatal disease. Worldwide, a high percentage of people die of CVD, according to the World Heart Report (WHO). ¹ The report also states that approximately 85% of patients with CVD end up with heart attacks and strokes. A survey conducted by the WHO states that approximately 20 million people die of CVD every year, which accounts for 31% of all deaths globally. This number may increase to 24 million in another five years if early detection and treatment are not performed. ² A heart attack occurs when a clot of blood, cholesterol, fat, and other substances deposited in the arteries of the heart blocks the flow of blood to parts of the heart and causes it to stop. The reasons for heart attacks include obesity, diabetes, a sedentary lifestyle, stress, an unhealthy diet, high blood pressure, and high cholesterol. A stroke occurs when blood flowing to or within the brain clots and circulation stops. ³ Heart failure occurs when the heart fails to pump blood to all parts of the body. ⁴ Symptoms of CVD include shortness of breath, variation in heartbeat, dizziness, sweating, nausea, discomfort in the chest area, and swelling of the feet. With early recognition of symptoms and appropriate medication, patients can be brought out of danger. Other causes of CVD include obesity, high blood pressure, alcohol intake, lack of physical activity, genetic mutations, and high cholesterol. If CVD is detected early, patients can change their lifestyle, take up more physical activity, and avoid alcohol and smoking, which can help reduce the mortality rate. ⁵

Current laboratories are equipped to diagnose heart disease using the patient’s medical history and symptoms. Doctors then analyze the laboratory reports to make a final decision. A few studies report that the presence of CVD is predicted accurately in approximately 67% of patients. ⁶ An automatic system is therefore needed for the accurate prediction of CVD. Recent research on machine learning models helps to improve decision-making and opens many research opportunities in the health domain, ⁷ especially in the early detection of CVD and other chronic diseases, which can prevent deaths. Machine learning is used in many applications, including disease risk detection, tumor detection, and other health-related problems, and its predictive modeling techniques help to overcome current limitations. Because of this advancement, doctors can save time in reviewing reports and can prescribe more accurate medication. Machine learning tasks include regression and classification. Classification is widely employed in the health domain, and supervised machine learning models provide high accuracy in detecting whether a patient is healthy or unhealthy. ⁸

In 2024, the authors of ⁹ proposed machine learning methods to detect heart disease using the Heart 2020 dataset from Kaggle. They employed a stack of machine learning models, namely Random Forest, Decision Tree, LightGBM, and Logistic Regression, and achieved a highest accuracy of 76.9%; a limitation of the study was that exposure to various datasets was still needed. In another study, the Statlog and Cleveland datasets from the UCI website were used to train and test the machine learning model, achieving a maximum accuracy of 88.87% for Cleveland and 88.88% for Statlog. ¹² The same dataset was employed by another researcher, with novelty in feature extraction and classification: features were selected using PCA and RFE, and bagging, boosting, and ensembling were performed with an ANN for classification, giving an accuracy of 94.1%. ¹³ Another study prepared a comparison table of ensemble classifiers against existing machine learning models, with OneR, GA, and correlation as feature extraction methods; it reported an accuracy of 67% with SVM and 8.16% with the correlation method and hybrid models, using the dataset from the Framingham Heart Study. ¹⁴ A few adaptive feature selection methods have gained importance in extracting features using RFE methods. The author of ¹⁵ used these adaptive feature selection methods along with RFE. For classification, SVM, LR, decision tree, and random forest (RF) were employed, and RF achieved a high accuracy of 97.4%. Studies on several other datasets have also achieved high accuracy with machine learning models: 96.21% ¹⁷ and 95.08%. ¹⁶ With a modification to feature extraction, sequential feature selection based on gradient boosting, accuracy increased to 98.78%. ¹⁹ Recursive Feature Elimination with cross-validation, proposed in ¹⁸, improved existing performance by 14.81%. Later, in 2024, ¹⁰ proposed a feature extraction methodology using the same two machine learning models, SVM and KNN. Feature attributes were selected with the chi-square statistic, cuckoo search optimization was employed as the optimizer, and the selected features were fed to the classifiers. Researchers have also moved to deep learning and hybrid models to predict heart disease; in 2024, one author employed CNN-UMAP and achieved an accuracy of 91.88%, with Relief, UMAP, and LDA as feature selection techniques.

Based on this review, three gaps are identified. First, training and testing a model requires patient data, and if the number of patients is small, the model can overfit. Second, more patients are therefore needed to obtain reliable accuracy. Third, feature extraction can be improved by selecting features using various methods. The objective of this study was to generate a dataset with a larger number of patients (1300) than the existing 300-patient data and to design and develop an AI-based machine learning model to predict heart disease in patients.

The contribution of the current research work starts with the preparation of the dataset: two datasets (VA Long Beach, 303 patients; cardiovascular disease, 1000 patients) were combined, increasing the number of patients to 1303. Feature attributes were selected based on correlation values of 0.75 and 0.5, and each resulting feature set was evaluated. In addition to the correlation-based features, features selected by the chi-square test were used to train and test the machine learning models. For classification, six machine learning methods were employed to evaluate the datasets. Besides accuracy, other metrics such as F1 score, precision, and recall are tabulated. The machine learning models were tuned to provide good accuracy; to avoid underfitting and overfitting, the delta step and max_depth were tuned in the XGBoost classifier.

Organization of paper

Section 1 introduces heart diseases and their global death ratios, describes how machine learning models can address detection in minimum time, and reviews related work on detecting heart diseases using machine learning. Section 2 describes the proposed methodology for detecting heart disease using machine learning methods. Section 3 presents the results. Finally, conclusions on the proposed work and future directions are provided.

2. Methodology

This section [Figure 1] describes the steps involved in the detection of heart disease. The steps are as follows:

1. Collection of the dataset

2. Pre-process the dataset to identify missing values and fill them.

3. Evaluate the feature attributes based on correlation

4. Train machine learning models

5. Evaluation of models

Figure 1. Proposed flow diagram.

The first step in every machine learning workflow is data collection. We built the dataset by combining two datasets: the Cleveland, Hungary, Switzerland, and VA Long Beach dataset and the archive dataset from Kaggle. ^{21,22} The combined dataset records 1303 patients with a total of 13 features. In the pre-processing step, missing values are handled by calculating the sum of the neighboring rows. This replacement increases the accuracy of the machine learning models. Next, feature selection is performed using the correlation between features (0.75 and 0.5) and the chi-square test. Machine learning models are then applied to predict CVD. The models were evaluated after being trained with all features and with the two correlation-based feature sets. The dataset was divided into training and testing groups in a 70:30 ratio. To avoid overfitting and underfitting, the models were tuned.
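The 70:30 division described above can be illustrated with a minimal sketch. It assumes a stratified split (class proportions preserved in both parts), which the text does not state; the function name, the seed, and the class counts are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.30, seed=42):
    """Return (train_idx, test_idx) with similar class proportions in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels for 1303 patients; the class counts are invented.
labels = [0] * 700 + [1] * 603
train_idx, test_idx = stratified_split(labels)
```

With scikit-learn, a comparable split is available through `train_test_split(X, y, test_size=0.3, stratify=y)`.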

2.1. Details of dataset

The dataset has 13 features, including one target feature (which indicates healthy and unhealthy states), recorded for every patient. This is a balanced dataset with both categorical and numerical variables, which is ideal for developing predictive models and analyzing heart diseases. An attribute with fewer than ten distinct classes is considered categorical or nominal. These include sex, type of chest pain, fasting blood glucose level, and ECG findings, among others. The sex attribute is binary, with ‘1’ for male and ‘0’ for female. Chest pain type (cp) is categorized into four distinct classes: typical angina, atypical angina, non-anginal chest pain, and asymptomatic chest pain. Fasting blood sugar (fbs) is also binary, indicating whether the value exceeds 120 mg/dL (1 for true, 0 for false).

The results of the resting electrocardiogram (restecg) fall into three classes: normal, ST-T wave changes, and definite left ventricular hypertrophy (LVH). Another binary variable, exercise-induced angina (exang), takes a value of 1 to indicate the presence of chest pain during exertion and 0 to indicate its absence. The slope of the peak exercise ST segment has three types: downsloping, flat, and upsloping. The number of major vessels visualized by fluoroscopy (ca) ranges from 0 to 3, while the thalassemia attribute (thal) categorizes the heart status as normal, fixed defect, or reversible defect.

The numerical features provide continuous data, which, unlike nominal features, are valuable for fine-grained analysis. These variables include age, resting systolic blood pressure in millimeters of mercury (trestbps), serum cholesterol in milligrams per deciliter (chol), maximum achieved heart rate (thalach), and ST segment depression compared to rest (oldpeak). They record specific clinical parameters that are crucial for assessing cardiovascular function.

Note also that the target attribute has been transformed into a binary variable from its original quantification into five classes that represent varying degrees of heart disease risk. This simplifies the task to a binary classification problem, with the goal of determining whether a person has heart disease [Table 1].
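The conversion of the five-class target into a binary variable can be sketched as follows. The sketch assumes that the original classes are coded 0 to 4, with 0 meaning no disease and 1 to 4 meaning increasing degrees of disease; this coding and the function name are assumptions made for illustration, not details given in the text.

```python
def binarize_target(severity):
    """Map an assumed 0-4 severity code to 0 (no disease) or 1 (disease present)."""
    if severity not in (0, 1, 2, 3, 4):
        raise ValueError(f"unexpected severity value: {severity!r}")
    return 0 if severity == 0 else 1

original = [0, 2, 1, 0, 4, 3]   # hypothetical raw target values
binary = [binarize_target(v) for v in original]
```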

Table 1. Feature attributes and their ranges.

Name of feature	Character of feature	Range of feature
Age	Integer	30 to 77
Sex	Categorical	0: female; 1: male
Cp	Integer: chest pain type	1 to 4 (1: typical angina; 2: atypical angina; 3: non-anginal pain; 4: asymptomatic)
Trestbps	Integer: blood pressure on admission to hospital	94 to 200
Chol	Integer: cholesterol level	126 to 564
fbs	Logical: fasting blood sugar level	0: <120 mg/dl; 1: >120 mg/dl
Restecg	Categorical: electrocardiographic results	1: normal; 2: ST-T wave abnormality; 3: definite left ventricular hypertrophy
Thalach	Integer: maximum heart rate achieved	71 to 202
Exang	Categorical: exercise-induced angina	1: yes; 0: no
Oldpeak	Numeric: ST depression induced by exercise relative to rest	0 to 6.2
Slope	Integer: slope of the peak exercise ST segment	1: upsloping; 2: flat; 3: downsloping
Ca	Integer: number of major vessels coloured by fluoroscopy	0 to 3
Target	Text	Yes and No (disease present or not)

To increase the number of patients in the dataset, two datasets were combined:

1. Cleveland, Hungary, Switzerland, and the VA Long Beach ²² – 303 patients

2. Cardiovascular Disease ²¹ – 1000 patients

The two datasets share almost the same features but use different names for them. Table 2 shows the naming conventions. The thallium feature is not present in the Cardiovascular Disease dataset, so it was dropped and the remaining features were retained. The statistical data for each attribute, including the minimum, maximum, mean, standard deviation, and the 25th, 50th, and 75th percentiles, are displayed in Table 3. Various machine learning models were then trained on the combined dataset to identify the classifier that is most effective in detecting CVD.

Table 2. Details of feature attributes between two datasets.

Dataset 1: Cardiovascular disease	Dataset 2: Cleveland	Details
Age	Age	Age of the patient; present in both datasets
Gender	Sex	Encoded as 1 for male and 0 for female
Chestpain	Chest pain type	4 types of chest pain
restingBp	BP	Resting BP (mm Hg)
Serumcholestrol	Cholesterol	Cholesterol level (mg/dL)
Fastingbloodsugar	FBS over 120	Fasting blood sugar level: 1 if FBS > 120, 0 if FBS < 120
Restingrelectro	EKG results	ST wave (electrocardiogram)
Exerciseangia	Exercise angina	Presence of angina: 1 = yes, 0 = no
Oldpeak	ST depression	Both features give the same ST segment depression induced by exercise
Slope	Slope of ST	Peak exercise (up, flat and down slope)
Noofmajorvessels	Number of vessels fluro	Fluoroscopy type (0–4)
Not present	Thallium	Thallium scan result
Target	Heart Disease	Disease presence: 1 = yes, 0 = no
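The name harmonization summarized in Table 2 can be sketched in a few lines. The first two entries of each tuple follow Table 2; the short common names in the third position, the function name, and the sample records are hypothetical choices made only for this illustration.

```python
# (Dataset 1 name, Dataset 2 name, common name). The common names are a
# hypothetical choice for this sketch; the first two columns follow Table 2.
NAME_MAP = [
    ("Age", "Age", "age"),
    ("Gender", "Sex", "sex"),
    ("Chestpain", "Chest pain type", "cp"),
    ("restingBp", "BP", "trestbps"),
    ("Serumcholestrol", "Cholesterol", "chol"),
    ("Fastingbloodsugar", "FBS over 120", "fbs"),
    ("Restingrelectro", "EKG results", "restecg"),
    ("Exerciseangia", "Exercise angina", "exang"),
    ("Oldpeak", "ST depression", "oldpeak"),
    ("Slope", "Slope of ST", "slope"),
    ("Noofmajorvessels", "Number of vessels fluro", "ca"),
    ("Target", "Heart Disease", "target"),
]

def harmonize(rows, source):
    """Rename record keys to the common names; unmapped keys (e.g. Thallium) are dropped."""
    column = {"dataset1": 0, "dataset2": 1}[source]
    mapping = {names[column]: names[2] for names in NAME_MAP}
    return [{mapping[k]: v for k, v in row.items() if k in mapping} for row in rows]

# Hypothetical records, one per dataset.
d1 = [{"Age": 53, "Gender": 1, "Target": 1}]
d2 = [{"Age": 61, "Sex": 0, "Thallium": 3, "Heart Disease": 0}]
combined = harmonize(d1, "dataset1") + harmonize(d2, "dataset2")
```

Dropping keys that have no counterpart mirrors the decision to leave out the thallium feature, which exists in only one of the two datasets.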

Table 3. Dataset statistical information.

a) Integer type feature attributes details
Feature/measure	Age	Resting bps	Chol	oldpeak
Mean	53	132.00	217	0.94
Std	9.313	18.27	95	1.09
Min	28	0	0	−2.6
25%	47	120	197	0
50%	54	130	233	0.6
75%	60	140	271	1.6
max	77	200	603	6.2

b) Categorical type feature attributes details
Feature	Label	Percentage of occurrence (%)	Missing values
Sex	0	28	Zero
Sex	1	72	Zero
cp	1	6	Zero
cp	2	19	Zero
cp	3	27.85	Zero
cp	4	47.2	Zero
fbs	0	89.5	Zero
fbs	1	10.5	Zero
restingecg	0	62	Zero
restingecg	1	3	Zero
restingecg	2	35	Zero
exang	0	69.9	Zero
exang	1	30.1	Zero
slope	0	1	Zero
slope	1	44	Zero
slope	2	48	Zero
slope	3	7	Zero

2.2. Pre-processing of dataset

If no value is assigned to a particular variable in an observation, it is referred to as missing or incomplete. These missing values can originate from different circumstances, such as when the respondent fails to answer some questions, failure of the sensor, data loss during transfer, disruption in the network connection, or a mathematical operation, such as division by zero. In datasets [ Figure 2], missing values can be indicated by spaces, hyphens, or other marks that differentiate them from other regular values.

Figure 2. Visualization of features in dataset.

Missing values may or may not affect the statistical validity of the outcome. Although analysis can proceed with incomplete data, the outcomes may lack robustness or accuracy because of the omitted information. Even if each individual variable has only a small percentage of missing data, the cumulative amount across the dataset may be significant and can influence the analysis results. Because observations with missing values can still be quite informative, it is worthwhile to fill them in rather than delete them. The strategies include:

• It is preferable to use the mean value of the variable of interest to avoid bias.

• The Cleveland heart dataset contains missing values for the nominal attributes thal (thalassemia) and ca (the number of major vessels). These were handled as follows.

• The ca attribute had four items with missing values, which were replaced by the most frequent value, zero, which occurred 176 times in 299 records.
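The two replacement strategies listed above, the mean for a numerical variable and the most frequent value for a nominal one, can be sketched as follows. The function name and the sample columns are hypothetical, and missing entries are assumed to be represented as `None`.

```python
from statistics import mean, mode

def impute(values, strategy):
    """Replace None entries with the column mean (numeric) or mode (nominal)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else mode(observed)
    return [fill if v is None else v for v in values]

chol = [200, None, 240, 220]     # hypothetical numeric column
ca = [0, 0, None, 1, 0, 2]       # hypothetical nominal column
chol_filled = impute(chol, "mean")
ca_filled = impute(ca, "mode")
```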

2.3. Classification by machine learning models

Earlier studies used multiple supervised machine learning models on a single dataset to predict CVD. Six machine learning models are discussed in this section to predict patients who may be at risk of CVD.

Regression models estimate the target variable on a continuum rather than grouping outcomes into categories. Linear regression is the basic form of regression analysis; it assumes linearity and attempts to minimize the error between the actual and predicted values. Other types include polynomial regression, for more complex relationships, and ridge regression, for cases where regularization is required. ^{22,23}

Naïve Bayes is a probabilistic classifier derived from Bayes’ theorem that relies on the assumption of feature independence. The algorithm is computationally efficient and is widely applied to text categorization, spam detection, and sentiment analysis. It can work well in various real-life situations, particularly when dealing with big data. Based on the input features, it assigns an instance to the class with the highest posterior probability. ²⁴
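The decision rule described above can be written compactly. In the standard formulation, with $c$ a class and $x_1, \dots, x_n$ the feature values (symbols introduced here for illustration), the independence assumption lets the posterior factor into per-feature terms:

```latex
\[
\hat{y} \;=\; \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
\]
```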

Random Forest is an ensemble method composed of many decision trees, in which the bagging technique is adopted for increased precision and stability. Each tree is trained on a subset of the data, and the final decision (class for classification or mean for regression) is made by combining the results of all the trees. It is less prone to overfitting than individual decision trees and is flexible for tasks involving structured data. ²⁵

Logistic regression is a classification algorithm that predicts probabilities by applying a sigmoid function, which is appropriate when the output is binary. Although closely related to linear regression, it transforms the results into the range of [0, 1]. It has been applied in disease prediction, email classification, and customer churn analysis because of its simplicity and interpretability. ²⁷
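For reference, the sigmoid mapping mentioned above has the standard form below, where $\mathbf{w}$ and $b$ denote the learned weights and bias (notation introduced here, not taken from the text):

```latex
\[
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
P(y = 1 \mid \mathbf{x}) = \sigma\!\left(\mathbf{w}^{\top}\mathbf{x} + b\right)
\]
```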

XGBoost is a popular gradient-boosting framework that is built for high performance and often wins machine-learning competitions. It iteratively constructs decision trees, gradually correcting the mistakes made in prior steps, and employs shrinkage to curb overfitting. It performs well when dealing with missing values and large datasets. ¹⁴ The following modifications were made to avoid underfitting and overfitting:

• The maximum depth was set to five, so no tree grows beyond level 5. Deeper trees tend to overfit, so the depth is restricted to five. ¹⁴

• The delta step was set to 0.1. With an imbalanced dataset, the model learns very slowly; a small delta step lets the model adjust the weights accordingly and reduces overfitting. ²⁵

• Gamma was set to 0.6. This regularization term controls the complexity of the model when new trees are added, and 0.6 gave good results during new tree additions. ¹⁴

These modifications help the model cope with the imbalanced dataset and avoid overfitting.
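The three settings above can be collected in a small configuration sketch. Mapping the "delta step" to XGBoost's `max_delta_step` parameter is an assumption, and the helper name `build_classifier` is hypothetical; all other parameters are left at the library defaults because the text does not report them.

```python
# Settings reported in the text. Treating "delta step" as XGBoost's
# `max_delta_step` parameter is an assumption of this sketch.
XGB_PARAMS = {
    "max_depth": 5,          # trees do not grow beyond depth 5
    "max_delta_step": 0.1,   # caps each tree's weight update
    "gamma": 0.6,            # minimum loss reduction required to split
}

def build_classifier(params=XGB_PARAMS):
    """Return a configured XGBClassifier, or None if xgboost is not installed."""
    try:
        from xgboost import XGBClassifier
    except ImportError:
        return None
    return XGBClassifier(**params)

model = build_classifier()
```

If a model is returned, it can be trained in the usual way with `model.fit(X_train, y_train)`.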

LightGBM is a gradient boosting framework based on speed-optimized decision trees that supports multiclass classification and regression problems. It uses a histogram-based algorithm to build decision trees during learning and manages memory well. Its leaf-wise tree growth makes it faster than XGBoost for a variety of tasks, including recommendation, click prediction, and ranking. ²⁶

2.4. Features evaluation

Two feature selection approaches are employed in the current research to select important features for improving classification: correlation-based selection (with two values, 0.75 and 0.5) and chi-square feature selection. Correlation describes the similarity between two components, and this method determines which features are individually most effective in predicting the target. Two thresholds, 75% and 50%, were considered to collect the highly correlated features [Table 4]. These features have low redundancy and are highly relevant to the target classes. ²⁸

Table 4. Highly correlated features to target values.

Correlation value	Feature name	Feature number
75%	Chest pain type	3
	Exercise_angin	9
	St_slope	11
50%	Sex	2
	Chest pain type	3
	Fbs	6
	Exercise_angin	9
	St_slope	11
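Threshold-based selection of this kind can be sketched as follows. The sketch assumes that a feature is kept when the absolute value of its Pearson correlation with the target reaches the threshold; the text does not spell out the exact rule, and the function names and toy columns are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def select_by_correlation(features, target, threshold):
    """Keep features whose |correlation with the target| reaches the threshold."""
    return [name for name, column in features.items()
            if abs(pearson(column, target)) >= threshold]

# Toy data, invented for illustration only.
target = [0, 0, 1, 1, 0, 1]
features = {
    "cp": [1, 1, 4, 4, 2, 3],
    "exang": [0, 0, 1, 1, 1, 1],
    "chol": [200, 250, 210, 240, 230, 220],
}
strong = select_by_correlation(features, target, 0.75)
moderate = select_by_correlation(features, target, 0.50)
```

Lowering the threshold can only add features, which is why the 0.5 set in Table 4 contains every feature of the 0.75 set.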

Another feature selection method, chi-square attribute evaluation, ranks the features by their computed chi-square statistic ²⁹ with respect to the target values and selects the top ten. The last two features were eliminated, and the machine learning algorithms were trained on the features suggested by the chi-square test [Table 5].

Table 5. Selected features from chi-square.

Feature number	Feature name	Chi-square score
1	Age	134.75
3	Chest pain type	56.027
4	Resting bp	53.44
5	Cholesterol	913.06
6	Fasting blood sugar	43.41
8	Max heart rate	1656.44
9	Exercise angin	111.004
10	Old peak	524.05
11	St slope	51.71
12	num of vessels fluro	32.40
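A chi-square ranking of this kind can be sketched with one common formulation: for each non-negative feature, the observed sum of the feature within each class is compared with the sum expected if the feature were independent of the class. Whether the scores in Table 5 were computed exactly this way is an assumption, and the function names and toy data are hypothetical.

```python
def chi2_score(feature, target):
    """Chi-square statistic between one non-negative feature and a class label.

    Assumes the feature has a positive total, so every expected value is non-zero.
    """
    total = sum(feature)
    n = len(target)
    score = 0.0
    for cls in set(target):
        observed = sum(f for f, t in zip(feature, target) if t == cls)
        expected = total * sum(1 for t in target if t == cls) / n
        score += (observed - expected) ** 2 / expected
    return score

def top_k(features, target, k):
    """Return the names of the k features with the highest chi-square score."""
    scores = {name: chi2_score(col, target) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy data, invented for illustration only.
target = [0, 0, 1, 1]
features = {"a": [1, 1, 5, 5], "b": [3, 3, 3, 3]}
best = top_k(features, target, k=1)
```

Because the statistic grows with the magnitude of the feature values, large-valued features such as cholesterol or maximum heart rate tend to receive large scores, which is consistent with the pattern in Table 5.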

3. Results

This section provides an overview of the results obtained using the proposed model. The implementation starts with the collection of datasets. We combined the two datasets ^{21,22} to form a dataset of 1303 patients. The dataset consists of 13 features, including a target feature for healthy and unhealthy individuals, and is well suited to developing predictive models and analyzing heart disease. It includes categorical variables, such as sex, chest pain type, fasting blood glucose level, ECG findings, exercise-induced angina, and the slope of the peak exercise ST segment. Numerical features, such as age, resting systolic blood pressure, serum cholesterol, maximum achieved heart rate, and ST segment depression compared to rest, provide continuous data for fine-grained analysis.

The target attribute, initially quantified into five classes, was converted into a binary variable, making it easier to determine whether a person had heart disease. The models are trained and evaluated on three different feature sets drawn from the 13 features. First, all models are trained using the full 13 features; second, using the features selected by the chi-square test; and finally, using the features selected by correlation. The correlation values 0.75 and 0.50 were chosen because 0.75 indicates a strong relationship between the features and 0.50 a moderate one. This yields the strongly and moderately correlated features that help achieve higher accuracy.

3.1. Dataset information

The dataset employed for all cases is discussed first, and the results for the different cases are provided in the following subsections.

The employed dataset is obtained by combining two datasets [Figure 3]: Cleveland, Hungary, Switzerland, and the VA Long Beach ²³ with 303 patient records, and Cardiovascular Disease ²¹ with 1000 patient records.

Figure 3. Instance of employed dataset with visualization of counts.

Figure 4 shows the dataset information, which includes the total number of entries, the number of features recorded per patient, the data types, and the count of NaN values (blank, NA, or missing values).

Figure 4. Information of dataset.

The distribution of healthy and unhealthy patients [Figure 5] indicates that the dataset is imbalanced. To address this, a few modifications are made to the XGBoost algorithm, as stated in Section 2.3. To visualize missing values, a bar chart is plotted using msno, which shows the number of data points in every column; missing values in any column would appear as gaps in the bars. There is no discontinuity in the graph [Figure 6], which indicates that there are no missing values in any column, and the value 1303 shows that every column has 1303 entries in the dataset. The heat map [Figure 7] gives the Pearson correlation coefficients between the target and the features, which range from −1 to +1: +1 indicates a strong positive correlation, −1 a negative correlation, and 0 no correlation. Figure 8 shows the true and false cases allotted for training and testing. If 50% is shown, then the dataset is balanced.

Figure 5. Count of healthy and unhealthy CVD patients in Dataset.

Figure 6. Missing values are demonstrated.

Figure 7. Heat map of dataset.

Figure 8. True and false cases allotted for training and testing in dataset.

3.2. Calculation of Performance metrics for Full Feature Attributes

The performance metrics are listed in Table 6, which includes the confusion matrix, accuracy, precision, recall, and F1 score for the different machine learning models employed. The analysis in this section considers the full feature set, meaning that all 13 features are used to predict the health of a patient. Eight machine learning models were considered for prediction. Of these, random forest outperformed the others in the detection of disease, achieving a high accuracy of 97.77% and a precision, recall, and F1 score of 98% each.

Table 6. Performance metrics for training data for different machine learning models.

Name of the model	Accuracy	Precision	recall	F1 score	Confusion matrix
Linear regression	82	81	81	81	[[207 33] [36 115]]
Lasso regression	79	78	77	77	[[201 39] [44 107]]
Ridge regression	82	81	81	81	[[206 34] [36 115]]
Naïve bayes	80.15	80	80	80	[[417 91] [90 314]]
Random forest	97.77	98	98	98	[[496 12] [9 395]]
Logistic Regression	79.61	79	79	79	[[419 89] [97 307]]
XGboost	87.6	87	88	87	[[440 68] [46 358]]
LightGBM	88	88	89	88	[[443 65] [40 364]]

In addition to accuracy, the model achieved good values for the other metrics, which suggests that the model is robust to real-world data. In the confusion matrix, the true positive (TP) and true negative (TN) predictions were quite good; only 12 and 9 patients were misclassified as false positives and false negatives, respectively. This indicates that the model made comparatively few misclassifications. Figure 9 presents the performance metrics for the full feature set as a chart, giving a clear view of which model provides greater accuracy in detecting the heart condition of a patient.
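The relationship between a confusion matrix and the reported metrics can be made concrete with a short sketch. It assumes the layout [[TN, FP], [FN, TP]] (rows are actual classes, columns are predicted classes) and treats the diseased class as positive; the text does not state the layout, and the function name is hypothetical. Metrics averaged over both classes, as some libraries report, can differ slightly from these single-class values.

```python
def binary_metrics(cm):
    """Compute metrics from cm = [[TN, FP], [FN, TP]] (assumed layout)."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Random forest confusion matrix from Table 6.
rf = binary_metrics([[496, 12], [9, 395]])
```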

Figure 9. Graphical representation of performance metrics for full feature sets.

3.3. Calculation of Performance metrics for correlation 0.75

Table 7 lists the performance metrics for the features selected at a correlation of 0.75, namely chest pain type, exercise angin, and st_slope. It also includes the confusion matrix, accuracy, precision, recall, and F1 score for the different machine learning models employed. The analysis in this section did not consider the full set of features; instead, the three features with a correlation of 0.75 were selected out of the 13 to predict the health of the heart. Eight machine learning models were considered for prediction. Of these, LightGBM outperformed the others in the detection of disease, achieving an accuracy of 77 and precision, recall, and F1 scores of 76, 77, and 76, respectively. As the number of features decreases, it becomes more difficult for the model to learn; the accuracy is lower because only three features are used for training, and the other metric values are correspondingly lower. The confusion matrix shows TP and TN values of 186 and 116, respectively, which are good but not good enough to adopt the models for real-world patient analysis. False positives and false negatives number 54 and 35, respectively, and represent incorrect predictions. Misclassifying healthy patients as unhealthy, and unhealthy patients as healthy, puts the entire family at risk: unhealthy patients are left untreated, and healthy patients undergo unnecessary treatment and financial overhead. Figure 10 presents the performance metrics for this feature set graphically, making them easier to interpret than tables and highlighting which model achieves higher accuracy in identifying heart conditions in patients.

Table 7. Performance metrics by considering features based on correlation 0.75.

Name of the model	Accuracy (%)	Precision (%)	Recall (%)	F1 score (%)	Confusion matrix
Linear regression	77	77	78	77	[[185 55] [33 118]]
Lasso regression	76	75	75	75	[[193 47] [46 105]]
Ridge regression	77	77	78	77	[[185 55] [33 118]]
Naïve bayes	77	77	78	77	[[379 129] [78 326]]
Random forest	77	76	77	76	[[186 54] [35 116]]
Logistic Regression	77	76	76	76	[[402 106] [108 296]]
XGBoost	78	78	78	78	[[406 102] [100 304]]
LightGBM	78	78	78	78	[[384 124] [76 328]]
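
The relationship between a confusion matrix and the other columns of Table 7 can be checked by hand. The sketch below is a minimal illustration, not the authors' code. It assumes that rows of the matrix are actual classes and columns are predicted classes, and that precision, recall, and F1 score are macro-averaged over the two classes. The article states neither convention, and the function name `macro_metrics` is hypothetical.

```python
def macro_metrics(cm):
    """Accuracy and macro-averaged precision, recall, and F1 from a confusion matrix.

    Assumption: cm[i][j] counts samples of actual class i predicted as class j.
    """
    classes = range(len(cm))
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in classes)

    precisions, recalls, f1s = [], [], []
    for k in classes:
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in classes)  # column sum
        actual_k = sum(cm[k])                         # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)

    n = len(cm)
    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Confusion matrix reported for random forest in Table 7.
random_forest_cm = [[186, 54], [35, 116]]
scores = macro_metrics(random_forest_cm)
for name, value in scores.items():
    print(f"{name}: {100 * value:.0f}")
```

Under these assumptions the values round to 77, 76, 77, and 76, which agrees with the random forest row of Table 7.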

Figure 10. Performance metrics by considering feature set of 0.75 correlated values.

3.4. Calculation of Performance metrics for correlation 0.5

Table 8 reports the performance metrics for the features selected at a correlation threshold of 0.50: sex, chest pain, fasting blood sugar, exercise_angin, and st_slope. The analysis in this section does not use the full feature set; out of 13 features, only the five that met the 0.50 correlation threshold were used to predict heart health. Eight machine learning models were considered for prediction, and their performance was assessed using accuracy, precision, recall, F1 score, and the confusion matrix. Of these models, random forest performed best in detecting disease, achieving an accuracy of 80.37 and precision, recall, and F1 scores of 80. As the number of features decreases, it becomes more difficult for the model to learn, so the accuracy and the other metrics are lower when only five features are used for training. The confusion matrix in Table 8 shows TP and TN values of 405 and 328, respectively, which are good but not excellent for adopting the models in real-world patient analysis. False positives and false negatives are 103 and 76, respectively. These incorrect predictions create unnecessary overhead for the affected families. Figure 11 presents these performance metrics as line graphs, giving an at-a-glance overview.

Table 8. Performance metrics by considering features based on correlation 0.50.

Name of the model	Accuracy (%)	Precision (%)	Recall (%)	F1 score (%)	Confusion matrix
Linear regression	80	79	80	79	[[189 51] [28 123]]
Lasso regression	79	77	77	77	[[197 43] [41 110]]
Ridge regression	80	79	80	79	[[190 50] [29 122]]
Naïve bayes	77.3	77	77	77	[[391 117] [90 314]]
Random forest	80.37	80	80	80	[[405 103] [76 328]]
Logistic Regression	77.49	76	76	76	[[200 40] [48 103]]
XGBoost	80	80	80	80	[[408 100] [84 320]]
LightGBM	80	79	79	79	[[409 99] [87 317]]
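
Threshold-based selection of the kind used for Tables 7 and 8 can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure. It assumes that a feature is kept when the absolute Pearson correlation between that feature and the target meets the threshold. The feature names and values are toy data invented for the example, and the helper names `pearson` and `select_by_correlation` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def select_by_correlation(columns, target, threshold):
    """Keep the features whose |correlation with the target| meets the threshold."""
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, target)) >= threshold
    ]


# Toy data (not from the study): eight patients, binary target.
target = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "feature_a": [0, 0, 0, 0, 0, 1, 1, 1],  # strongly related to the target
    "feature_b": [0, 0, 1, 1, 1, 1, 1, 1],  # moderately related
    "feature_c": [0, 1, 0, 1, 0, 1, 0, 1],  # unrelated
}

selected_075 = select_by_correlation(columns, target, 0.75)
selected_050 = select_by_correlation(columns, target, 0.50)
print(selected_075)
print(selected_050)
```

Lowering the threshold from 0.75 to 0.50 admits more features, which mirrors the move from three features in Table 7 to five in Table 8.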

Figure 11. Performance metrics for 0.50 correlated values.

3.5. Calculation of Performance metrics for chi-square

Table 9 reports the performance metrics for the features selected by the chi-square test: age, chest pain, resting_bp, fasting blood sugar, max_heart_beat, exercise_angin, oldpeak, st_slope, and num_vessls_fluro. The analysis in this section does not use the full feature set; out of 13 features, the nine selected by chi-square were used to predict heart health. Eight machine learning models were considered for prediction, and their performance was assessed using accuracy, precision, recall, F1 score, and the confusion matrix. Of these models, random forest performed best in detecting disease, achieving an accuracy, precision, recall, and F1 score of 98. The selected features are robust and facilitate efficient detection. The confusion matrix shows TP and TN values of 495 and 396, respectively, and false positives and false negatives of 13 and 8, respectively. Because there are few wrong predictions, the model gives excellent results in detecting the health of a patient's heart. Figure 12 presents the performance metrics for the chi-square-selected feature set graphically; the chart gives a clearer view than the table and makes it easier to see which model outperforms the other seven machine learning models.

Table 9. Performance metrics by considering features based on chi-square.

Name of the model	Accuracy (%)	Precision (%)	Recall (%)	F1 score (%)	Confusion matrix
Linear regression	82	81	81	81	[[206 34] [36 115]]
Lasso regression	78	77	77	77	[[200 40] [45 106]]
Ridge regression	82	81	81	81	[[206 34] [37 114]]
Naïve bayes	80	80	80	80	[[404 104] [77 327]]
Random forest	98	98	98	98	[[495 13] [8 396]]
Logistic Regression	80	80	80	80	[[415 93] [91 313]]
XGBoost	87	87	87	87	[[440 68] [52 352]]
LightGBM	88	88	88	88	[[442 66] [44 360]]
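
Chi-square-based ranking of features can be sketched as follows. The article does not describe its implementation, so this is only an illustration. It assumes categorical features and computes the Pearson chi-square statistic of independence between each feature and the target from their contingency table, with a larger statistic indicating a stronger association. The data are toy values invented for the example, and the function name `chi_square_statistic` is hypothetical.

```python
from collections import Counter


def chi_square_statistic(feature, target):
    """Pearson chi-square statistic of independence for two categorical lists."""
    n = len(feature)
    observed = Counter(zip(feature, target))
    feature_totals = Counter(feature)
    target_totals = Counter(target)

    statistic = 0.0
    for f_value, f_count in feature_totals.items():
        for t_value, t_count in target_totals.items():
            expected = f_count * t_count / n  # count expected under independence
            diff = observed.get((f_value, t_value), 0) - expected
            statistic += diff * diff / expected
    return statistic


# Toy data (not from the study): 80 patients, 40 per class.
target = [0] * 40 + [1] * 40
features = {
    # 30 of 40 healthy patients have value 0; 30 of 40 diseased have value 1.
    "informative": [0] * 30 + [1] * 10 + [0] * 10 + [1] * 30,
    # Value is split 20/20 in both classes, so it is independent of the target.
    "uninformative": ([0] * 20 + [1] * 20) * 2,
}

chi2_scores = {name: chi_square_statistic(col, target) for name, col in features.items()}
ranking = sorted(chi2_scores, key=chi2_scores.get, reverse=True)
print(chi2_scores)
print(ranking)
```

Keeping the top-ranked features, nine of 13 in this study, then yields the reduced feature set on which the models are trained.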

Figure 12. Performance analysis for chi-square selected feature set.

The current study utilized two datasets, the Cleveland dataset and the CVD dataset. They were combined, after omitting from the Cleveland dataset the feature that is not present in the CVD dataset, to form a set of 1303 patients for predicting CVD risk. With the full feature set, the maximum accuracy was 97.77%, achieved by the random forest algorithm. Beyond accuracy, the F1 score (98%), recall (98%), and precision (98%) were also considered, since these metrics indicate how robust a model is in detection. The feature selection processes considered were correlation-based and chi-square-based. The analysis of the results showed that the chi-square-selected feature list achieved high accuracy, precision, F1 score, and recall. The highest accuracy with chi-square feature selection was 98%, compared with 97.77% for the full feature set.

4. Conclusion

The current study focuses on feature selection and the evaluation of machine learning models to find robust models for predicting CVD risk early. Two datasets were employed: the Cleveland, Hungary, Switzerland, and VA Long Beach dataset ²¹ with data from 303 patients, and the cardiovascular disease dataset ²⁰ with data from 1000 patients. These were combined into a dataset of 1303 patients by dropping the Cleveland feature that is not present in the cardiovascular disease dataset. Three feature selection mechanisms were adopted: correlation-based selection at thresholds of 0.75 and 0.5, and chi-square-based selection. Of these, the chi-square-based feature list achieved the best accuracy: the random forest classifier reached the highest accuracy of 98% when predicting CVD from the chi-square-selected features. The study covers six machine learning models and two regression models, and all eight were trained and evaluated on the features selected by the three feature selection methods. Future work could apply deep learning models to increase accuracy and adopt additional feature selection methods and models to predict CVD at earlier stages.

Ethical Approval

Not Applicable.

Consent to participate

Not Applicable.

Consent to Publish

Not Applicable.

Data availability

The datasets generated and/or analyzed during the current study are available in the Cardiovascular Disease Dataset, Mendeley Data, and the UCI Machine Learning Repository.

Repository name: Combined Heart Patient Data, Mendeley Data, V1.

DOI: 10.17632/v54h5d5pvt.1

The project contains the following underlying data: combined_heart_patient_data.csv. ³⁰

Data are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).

References

Healthline: (accessed on 20 February 2021).

Chicco, Jurman: Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Med. Inform. Decis. Mak. 2020;20:1–16. 10.1186/s12911-020-1023-5

Obasi, Shafiq: Towards comparing and using Machine Learning techniques for detecting and predicting Heart Attack and Diseases. Proceedings of the 2019 IEEE International Conference on Big Data, Big Data 2019, Los Angeles, CA, USA, 9–12 December 2019. pp.2393–2402.

Sharma, Rizvi: Prediction of Heart Disease using Machine Learning Algorithms: A Survey. Int J Recent Innov Trends Comput Commun. 2017;5:99–104.

Amogha, Deshpande: A Review on Behavioural Biometric GAIT Recognition. Gunjan, Zurada, editors. Proceedings of 3rd International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications. Lecture Notes in Networks and Systems. Singapore: Springer; 2023; vol 540. 10.1007/978-981-19-6088-8_9

Goel, Deep, Srivastava: Comparative Analysis of various Techniques for Heart Disease Prediction. Proceedings of the 2019 4th International Conference on Information Systems and Computer Networks, ISCON 2019, Mathura, India, 21–22 November 2019. pp.88–94.

Chen: Heart disease prediction utilizing machine learning techniques. Transactions on Materials, Biotechnology and Life Sciences. 2024;3:35–50. 10.62051/e054hq43

Bolanle, Elizabeth Ali: Chi-Square and Cuckoo Search Based Feature Selection for Heart Disease Prediction. 2024. 10.1109/seb4sdg60871.2024.10630086

Sowmiya: Classification of Cardiovascular Diseases from Magnetic Resonance Imaging using Classifiers. 2024 International Conference on Smart Systems for Electrical, Electronics, Communication and Computer Engineering (ICSSEECC). 2024;528–533.

Chen, Guestrin: XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016.

Islam: Data-Driven Heart Disease Prediction by Ensemble Feature Selection and Machine Learning Techniques. 2022 25th International Conference on Computer and Information Technology (ICCIT). 2022;575–580.

Patro, Padhy, Sah: A Road Map for Classification of Heart Disease Using Machine Learning Classifier. Next Generation of Internet of Things: Proceedings of ICNGIoT 2022. Singapore: Springer Nature Singapore; 2022;687–702.

Oleiwi, AlShemmary, Al-augby: Adaptive features selection technique for efficient heart disease prediction. J Al-Qadisiyah Comput Sci Math. 2023;15(1):1. 10.29304/jqcm.2023.15.1.1137

Murugesan, Kavitha, Jabakumar: Prediction of Heart Disease using Machine Learning Algorithms with Feature Selection Techniques. Cardiometry. 2023;26:778–786. 10.18137/cardiometry.2023.26.778786

Dissanayake, Johar MGM: Heart Disease Diagnostics Using Meta-Learning-Based Hybrid Feature Selection. Appl Comput Intell Soft Comput. 2024;2024(1):8800497. 10.1155/2024/8800497

Akyol, Atila: A study on performance improvement of heart disease prediction by attribute selection methods. Acad Platform-J Eng Sci. 2019;7(2):174–179.

Chaurasia: Novel method of characterization of heart disease prediction using sequential feature selection-based ensemble technique. Biomed. Mater. Devices. 2023;1(2):932–941. 10.1007/s44174-022-00060-x

Doppala, Bhattacharyya: Cardiovascular_Disease_Dataset. Mendeley Data. 2021;V1. 10.17632/dzz48mvjht.1

Janosi: Heart Disease. UCI Machine Learning Repository. 1989. 10.24432/C52P4X

Gelman, Carlin, Stern: Bayesian Data Analysis. 3rd ed. Chapman and Hall/CRC; 2013. 10.1201/b16018

Hastie: The elements of statistical learning: data mining, inference, and prediction. 2009.

Breiman: Random forests. Mach. Learn. 2001;45:5–32. 10.1023/A:1010933404324

LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Proces. Syst. 2017;30.

Wang, Zheng: Integrative modeling of heterogeneous soil salinity using sparse ground samples and remote sensing images. Geoderma. 2023;430:116321. 10.1016/j.geoderma.2022.116321

Cox: The regression analysis of binary sequences. Journal of the Royal Statistical Society Series B: Statistical Methodology. 1958;20(2):215–232. 10.1111/j.2517-6161.1958.tb00292.x

Guyon, Elisseeff: An introduction to variable and feature selection. J. Mach. Learn. Res. 2003;3:1157–1182.

Pedregosa: Scikit-learn: Machine learning in Python. J. Mach. Learn. Res.

Chavan: Combined heart patient data. Mendeley Data. 2026;V1. 10.17632/v54h5d5pvt.1

10.5256/f1000research.196399.r490357

Reviewer response for version 1

Alok Singh Chauhan, Referee

Galgotias University, Greater Noida, Uttar Pradesh, India

Competing interests: No competing interests were disclosed.

15 6 2026

2026

This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Recommendation: approved with reservations

The article presents a machine-learning-based approach for early cardiovascular disease prediction using a combined dataset of 1303 patient records. The study compares several models, including regression-based models, Naïve Bayes, Random Forest, Logistic Regression, XGBoost, and LightGBM, with feature selection based on correlation thresholds and chi-square testing. The reported best performance is achieved by Random Forest using chi-square-selected features, with approximately 98% accuracy.

The work is generally understandable and addresses an important healthcare problem. However, the presentation requires improvement. There are repeated statements, grammatical issues, and some inconsistent reporting of the number of models used. The literature review includes some recent studies, but the references are not consistently formatted, and some citations appear incomplete or unclear. The authors should revise the manuscript for language quality, remove repetition, and ensure that all references are accurate, current, and properly cited.

The study design is partly appropriate, but several technical concerns must be addressed. The authors combined two datasets with different origins, but the harmonization process is not explained in sufficient detail. It is unclear how differences in feature definitions, missing variables, and target-label mapping were handled. The authors also report very high accuracy, which may indicate possible overfitting or data leakage. The paper should clearly describe the train-test split, random seed, cross-validation strategy, preprocessing pipeline, and whether feature selection was performed only on the training set.

The methods and analysis are not yet sufficiently detailed for full replication. Important details such as hyperparameter settings for all models, software versions, class distribution, preprocessing steps, encoding of categorical variables, imputation strategy, and validation procedure should be added. The authors should also provide the exact dataset preparation workflow and clarify how missing values were handled, since the current description is not sufficiently rigorous.

The statistical analysis and interpretation are partly appropriate. Accuracy, precision, recall, and F1-score are useful metrics, but for a medical prediction problem the authors should also report ROC-AUC, sensitivity, specificity, confidence intervals, and preferably cross-validation results. Since the dataset is described as imbalanced, relying mainly on accuracy may be misleading. Statistical comparison between models would strengthen the results.
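
The additional metrics suggested above can be derived from the confusion matrices already reported in the article. The sketch below is a minimal illustration using the random forest matrix from Table 9. It assumes that rows are actual classes and columns are predicted classes, and that the second class is the disease-positive class; the article states neither, and swapping the positive class exchanges the two values. The Wilson score interval is one common choice of confidence interval and is used here as an assumption, with the hypothetical helper name `wilson_interval`.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half_width = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half_width, centre + half_width


# Random forest confusion matrix from Table 9.
# Assumed layout: [[TN, FP], [FN, TP]] with the second class as "disease".
(tn, fp), (fn, tp) = [[495, 13], [8, 396]]

sensitivity = tp / (tp + fn)  # share of diseased patients correctly flagged
specificity = tn / (tn + fp)  # share of healthy patients correctly cleared
sens_low, sens_high = wilson_interval(tp, tp + fn)
spec_low, spec_high = wilson_interval(tn, tn + fp)

print(f"sensitivity: {sensitivity:.4f} (95% CI {sens_low:.4f} to {sens_high:.4f})")
print(f"specificity: {specificity:.4f} (95% CI {spec_low:.4f} to {spec_high:.4f})")
```

Intervals computed this way describe sampling uncertainty on a single test split only; they do not replace cross-validation or external validation.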

The source data appear to be available through Mendeley Data and public repositories, which supports reproducibility. However, the authors should ensure that the exact combined dataset, preprocessing code, model-training scripts, and feature-selection code are openly available.

The conclusions are partly supported by the reported results, but they should be stated more cautiously. The authors should avoid implying direct clinical applicability unless external validation is performed. To make the article scientifically sound, the authors must address the dataset-merging procedure, prevent possible data leakage, add stronger validation such as k-fold cross-validation or external validation, provide complete methodological details, and improve reporting quality. After these revisions, the study may provide a useful comparative analysis of machine-learning models for cardiovascular disease prediction.

Is the work clearly and accurately presented and does it cite the current literature?

Partly

If applicable, is the statistical analysis and its interpretation appropriate?

Partly

Are all the source data underlying the results available to ensure full reproducibility?

Yes

Is the study design appropriate and is the work technically sound?

Partly

Are the conclusions drawn adequately supported by the results?

Partly

Are sufficient details of methods and analysis provided to allow replication by others?

Partly

Reviewer Expertise:

Artificial Intelligence, Machine Learning, Data Mining, Medical Informatics / Healthcare Analytics, Predictive Modeling, Computational Intelligence

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.