<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article" dtd-version="1.2" xml:lang="en">
    <front>
        <journal-meta>
            <journal-id journal-id-type="pmc">F1000Research</journal-id>
            <journal-title-group>
                <journal-title>F1000Research</journal-title>
            </journal-title-group>
            <issn pub-type="epub">2046-1402</issn>
            <publisher>
                <publisher-name>F1000 Research Limited</publisher-name>
                <publisher-loc>London, UK</publisher-loc>
            </publisher>
        </journal-meta>
        <article-meta>
            <article-id pub-id-type="doi">10.12688/f1000research.178061.1</article-id>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Research Article</subject>
                </subj-group>
                <subj-group>
                    <subject>Articles</subject>
                </subj-group>
            </article-categories>
            <title-group>
                <article-title>Evaluation of Feature Attributes for the Prediction of Heart Disease Using Machine Learning</article-title>
                <fn-group content-type="pub-status">
                    <fn>
                        <p>[version 1; peer review: 1 approved with reservations]</p>
                    </fn>
                </fn-group>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Chavan</surname>
                        <given-names>Sneh S</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Review &amp; Editing</role>
                    <uri content-type="orcid">https://orcid.org/0009-0009-0101-2939</uri>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Ajitha</surname>
                        <given-names>S.</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Resources</role>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <xref ref-type="aff" rid="a2">2</xref>
                </contrib>
                <contrib contrib-type="author" corresp="yes">
                    <name>
                        <surname>Tailor</surname>
                        <given-names>RK</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Formal Analysis</role>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <role content-type="http://credit.niso.org/">Validation</role>
                    <uri content-type="orcid">https://orcid.org/0000-0001-6096-458X</uri>
                    <xref ref-type="corresp" rid="c1">a</xref>
                    <xref ref-type="aff" rid="a3">3</xref>
                </contrib>
                <contrib contrib-type="author" corresp="yes">
                    <name>
                        <surname>Chakravarty</surname>
                        <given-names>Shilpi</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Investigation</role>
                    <role content-type="http://credit.niso.org/">Project Administration</role>
                    <role content-type="http://credit.niso.org/">Visualization</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-7138-0181</uri>
                    <xref ref-type="corresp" rid="c2">b</xref>
                    <xref ref-type="aff" rid="a4">4</xref>
                </contrib>
                <aff id="a1">
                    <label>1</label>Assistant Professor, School of Computer Science and Applications, Chhatrapati Shahu Institute of Business Education and Research (CSIBER), Kolhapur, MH, India</aff>
                <aff id="a2">
                    <label>2</label>Department of Computer Applications, M.S. Ramaiah Institute of Technology, Visvesveraya Technological University, Belagavi, Karnataka, 590018, India</aff>
                <aff id="a3">
                    <label>3</label>Director, Chhatrapati Shahu Institute of Business Education and Research (CSIBER), Kolhapur, MH, India</aff>
                <aff id="a4">
                    <label>4</label>Associate Professor, Centre for Distance and Online Education, Manipal University, Jaipur, Rajasthan, India</aff>
            </contrib-group>
            <author-notes>
                <corresp id="c1">
                    <label>a</label>
                    <email xlink:href="mailto:drrktailor@siberindia.edu.in">drrktailor@siberindia.edu.in</email>
                </corresp>
                <corresp id="c2">
                    <label>b</label>
                    <email xlink:href="mailto:shilpi.chakravarty@jaipur.manipal.edu">shilpi.chakravarty@jaipur.manipal.edu</email>
                </corresp>
                <fn fn-type="conflict">
                    <p>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>16</day>
                <month>5</month>
                <year>2026</year>
            </pub-date>
            <pub-date pub-type="collection">
                <year>2026</year>
            </pub-date>
            <volume>15</volume>
            <elocation-id>743</elocation-id>
            <history>
                <date date-type="accepted">
                    <day>25</day>
                    <month>3</month>
                    <year>2026</year>
                </date>
            </history>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2026 Chavan SS et al.</copyright-statement>
                <copyright-year>2026</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <self-uri content-type="pdf" xlink:href="https://f1000research.com/articles/15-743/pdf"/>
            <abstract>
                <sec>
                    <title>Background</title>
                    <p>According to World Health Organization (WHO) reports, cardiovascular diseases (CVDs) cause approximately 20.5 million deaths every year. By 2030, these deaths are expected to reach 24 million, accounting for 31.5% of all deaths worldwide. Early detection of CVD makes timely treatment possible: patients can alter their daily routines and, if required, take medication. According to a WHO study, medication therapy and patient counseling are also necessary to lower the risk of heart attack and stroke by 2025.
                        <sup>
                            <xref ref-type="bibr" rid="ref1">1</xref>,
                            <xref ref-type="bibr" rid="ref2">2</xref>
                        </sup>
                    </p>
                </sec>
                <sec>
                    <title>Methods</title>
                    <p>For the early prediction of CVD, six machine learning methods were employed: the regression model, na&#x00ef;ve Bayes, random forest, logistic regression, XGBoost, and LightGBM. Thirteen features were chosen for training. The models were trained in three ways: with all thirteen features, with features selected by the chi-square test, and with features selected using correlation thresholds of 0.75 and 0.5 between features. The performance metrics considered for evaluating the models were accuracy, F1-score, recall, and precision.</p>
                </sec>
                <sec>
                    <title>Results</title>
                    <p>Random forest achieved the highest accuracy, 99%, when all features were considered. Feature reduction based on correlation was also used for training, and the resulting accuracy was evaluated. The proposed model was implemented in Python.</p>
                </sec>
            </abstract>
            <kwd-group kwd-group-type="author">
                <kwd>Heart disease</kwd>
                <kwd>Preprocess</kwd>
                <kwd>Feature selection</kwd>
                <kwd>Machine learning</kwd>
            </kwd-group>
            <funding-group>
                <funding-statement>The author(s) declared that no grants were involved in supporting this work.</funding-statement>
            </funding-group>
        </article-meta>
    </front>
    <body>
        <sec id="sec4" sec-type="intro">
            <title>1. Introduction</title>
            <p>CVD is currently of great concern in the medical field. It is the most common chronic and fatal disease. Worldwide, a high percentage of people die of CVD, according to the World Heart Report (WHO).
                <sup>
                    <xref ref-type="bibr" rid="ref1">1</xref>
                </sup> It also states that approximately 85% of patients with CVD end up having heart attacks or strokes. A survey conducted by the WHO states that approximately 20 million people die of CVD every year, which accounts for 31% of all deaths globally. This number may increase to 24 million in another five years if early detection and treatment are not performed.
                <sup>
                    <xref ref-type="bibr" rid="ref2">2</xref>
                </sup> A heart attack occurs when a clot of blood, cholesterol, fat, and other substances is deposited in the arteries of the heart. The clot blocks the flow of blood to certain parts of the heart and causes them to stop working. The reasons for heart attacks include obesity, diabetes, a sedentary lifestyle, stress, an unhealthy diet, high blood pressure, and high cholesterol. If blood flowing to or inside the brain clots, a stroke occurs because blood circulation stops.
                <sup>
                    <xref ref-type="bibr" rid="ref3">3</xref>
                </sup> Heart failure occurs when the heart fails to pump blood to all parts of the body.
                <sup>
                    <xref ref-type="bibr" rid="ref4">4</xref>
                </sup> Symptoms of CVD include shortness of breath, variation in heartbeat, dizziness, sweating, nausea, discomfort in the chest area, and swelling of the feet. If symptoms are recognized early and appropriate medication is given, patients can be brought out of danger. Other causes of CVD include obesity, high blood pressure, alcohol intake, lack of physical activity, genetic mutations, and high cholesterol. If the disease is detected early, patients can change their lifestyle, increase physical activity, and avoid alcohol and smoking, which can help reduce the mortality rate.
                <sup>
                    <xref ref-type="bibr" rid="ref5">5</xref>
                </sup>
            </p>
            <p>Current laboratories are equipped to diagnose heart disease using the patient&#x2019;s medical history and reported symptoms. Doctors then analyze the reports generated by the lab to make a final decision. A few studies report that the presence of CVD is predicted accurately in approximately 67% of patients.
                <sup>
                    <xref ref-type="bibr" rid="ref6">6</xref>
                </sup> An automatic system is therefore needed for the accurate prediction of CVD. Recent research on machine learning models helps to improve decision-making, which opens many research opportunities in the health domain,
                <sup>
                    <xref ref-type="bibr" rid="ref7">7</xref>
                </sup> especially in the early detection of CVD and other chronic diseases, which can prevent deaths. Machine learning is used in many healthcare applications, including disease risk detection, tumor detection, and other health-related issues, because its predictive modeling techniques help to overcome current limitations. Because of this advancement, doctors can save time in investigating reports, which can then be used to provide highly accurate medications. Machine learning models include regression and classification phases, and the classification phases are widely employed in the health domain. Supervised machine learning models provide greater accuracy in detecting whether a patient is healthy or unhealthy.
                <sup>
                    <xref ref-type="bibr" rid="ref8">8</xref>
                </sup>
            </p>
            <p>In 2024,
                <sup>
                    <xref ref-type="bibr" rid="ref9">9</xref>
                </sup> the authors proposed machine learning methods to detect heart disease using the Heart 2020 dataset available on Kaggle. They employed a stack of machine learning models, namely Random Forest, Decision Tree, LightGBM, and Logistic Regression, and achieved a highest accuracy of 76.9%. A limitation of the study was that exposure to various datasets was still needed. In another study, the Statlog and Cleveland datasets from the UCI website were used to train and test the machine learning model, which achieved a maximum accuracy of 88.87% on the Cleveland dataset and 88.88% on the Statlog dataset.
                <sup>
                    <xref ref-type="bibr" rid="ref12">12</xref>
                </sup> The same dataset was employed by another researcher, whose novelty lay in feature extraction and classification. Features were selected using PCA and RFE. For classification, bagging, boosting, and ensembling were performed with an ANN. The achieved accuracy was 94.1%.
                <sup>
                    <xref ref-type="bibr" rid="ref13">13</xref>
                </sup> A comparison table was prepared to compare ensemble classifiers with existing machine learning models. The new methods employed for feature extraction were OneR, GA, and correlation. An accuracy of 67% was achieved with SVM and 8.16% with the correlation method combined with hybrid models. The dataset employed was from the Framingham Heart Study.
                <sup>
                    <xref ref-type="bibr" rid="ref14">14</xref>
                </sup> A few adaptive feature selection methods have gained importance in extracting features using RFE methods. The author
                <sup>
                    <xref ref-type="bibr" rid="ref15">15</xref>
                </sup> used these adaptive feature selection methods along with the RFE methods. For classification, SVM, LR, decision tree, and random forest (RF) were employed, and RF achieved a high accuracy of 97.4%. Many other studies carried out on several other datasets also achieved high accuracy with machine learning models: 96.21%
                <sup>
                    <xref ref-type="bibr" rid="ref17">17</xref>
                </sup> and 95.08%.
                <sup>
                    <xref ref-type="bibr" rid="ref16">16</xref>
                </sup> With a few modifications to feature extraction, namely sequential feature selection based on gradient boosting, accuracy increased to 98.78%.
                <sup>
                    <xref ref-type="bibr" rid="ref19">19</xref>
                </sup> Another feature extraction method, Recursive Feature Elimination with cross-validation, proposed by
                <sup>
                    <xref ref-type="bibr" rid="ref18">18</xref>
                </sup> increased existing performance by 14.81%. Later, using the same two machine learning models, SVM and KNN
                <sup>
                    <xref ref-type="bibr" rid="ref10">10</xref>
                </sup> proposed a methodology for feature extraction in 2024. Feature attributes were selected using the chi-square statistic method, and the optimizer employed was cuckoo search optimization. The selected features were then fed to the classifiers. Researchers have also moved to deep learning and hybrid models to predict heart disease; in 2024, one author employed CNN-UMAP and achieved an accuracy of 91.88%. The feature selection techniques included Relief, UMAP, and LDA. Based on this review, the following gaps were identified. First, regarding the dataset, training and testing a model requires patient data. If the number of patients is small, the model may overfit; hence, more patient records are needed to obtain reliable accuracy. Second, feature extraction can be improved by selecting features using various methods. The objective of this study was to generate a dataset with a higher number of patients (1300) compared to the existing 300 patients&#x2019; data and to design and develop an AI-based machine learning model to predict heart disease in patients.</p>
            <p>
The contribution of the current research work starts with the preparation of the dataset: two datasets (VA Long Beach, 303 patients; cardiovascular disease, 1000 patients) were combined, which increased the number of patients to 1303. Feature attributes were selected using correlation thresholds of 0.75 and 0.5, and each resulting feature set was evaluated. The correlation-based feature sets and the features selected by the chi-square test were each used to train and test the machine learning models. For classification, six machine learning methods were employed to evaluate the datasets. In addition to accuracy, other metrics such as F1 score, precision, and recall are tabulated. The machine learning models were tuned to provide good accuracy. To avoid underfitting and overfitting, deltastop and max_depth were tuned in the XGBoost classifiers.</p>
            <sec id="sec5">
                <title>Organization of paper</title>
                <p>
Section 1 introduces heart diseases and their global death rates. It also describes how machine learning models can address this problem in minimum time. Section 2 describes the related work on detecting heart diseases using machine learning. Section 3 describes the proposed methodology for detecting heart disease using machine learning methods. Finally, conclusions on the proposed work and future directions are provided.</p>
            </sec>
        </sec>
        <sec id="sec6">
            <title>2. Methodology</title>
            <p>This section [
                <xref ref-type="fig" rid="f1">
Figure 1</xref>] describes the steps involved in the detection of heart disease. The steps are as follows:
                <list list-type="order">
                    <list-item>
                        <label>1.</label>
                        <p>Collection of the dataset</p>
                    </list-item>
                    <list-item>
                        <label>2.</label>
                        <p>Pre-process the dataset to identify missing values and fill them.</p>
                    </list-item>
                    <list-item>
                        <label>3.</label>
                        <p>Evaluate the feature attributes based on correlation</p>
                    </list-item>
                    <list-item>
                        <label>4.</label>
                        <p>Train machine learning models</p>
                    </list-item>
                    <list-item>
                        <label>5.</label>
                        <p>Evaluation of models</p>
                    </list-item>
                </list>
            </p>
            <fig fig-type="figure" id="f1" orientation="portrait" position="float">
                <label>
Figure 1. </label>
                <caption>
                    <title>Proposed flow diagram.</title>
                </caption>
                <graphic id="gr1" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/196399/22dbab05-52a0-45e0-8046-6c44c7522007_figure1.gif"/>
            </fig>
            <p>The first step in every machine learning workflow is data collection. We built our dataset by combining two sources: the Cleveland, Hungary, Switzerland, and VA Long Beach datasets, and the archive dataset from Kaggle.
                <sup>
                    <xref ref-type="bibr" rid="ref21">21</xref>,
                    <xref ref-type="bibr" rid="ref22">22</xref>
                </sup> The combined dataset contains 1303 patients with a total of 13 features. In the pre-processing step, missing values are handled by calculating the sum of the neighboring rows; this replacement increases the accuracy of the machine learning models. Next, feature selection is performed using the correlation between features (thresholds of 0.75 and 0.5) and using the chi-square test. Machine learning models are then applied to predict CVD. Each model was evaluated after being trained with all features and with the two correlation-based feature sets. The dataset was divided into training and testing groups in a 70:30 ratio. To avoid overfitting and underfitting, the models were tuned.</p>
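            <p>As an illustrative sketch only, and not the authors' implementation, the correlation-based feature reduction and the 70:30 split described above could look as follows in plain Python. It assumes that, for any pair of features whose absolute Pearson correlation exceeds the threshold, the later feature is dropped; this rule, the function names (pearson, drop_correlated, split_70_30), and the toy values are hypothetical and are not taken from the study.</p>
            <code language="python"><![CDATA[
```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined linear correlation
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold):
    """Keep a feature only if its |r| with every kept feature is <= threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def split_70_30(rows, seed=0):
    """Shuffle the rows reproducibly and split them 70:30 into train and test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Toy columns (hypothetical values, not taken from the study's dataset).
toy = {
    "age": [30, 40, 50, 60, 70],
    "trestbps": [100, 120, 140, 160, 180],  # perfectly correlated with age
    "chol": [200, 210, 190, 240, 230],      # moderately correlated with age
}
kept_075 = drop_correlated(toy, 0.75)
kept_050 = drop_correlated(toy, 0.50)  # a lower threshold drops more features
train, test = split_70_30(range(10))
print(kept_075, kept_050, len(train), len(test))
```
]]></code>
            <p>In this sketch a lower threshold removes more features, which is why the 0.5 setting yields a smaller feature set than the 0.75 setting.</p>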
            <sec id="sec7">
                <title>2.1. Details of dataset</title>
                <p>The dataset has 13 features, including one target feature (which indicates healthy and unhealthy states), recorded for every patient. It is a balanced dataset with both categorical and numerical variables, which makes it well suited to developing predictive models and analyzing heart diseases. An attribute with fewer than ten different classes is considered categorical or nominal. Examples include sex, type of chest pain, fasting blood glucose level, and ECG findings. The sex attribute is binary, with &#x2018;1&#x2019; for male and &#x2018;0&#x2019; for female. Chest pain type (cp) is categorized into four distinct classes: typical angina, atypical angina, non-anginal chest pain, and asymptomatic chest pain. Fasting blood sugar (fbs) is also binary, indicating whether the value is more than 120&#x00a0;mg/dL (1 for true, 0 for false).</p>
                <p>Three classes were used to classify the results of the resting electrocardiogram (restecg): normal, ST-T wave changes, and definite LVH. Another binary variable, exercise-induced angina (exang), takes a value of 1 to indicate the presence of chest pain during exertion and a value of 0 to indicate its absence. There are three types of slope for the peak exercise ST segment: downslope, flat, and upslope. The number of major vessels visualized by fluoroscopy (ca) ranges from 0 to 3, and the thalassemia attribute (thal) categorizes the heart status as normal, fixed defect, or reversible defect.</p>
                <p>
The numerical features provide continuous data, which, unlike nominal features, is valuable for fine-grained analysis. These variables include age, resting systolic blood pressure in millimeters of mercury (trestbps), serum cholesterol in milligrams per deciliter (chol), maximum achieved heart rate (thalach), and ST segment depression compared to rest (oldpeak). They record specific clinical parameters that are crucial for assessing cardiovascular function.</p>
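                <p>The rule stated above, that an attribute with fewer than ten distinct classes is treated as categorical, can be sketched in a few lines of Python. The helper name feature_kind and the sample values are hypothetical and serve only to illustrate the rule.</p>
                <code language="python"><![CDATA[
```python
def feature_kind(values, max_classes=10):
    """Label a column 'categorical' if it has fewer than max_classes distinct values."""
    return "categorical" if len(set(values)) < max_classes else "numerical"


# Hypothetical sample columns, not taken from the study's dataset.
sex_column = [0, 1, 1, 0, 1]          # two distinct values
age_column = list(range(30, 78))      # 48 distinct values

print(feature_kind(sex_column), feature_kind(age_column))
```
]]></code>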
                <p>It is also critical to recognize that the target attribute has been transformed into a binary variable from its original quantification into five classes that represent varying degrees of heart disease risk. This simplifies the problem by transforming it into a classification problem, with the goal of ascertaining whether a person has heart disease [
                    <xref ref-type="table" rid="T1">
Table 1</xref>].</p>
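                <p>The binarization of the target can be illustrated with a minimal Python sketch. It assumes the usual UCI heart disease convention, in which the original grade 0 means no disease and grades 1 to 4 indicate increasing severity; the function name binarize_target is hypothetical, and the authors' exact mapping is not stated in the text.</p>
                <code language="python"><![CDATA[
```python
def binarize_target(grade):
    """Map an original 0-4 grade to 0 (no disease) or 1 (disease present)."""
    if grade not in range(5):
        raise ValueError(f"expected a grade between 0 and 4, got {grade!r}")
    # Assumption: grade 0 is healthy; any grade from 1 to 4 counts as disease.
    return 0 if grade == 0 else 1


original_grades = [0, 1, 2, 3, 4]
binary_targets = [binarize_target(g) for g in original_grades]
print(binary_targets)
```
]]></code>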
                <table-wrap id="T1" orientation="portrait" position="float">
                    <label>
Table 1. </label>
                    <caption>
                        <title>Feature attributes and their ranges.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">
Name of features</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">
Type of feature</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">
Range of features</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Age</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Integer type</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">30 to 77</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Sex</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Categorical</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0 = female, 1 = male</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Cp</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Integer type: chest pain type</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">1 to 4
                                    <break/>

                                    <p>

                                        <list list-type="order">
                                            <list-item>
                                                <label>1-</label>
                                                <p>Typical angina</p>
                                            </list-item>
                                            <list-item>
                                                <label>2-</label>
                                                <p>Atypical angina</p>
                                            </list-item>
                                            <list-item>
                                                <label>3-</label>
                                                <p>Non-anginal pain</p>
                                            </list-item>
                                            <list-item>
                                                <label>4-</label>
                                                <p>Asymptomatic</p>
                                            </list-item>
                                        </list>
                                    </p>
</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Trestbps</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Integer type: resting blood pressure on admission to hospital (mm Hg)</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">94 to 200</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Chol</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Integer type: serum cholesterol level (mg/dL)</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">126 to 564</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">fbs</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Logical type: fasting blood sugar level</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0: value &lt;120&#x00a0;mg/dL
                                    <break/>1: value &gt;120&#x00a0;mg/dL</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Restecg</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Categorical: resting electrocardiographic result</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <p>

                                        <list list-type="order">
                                            <list-item>
                                                <label>1-</label>
                                                <p>Normal</p>
                                            </list-item>
                                            <list-item>
                                                <label>2-</label>
                                                <p>ST-T wave abnormality</p>
                                            </list-item>
                                            <list-item>
                                                <label>3-</label>
                                                <p>Definite left ventricular hypertrophy</p>
                                            </list-item>
                                        </list>
                                    </p>
</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Thalach</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Integer type: maximum heart rate achieved</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">71 to 202</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Exang</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Categorical: exercise-induced angina</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">1-Yes
                                    <break/>0-No</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Oldpeak</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Decimal type: ST depression induced by exercise relative to rest</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0 to 6.2</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Slope</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Integer type: slope of the peak exercise ST segment</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">
                                    <p>

                                        <list list-type="order">
                                            <list-item>
                                                <label>1-</label>
                                                <p>Upsloping</p>
                                            </list-item>
                                            <list-item>
                                                <label>2-</label>
                                                <p>Flat</p>
                                            </list-item>
                                            <list-item>
                                                <label>3-</label>
                                                <p>Downsloping</p>
                                            </list-item>
                                        </list>
                                    </p>
</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Ca</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Integer type: number of major vessels coloured by fluoroscopy</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Ranges from 0 to 3</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Target</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Text format</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Yes and No (disease present or not)</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
                <p>To increase the number of patients in the dataset, two datasets were combined, namely
                    <list list-type="order">
                        <list-item>
                            <label>1.</label>
                            <p>Cleveland, Hungary, Switzerland, and the VA Long Beach
                                <sup>
                                    <xref ref-type="bibr" rid="ref22">22</xref>
                                </sup> &#x2013; 303 patients</p>
                        </list-item>
                        <list-item>
                            <label>2.</label>
                            <p>Cardiovascular Disease
                                <sup>
                                    <xref ref-type="bibr" rid="ref21">21</xref>
                                </sup> &#x2013; 1000 patients</p>
                        </list-item>
                    </list>
                </p>
                <p>The two datasets share almost the same features, but the features are named differently. 
                    <xref ref-type="table" rid="T2">
Table 2</xref> shows the naming conventions. The thallium feature is not present in the Cardiovascular Disease dataset, so it was dropped and the remaining features were retained. The statistical summary of each attribute, including the minimum, maximum, mean, standard deviation, and the 25th, 50th, and 75th percentiles, is displayed in 
                    <xref ref-type="table" rid="T3">
Table 3</xref>. Various machine learning models were then trained on the combined dataset to identify the classifier that is most effective in detecting CVD.</p>
                <table-wrap id="T2" orientation="portrait" position="float">
                    <label>
Table 2. </label>
                    <caption>
                        <title>Correspondence of feature attributes between the two datasets.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">
Dataset 1: Cardiovascular disease</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Dataset 2: Cleveland</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">
Details</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Age</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Age</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Patient age is present in both datasets</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Gender</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Sex</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Encoded as 1 for male and 0 for female</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Chestpain</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Chest pain type</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">4 types of chest pain</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">restingBp</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">BP</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">It is resting BP (mm Hg)</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Serumcholestrol</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Cholesterol</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Cholesterol level (mg/dL)</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Fastingbloodsugar</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">FBS over 120</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Fasting blood sugar level: 1 if FBS&#x00a0;&gt;&#x00a0;120 mg/dL
                                    <break/>0 otherwise</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Restingrelectro</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">EKG results</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Resting electrocardiogram (ST-T wave) result</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Exerciseangia</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Exercise angina</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Presence of angina or not
                                    <break/>1-yes
                                    <break/>0-no</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Oldpeak</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">ST depression</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Both features give the ST-segment depression induced by exercise</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Slope</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Slope of ST</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Peak exercise (up, flat and down slope)</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Noofmajorvessels</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Number of vessels fluro</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Fluoroscopy type (0&#x2013;4)</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Not present</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Thallium</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Thallium scan result</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Target</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Heart Disease</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Disease presence or not
                                    <break/>1-Yes
                                    <break/>0-No</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
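                <p>The renaming implied by Table 2 can be sketched in a few lines of Python. This is only an illustrative sketch and not the code used in the study: the dictionary name CLEVELAND_TO_CVD and the helper functions harmonize and combine are hypothetical, each record is assumed to be a plain dictionary keyed by the column names listed in Table 2, and the thallium feature is dropped because it has no counterpart in the Cardiovascular disease dataset.</p>
                <preformat preformat-type="code">
```python
# Hypothetical mapping from the Dataset 2 (Cleveland) column names to the
# Dataset 1 (Cardiovascular disease) names, following Table 2.
CLEVELAND_TO_CVD = {
    "Age": "Age",
    "Sex": "Gender",
    "Chest pain type": "Chestpain",
    "BP": "restingBp",
    "Cholesterol": "Serumcholestrol",
    "FBS over 120": "Fastingbloodsugar",
    "EKG results": "Restingrelectro",
    "Exercise angina": "Exerciseangia",
    "ST depression": "Oldpeak",
    "Slope of ST": "Slope",
    "Number of vessels fluro": "Noofmajorvessels",
    "Heart Disease": "Target",
}


def harmonize(record):
    """Rename the keys of one Cleveland record; unmapped keys (e.g. Thallium) are dropped."""
    return {CLEVELAND_TO_CVD[key]: value
            for key, value in record.items()
            if key in CLEVELAND_TO_CVD}


def combine(cvd_rows, cleveland_rows):
    """Return a single list of records that all use the Dataset 1 column names."""
    return list(cvd_rows) + [harmonize(row) for row in cleveland_rows]
```
</preformat>
                <p>Calling combine on the two lists of records would yield one list in which every record uses the same feature names, which is the form a classifier needs for training.</p>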
                <table-wrap id="T3" orientation="portrait" position="float">
                    <label>
Table 3. </label>
                    <caption>
                        <title>Dataset statistical information.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="5" rowspan="1" valign="top">a) Integer type feature attributes details</th>
                            </tr>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">Feature/measure</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Age</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Resting bps</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Chol</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Oldpeak</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Mean</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">53</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">132.00</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">217</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.94</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Std</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">9.313</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">18.27</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">95</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">1.09</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Min</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">28</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">&#x2212;2.6</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">25%</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">47</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">120</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">197</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">50%</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">54</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">130</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">233</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0.6</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">75%</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">60</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">140</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">271</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">1.6</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Max</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">77</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">200</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">603</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">6.2</td>
                            </tr>
                        </tbody>
                    </table>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="18" rowspan="1" valign="top">b) Categorical type feature attributes details</th>
                            </tr>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">Feature/measure</th>
                                <th align="left" colspan="2" rowspan="1" valign="top">Sex</th>
                                <th align="left" colspan="4" rowspan="1" valign="top">cp</th>
                                <th align="left" colspan="2" rowspan="1" valign="top">fbs</th>
                                <th align="left" colspan="3" rowspan="1" valign="top">restingecg</th>
                                <th align="left" colspan="2" rowspan="1" valign="top">exang</th>
                                <th align="left" colspan="4" rowspan="1" valign="top">slope</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Label</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">0</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">1</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">1</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">2</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">3</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">4</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">0</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">1</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">0</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">1</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">2</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">0</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">1</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">0</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">1</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">2</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">3</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Percentage of occurrence (%)</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">28</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">72</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">6</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">19</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">27.85</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">47.2</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">89.5</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">10.5</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">62</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">3</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">35</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">69.9</td>
                                <td align="center" colspan="1" rowspan="1" valign="top">30.1</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">1</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">44</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">48</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">7</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Missing values</td>
                                <td align="center" colspan="2" rowspan="1" valign="top">Zero</td>
                                <td align="center" colspan="4" rowspan="1" valign="top">Zero</td>
                                <td align="center" colspan="2" rowspan="1" valign="top">Zero</td>
                                <td align="center" colspan="3" rowspan="1" valign="top">Zero</td>
                                <td align="center" colspan="2" rowspan="1" valign="top">Zero</td>
                                <td align="center" colspan="4" rowspan="1" valign="top">Zero</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
            </sec>
            <sec id="sec8">
                <title>2.2. Pre-processing of dataset</title>
                <p>A value is referred to as missing or incomplete when no value is assigned to a particular variable in an observation. Missing values can arise from different circumstances, such as a respondent failing to answer some questions, sensor failure, data loss during transfer, a disrupted network connection, or an undefined mathematical operation such as division by zero. In datasets [
                    <xref ref-type="fig" rid="f2">
Figure 2</xref>], missing values can be indicated by spaces, hyphens, or other marks that distinguish them from regular values.</p>
                <fig fig-type="figure" id="f2" orientation="portrait" position="float">
                    <label>
Figure 2. </label>
                    <caption>
                        <title>Visualization of features in dataset.</title>
                    </caption>
                    <graphic id="gr2" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/196399/22dbab05-52a0-45e0-8046-6c44c7522007_figure2.gif"/>
                </fig>
                <p>Missing values may or may not affect the statistical validity of the outcome. Although analysis can proceed with incomplete data, the results may lack robustness or accuracy because of the omitted information. Even if each individual variable has only a small percentage of missing data, the cumulative amount across the dataset may be significant and can influence the analysis results. It is worthwhile to impute the missing values rather than delete the affected observations, because observations with missing values can still be informative. The strategies include:
                    <list list-type="bullet">
                        <list-item>
                            <label>&#x2022;</label>
                            <p>It is preferable to replace a missing value with the mean of the variable of interest to avoid bias.</p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>The Cleveland heart dataset contains missing values for the nominal attributes thal (thallium scan result) and ca (the number of major vessels). These were handled as follows.</p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>The ca attribute had four missing values, which were replaced by its most frequent value, zero, which occurred 176 times in 299 records.</p>
                        </list-item>
                    </list>
                </p>
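                <p>The two imputation strategies described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names impute_with_mode and impute_with_mean are hypothetical, and missing entries are assumed to be represented by None.</p>
                <preformat preformat-type="code">
```python
from collections import Counter


def impute_with_mode(values, missing=None):
    """Replace missing entries with the most frequent observed value."""
    observed = [v for v in values if v is not missing]
    if not observed:
        return list(values)  # nothing to learn the mode from
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is missing else v for v in values]


def impute_with_mean(values, missing=None):
    """Replace missing entries with the mean of the observed values."""
    observed = [v for v in values if v is not missing]
    if not observed:
        return list(values)  # nothing to average
    mean = sum(observed) / len(observed)
    return [mean if v is missing else v for v in values]
```
</preformat>
                <p>Mode imputation suits nominal attributes such as ca, whereas mean imputation is only meaningful for continuous attributes.</p>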
            </sec>
            <sec id="sec9">
                <title>2.3. Classification by machine learning models</title>
                <p>Earlier studies used multiple supervised machine learning models on a single dataset to predict CVD. Six machine learning models are discussed in this section to predict patients who may be at risk of CVD.</p>
                <p>Regression models estimate the target variable on a continuum rather than grouping the outcomes into categories. Linear regression is the basic form of regression analysis; it assumes linearity and attempts to minimize the error between the actual and predicted values. Other types include polynomial and ridge regression, which are used for more complex relationships or when regularization is required.
                    <sup>
                        <xref ref-type="bibr" rid="ref22">22</xref>,
                        <xref ref-type="bibr" rid="ref23">23</xref>
                    </sup>
                </p>
                <p>Na&#x00ef;ve Bayes is a probabilistic classifier that relies on the assumption of feature independence and is derived from Bayes&#x2019; theorem. This algorithm is computationally efficient and is widely applied to text categorization, spam detection, and sentiment analysis. Despite its simplifying assumption, it can work well in various real-life situations, particularly when dealing with large datasets. Based on the input features, it assigns an instance to the class with the highest posterior probability.
                    <sup>
                        <xref ref-type="bibr" rid="ref24">24</xref>
                    </sup>
                </p>
                <p>Random Forest is an ensemble method composed of many decision trees, in which the bagging technique is adopted for increased accuracy and stability. A subset of the data is used to train each tree, and the final decision (majority class for classification or mean for regression) is made by aggregating the results of all the trees. It is less prone to overfitting than individual decision trees and is flexible for tasks involving structured data.
                    <sup>
                        <xref ref-type="bibr" rid="ref25">25</xref>
                    </sup>
                </p>
                <p>Logistic regression is a classification algorithm that predicts probabilities by applying a sigmoid function, which is appropriate when the output is binary. Although closely related to linear regression, it transforms the results into the range of [0, 1]. It has been applied in disease prediction, email classification, and customer churn analysis because of its simplicity and interpretability.
                    <sup>
                        <xref ref-type="bibr" rid="ref27">27</xref>
                    </sup>
                </p>
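                <p>The sigmoid transformation mentioned above can be illustrated with a short sketch. It assumes a model whose weights and bias have already been fitted; the function names sigmoid, predict_proba, and predict, and the default decision threshold of 0.5, are illustrative choices rather than details taken from the study.</p>
                <preformat preformat-type="code">
```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative scores, use an equivalent form that avoids overflow.
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


def predict(features, weights, bias, threshold=0.5):
    """Return 1 (disease) when the probability reaches the threshold, else 0."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0
```
</preformat>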
                <p>The gradient-boosting framework XGBoost is popular and often wins machine-learning competitions because it is built for high performance. It iteratively constructs decision trees, with each new tree correcting the mistakes made in prior steps, and employs shrinkage to curb overfitting. It performs well when dealing with missing values and large datasets.
                    <sup>
                        <xref ref-type="bibr" rid="ref14">14</xref>
                    </sup> The following hyperparameters were adjusted to avoid under- and overfitting:
                    <list list-type="bullet">
                        <list-item>
                            <label>&#x2022;</label>
                            <p>The maximum depth was set to five, so no tree grows beyond level 5. Because deeper trees tend to overfit, the depth is restricted to five.
                                <sup>
                                    <xref ref-type="bibr" rid="ref14">14</xref>
                                </sup>
                            </p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>The delta step was set to 0.1. When the dataset is imbalanced, the model learns very slowly; a small delta step limits each weight update and reduces overfitting.
                                <sup>
                                    <xref ref-type="bibr" rid="ref25">25</xref>
                                </sup>
                            </p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>Gamma was set to 0.6. This regularization term controls the complexity of the model as new trees are added; a value of 0.6 gave good results.
                                <sup>
                                    <xref ref-type="bibr" rid="ref14">14</xref>
                                </sup>
                            </p>
                        </list-item>
                    </list>
                </p>
                <p>These modifications help the model cope with the imbalanced dataset and avoid overfitting.</p>
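                <p>For reference, the three settings listed above could be collected in a parameter dictionary such as the one below. The values are those quoted in the text; the keyword names max_depth, max_delta_step, and gamma are XGBoost parameters, but mapping the &#x201C;delta step&#x201D; in the text to max_delta_step is an assumption, and all other hyperparameters are left at their defaults.</p>
                <preformat preformat-type="code">
```python
# Values quoted in the text; the mapping of "delta step" to
# max_delta_step is an assumption made for this sketch.
xgb_params = {
    "max_depth": 5,         # no tree grows beyond level 5
    "max_delta_step": 0.1,  # caps the size of each leaf-weight update
    "gamma": 0.6,           # minimum loss reduction required to split a leaf
}

# With the xgboost package installed, the dictionary could be used as
#     model = xgboost.XGBClassifier(**xgb_params)
#     model.fit(X_train, y_train)   # X_train, y_train are hypothetical names
```
</preformat>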
                <p>LightGBM is a gradient boosting framework based on speed-optimized decision trees that supports multiclass classification and regression problems. It uses a histogram-based algorithm to build the decision trees and is memory-efficient. Techniques such as leaf-wise tree growth make it faster than XGBoost for a variety of tasks, including recommendation, click prediction, and ranking.
                    <sup>
                        <xref ref-type="bibr" rid="ref26">26</xref>
                    </sup>
                </p>
            </sec>
            <sec id="sec10">
                <title>2.4. Features evaluation</title>
                <p>To select important features for improving classification, correlation-based feature selection (with two thresholds, 0.75 and 0.5) and chi-square feature selection are employed in the current research. Correlation describes the strength of the relationship between two variables. This method determines which features are individually most effective in predicting the target. The features most highly correlated with the target [
                    <xref ref-type="table" rid="T4">
Table 4</xref>] were collected using two thresholds, 75% and 50%. These features have low redundancy and are highly relevant to the target classes.
                    <sup>
                        <xref ref-type="bibr" rid="ref28">28</xref>
                    </sup>
                </p>
                <table-wrap id="T4" orientation="portrait" position="float">
                    <label>
Table 4. </label>
                    <caption>
                        <title>Features highly correlated with the target values.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">Correlation value</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Feature name</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Feature number</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="3" valign="top">75%</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Chest pain type</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">3</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Exercise_angin</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">9</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">St_slope</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">11</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="5" valign="top">50%</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Sex</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">2</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Chest pain type</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">3</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Fbs</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">6</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Exercise_angin</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">9</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">St_slope</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">11</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
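                <p>The thresholding behind Table 4 can be sketched as follows. This is a minimal illustration, assuming that a feature is kept when the absolute value of its Pearson correlation with the target meets the chosen threshold; the function names and the toy values are hypothetical and are not taken from the study's dataset.</p>
                <preformat preformat-type="code"><![CDATA[
```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def select_by_correlation(columns, target, threshold):
    """Keep feature names whose |r| with the target meets the threshold."""
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, target)) >= threshold
    ]


# Hypothetical toy data, not taken from the study's dataset.
toy_columns = {
    "chest_pain_type": [1, 2, 3, 4, 4, 1],
    "sex": [0, 0, 1, 1, 0, 0],
    "resting_bp": [120, 130, 125, 128, 122, 131],
}
toy_target = [0, 0, 1, 1, 1, 0]

strong = select_by_correlation(toy_columns, toy_target, 0.75)
moderate = select_by_correlation(toy_columns, toy_target, 0.50)
```
]]></preformat>
                <p>Under this assumption, lowering the threshold can only add features, which is consistent with the 50% set in Table 4 containing every feature of the 75% set.</p>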
                <p>Another feature selection method, chi-square attribute evaluation, is a filter method that ranks the features in the order of their computed chi-square statistics
                    <sup>
                        <xref ref-type="bibr" rid="ref29">29</xref>
                    </sup>. It computes the score of every feature against the target values and selects the top ten features. The attributes selected by the chi-square test [
                    <xref ref-type="table" rid="T5">
Table 5</xref>] are obtained by eliminating the two lowest-ranked features, and the machine learning algorithms are then trained on the chi-square-suggested features.</p>
                <table-wrap id="T5" orientation="portrait" position="float">
                    <label>
Table 5. </label>
                    <caption>
                        <title>Selected features from chi-square.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">Feature number</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Feature name</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Chi-square score</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">1</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Age</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">134.75</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">3</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Chest pain type</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">56.027</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">4</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Resting bp</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">53.44</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">5</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Cholesterol</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">913.06</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">6</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Fasting blood sugar</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">43.41</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">8</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Max heart rate</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">1656.44</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">9</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Exercise angin</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">111.004</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">10</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">Old peak</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">524.05</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">11</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">St slope</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">51.71</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">12</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">num of vessels fluro</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">32.40</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
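                <p>A chi-square score and top-k selection can be sketched as follows. This is a minimal illustration, assuming non-negative feature values and a score computed from per-class sums of each feature against their expected share, in the manner of scikit-learn's <monospace>chi2</monospace> scorer (an assumption; the text does not name the implementation). The function names and toy rows are hypothetical.</p>
                <preformat preformat-type="code"><![CDATA[
```python
def chi2_scores(rows, labels):
    """Chi-square score per feature column for non-negative feature values."""
    n_samples = len(rows)
    n_features = len(rows[0])
    classes = sorted(set(labels))
    scores = []
    for j in range(n_features):
        feature_total = sum(row[j] for row in rows)
        score = 0.0
        for cls in classes:
            class_prob = labels.count(cls) / n_samples
            # Observed: sum of this feature over the samples of one class.
            observed = sum(row[j] for row, y in zip(rows, labels) if y == cls)
            # Expected: the class's share of the feature total under independence.
            expected = class_prob * feature_total
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def top_k(scores, k):
    """Indices of the k highest-scoring features, best first."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Hypothetical toy data: feature 0 follows the label, feature 1 does not.
toy_rows = [[1, 1], [1, 0], [0, 1], [0, 0]]
toy_labels = [1, 1, 0, 0]

toy_scores = chi2_scores(toy_rows, toy_labels)
best = top_k(toy_scores, 1)
```
]]></preformat>
                <p>Because a score of this kind grows with the magnitude of the feature values, features measured on large scales can receive large scores.</p>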
            </sec>
        </sec>
        <sec id="sec11" sec-type="results">
            <title>3. Results</title>
            <p>This section provides an overview of the results obtained by using the proposed model. The implementation starts with a collection of datasets. We combined the two datasets
                <sup>
                    <xref ref-type="bibr" rid="ref21">21</xref>,
                    <xref ref-type="bibr" rid="ref22">22</xref>
                </sup> to form a combined dataset of 1303 patients. The dataset consists of 13 features, including a target feature that distinguishes healthy from unhealthy individuals, and is well suited to developing predictive models and analyzing heart disease. It includes categorical variables, such as sex, chest pain type, fasting blood glucose level, ECG findings, exercise-induced angina, and the slope of the peak exercise ST segment. Numerical features, such as age, resting systolic blood pressure, serum cholesterol, maximum achieved heart rate, and ST segment depression compared to rest, provide continuous data for fine-grained analysis.</p>
            <p>The target attribute, initially quantified into five classes, was converted into a binary variable, making it easier to determine whether a person has heart disease. Using these features, the models are trained and evaluated on three different feature sets drawn from the 13 features. First, all models are trained using the full 13 features; second, all models are trained using the features selected by the chi-square test; finally, the models are trained using the features selected by correlation. Correlation thresholds of 0.75 and 0.50 were chosen because 0.75 indicates a strong relationship between the features, whereas 0.50 indicates a moderate relationship. These thresholds yield the strongly and moderately correlated features that help achieve higher accuracy.</p>
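            <p>The conversion of the five-class target into a binary variable can be sketched as follows. This is a minimal illustration, assuming that class 0 denotes a healthy individual and classes 1 to 4 denote increasing degrees of disease; that coding and the function name are assumptions, since the text does not state the mapping.</p>
            <preformat preformat-type="code"><![CDATA[
```python
def binarize_target(values):
    """Map a graded label (assumed 0 = healthy, 1-4 = disease) to 0 or 1."""
    return [1 if v > 0 else 0 for v in values]


# Hypothetical labels, not taken from the study's dataset.
toy_labels = [0, 2, 1, 0, 4, 3]
binary_labels = binarize_target(toy_labels)
```
]]></preformat>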
            <sec id="sec12">
                <title>3.1. Dataset information</title>
                <p>The dataset employed for all cases is discussed first, and the results for the different cases are provided in the following subsections.</p>
                <p>An instance of the employed dataset [
                    <xref ref-type="fig" rid="f3">
Figure 3</xref>] is obtained by combining two datasets, namely the Cleveland, Hungary, Switzerland, and VA Long Beach dataset
                    <sup>
                        <xref ref-type="bibr" rid="ref23">23</xref>
                    </sup> with 303 patient records and the Cardiovascular Disease dataset
                    <sup>
                        <xref ref-type="bibr" rid="ref21">21</xref>
                    </sup> with 1000 patient records.</p>
                <fig fig-type="figure" id="f3" orientation="portrait" position="float">
                    <label>
Figure 3. </label>
                    <caption>
                        <title>Instance of employed dataset with visualization of counts.</title>
                    </caption>
                    <graphic id="gr3" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/196399/22dbab05-52a0-45e0-8046-6c44c7522007_figure3.gif"/>
                </fig>
                <p>The dataset information [
                    <xref ref-type="fig" rid="f4">
Figure 4</xref>] includes the total number of entries, the number of features recorded per patient, the data type of each feature, and the count of NaN values (blank, NA, or missing values).</p>
                <fig fig-type="figure" id="f4" orientation="portrait" position="float">
                    <label>
Figure 4. </label>
                    <caption>
                        <title>Information of dataset.</title>
                    </caption>
                    <graphic id="gr4" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/196399/22dbab05-52a0-45e0-8046-6c44c7522007_figure4.gif"/>
                </fig>
                <p>
The distribution of healthy and unhealthy patients [
                    <xref ref-type="fig" rid="f5">
Figure 5</xref>] indicates that the dataset is imbalanced. To address this, a few modifications are made to the XGboost algorithm, as stated in Section 3.3. To visualize missing values, a bar chart is plotted using msno, which shows the number of data points present in every column; if any column contains missing values, gaps appear in the bar chart. There is no discontinuity in the graph [
                    <xref ref-type="fig" rid="f6">
Figure 6</xref>], which clearly indicates that there are no missing values in any column, and the value 1303 indicates that every column has 1303 entries in the dataset. The heat map [
                    <xref ref-type="fig" rid="f7">
Figure 7</xref>] for the dataset gives the Pearson correlation coefficients, which range from &#x2212;1 to +1: +1 indicates a strong positive correlation, &#x2212;1 indicates a strong negative correlation, and 0 means no correlation between the target and the features. The true and false cases allotted for training and testing are also shown [
                    <xref ref-type="fig" rid="f8">
Figure 8</xref>]; if each class accounts for 50%, then the dataset is balanced.</p>
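                <p>The two checks described above, counting missing values per column and measuring the class balance, can be sketched without any plotting library. This is a minimal illustration; the record layout, the function names, and the toy values are hypothetical and are not taken from the study's dataset.</p>
                <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def is_missing(value):
    """Treat None, empty strings, and float NaN as missing."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def missing_counts(records):
    """Number of missing entries per column across a list of dict records."""
    counts = Counter()
    for record in records:
        for column, value in record.items():
            counts[column] += 1 if is_missing(value) else 0
    return dict(counts)


def class_shares(labels):
    """Fraction of samples per class; 0.5 each means a balanced binary set."""
    totals = Counter(labels)
    return {cls: n / len(labels) for cls, n in totals.items()}


# Hypothetical records with two deliberately missing entries.
toy_records = [
    {"age": 63, "cholesterol": 233.0, "target": 1},
    {"age": 41, "cholesterol": float("nan"), "target": 0},
    {"age": None, "cholesterol": 250.0, "target": 0},
]

gaps = missing_counts(toy_records)
shares = class_shares([r["target"] for r in toy_records])
```
]]></preformat>
                <p>A column with a count of zero corresponds to a bar without gaps in the missing-value chart, and class shares far from 0.5 correspond to an imbalanced dataset.</p>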
                <fig fig-type="figure" id="f5" orientation="portrait" position="float">
                    <label>
Figure 5. </label>
                    <caption>
                        <title>Count of healthy and unhealthy CVD patients in Dataset.</title>
                    </caption>
                    <graphic id="gr5" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/196399/22dbab05-52a0-45e0-8046-6c44c7522007_figure5.gif"/>
                </fig>
                <fig fig-type="figure" id="f6" orientation="portrait" position="float">
                    <label>
Figure 6. </label>
                    <caption>
                        <title>Missing values are demonstrated.</title>
                    </caption>
                    <graphic id="gr6" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/196399/22dbab05-52a0-45e0-8046-6c44c7522007_figure6.gif"/>
                </fig>
                <fig fig-type="figure" id="f7" orientation="portrait" position="float">
                    <label>
Figure 7. </label>
                    <caption>
                        <title>Heat map of dataset.</title>
                    </caption>
                    <graphic id="gr7" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/196399/22dbab05-52a0-45e0-8046-6c44c7522007_figure7.gif"/>
                </fig>
                <fig fig-type="figure" id="f8" orientation="portrait" position="float">
                    <label>
Figure 8. </label>
                    <caption>
                        <title>True and false cases allotted for training and testing in dataset.</title>
                    </caption>
                    <graphic id="gr8" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/196399/22dbab05-52a0-45e0-8046-6c44c7522007_figure8.gif"/>
                </fig>
            </sec>
            <sec id="sec13">
                <title>3.2. Calculation of Performance metrics for Full Feature Attributes</title>
                <p>The performance metrics are listed in 
                    <xref ref-type="table" rid="T6">
Table 6</xref>, which includes the confusion matrix, accuracy, precision, recall, and F1 score for the different machine learning models employed. The analysis in this section considers the full set of features, meaning that all 13 features are used to predict the health of a patient. Eight machine learning models were considered for prediction. Of these models, random forest performed best in detecting the disease: it achieved an accuracy of 97.77%, with a precision, recall, and F1 score of 98%.</p>
                <table-wrap id="T6" orientation="portrait" position="float">
                    <label>
Table 6. </label>
                    <caption>
                        <title>Performance metrics for training data for different machine learning models.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">Name of the model</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Accuracy</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Precision</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Recall</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">F1 score</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Confusion matrix</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Linear regression</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">82</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">81</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">81</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">81</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[207 33]
                                    <break/>[36 115]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Lasso regression</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">79</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">78</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">77</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">77</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[201 39]
                                    <break/>[44 107]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Ridge regression</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">82</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">81</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">81</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">81</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[206 34]
                                    <break/>[36 115]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Na&#x00ef;ve bayes</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">80.15</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">80</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">80</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">80</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[417 91]
                                    <break/>[90 314]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Random forest</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">97.77</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">98</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">98</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">98</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[496 12]
                                    <break/>[9 395]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Logistic Regression</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">79.61</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">79</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">79</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">79</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[419 89]
                                    <break/>[97 307]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">XGboost</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">87.6</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">87</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">88</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">87</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[440 68]
                                    <break/>[46 358]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">LightGBM</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">88</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">88</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">89</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">88</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[443 65]
                                    <break/>[40 364]]</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
                <p>In addition to the accuracy, the random forest model achieved good values in the other metrics, which suggests that the model is robust to real-world data. In its confusion matrix, the true positive (TP) and true negative (TN) counts are high; only 12 and 9 patients were misclassified as false positives and false negatives, respectively. This indicates that the model made comparatively few misclassifications. 
                    <xref ref-type="fig" rid="f9">
Figure 9</xref> gives a graphical representation of the performance metrics for the full feature set, which provides a clear view of which model offers the greatest accuracy in detecting the heart condition of a patient.</p>
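                <p>The relationship between a confusion matrix and the reported metrics can be sketched as follows. This is a minimal illustration, assuming the layout [[TN, FP], [FN, TP]], which matches the 12 false positives and 9 false negatives quoted above; the function name and the choice to report macro averages alongside the positive-class values are assumptions.</p>
                <preformat preformat-type="code"><![CDATA[
```python
def metrics_from_confusion(matrix):
    """Accuracy plus positive-class and macro precision, recall, and F1.

    Assumes a 2x2 matrix laid out as [[TN, FP], [FN, TP]] with no empty
    row or column (otherwise a division by zero occurs).
    """
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total

    def prf(true_pos, false_pos, false_neg):
        precision = true_pos / (true_pos + false_pos)
        recall = true_pos / (true_pos + false_neg)
        f1 = 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    positive = prf(tp, fp, fn)
    negative = prf(tn, fn, fp)  # same formulas with class 0 as the target
    macro = tuple((p + n) / 2 for p, n in zip(positive, negative))
    return {"accuracy": accuracy, "positive": positive, "macro": macro}


# Random forest row of Table 6.
random_forest = metrics_from_confusion([[496, 12], [9, 395]])
```
]]></preformat>
                <p>The same function can be applied to any other row of the table to compare the models on equal terms.</p>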
                <fig fig-type="figure" id="f9" orientation="portrait" position="float">
                    <label>
Figure 9. </label>
                    <caption>
                        <title>Graphical representation of performance metrics for full feature sets.</title>
                    </caption>
                    <graphic id="gr9" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/196399/22dbab05-52a0-45e0-8046-6c44c7522007_figure9.gif"/>
                </fig>
            </sec>
            <sec id="sec14">
                <title>3.3. Calculation of Performance metrics for correlation 0.75</title>
                <p>The performance metrics [
                    <xref ref-type="table" rid="T7">
Table 7</xref>] are obtained by considering the features with a correlation of 0.75, namely chest pain type, exercise angin, and st_slope. The table includes the confusion matrix, accuracy, precision, recall, and F1 score for the different machine learning models employed. The analysis in this section does not consider the full set of features; instead, out of 13 features, only the three features correlated at 0.75 are used to predict the health of the heart. The same eight machine learning models were considered for prediction. Of these models, XGboost and LightGBM achieved the highest accuracy of 78, with a precision, recall, and F1 score of 78 each. Random forest achieved an accuracy of 77, with a precision, recall, and F1 score of 76, 77, and 76, respectively. As the number of features decreases, it becomes more difficult for the model to learn; the accuracy is lower because only three features are used to train the model, and the other metric values are correspondingly lower. The random forest confusion matrix shows TP and TN values of 186 and 116, respectively, which are good but not good enough to adopt the model for real-world patient analysis. False positives and false negatives are 54 and 35, respectively, and these are incorrect predictions. Predicting healthy patients as unhealthy and unhealthy patients as healthy puts the entire family at risk: unhealthy patients are left untreated, and healthy patients undergo unnecessary treatment and financial overhead. The performance metrics for the 0.75-correlation feature set are also presented graphically [
                    <xref ref-type="fig" rid="f10">
Figure 10</xref>], making them easier to interpret than the table. This helps to highlight which model achieves higher accuracy in identifying heart conditions in patients.</p>
                <table-wrap id="T7" orientation="portrait" position="float">
                    <label>
Table 7. </label>
                    <caption>
                        <title>Performance metrics by considering features based on correlation 0.75.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">Name of the model</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Accuracy</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Precision</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Recall</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">F1 score</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Confusion matrix</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Linear regression</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">77</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">77</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">78</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">77</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[185 55]
                                    <break/>[33 118]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Lasso regression</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">76</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">75</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">75</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">75</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[193 47]
                                    <break/>[46 105]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Ridge regression</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">77</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">77</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">78</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">77</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[185 55]
                                    <break/>[33 118]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Na&#x00ef;ve bayes</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">77</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">77</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">78</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">77</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[379 129]
                                    <break/>[78 326]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Random forest</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">77</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">76</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">77</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">76</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[186 54]
                                    <break/>[35 116]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Logistic Regression</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">77</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">76</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">76</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">76</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[402 106]
                                    <break/>[108 296]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">XGboost</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">78</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">78</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">78</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">78</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[406 102]
                                    <break/>[100 304]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">LightGBM</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">78</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">78</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">78</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">78</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[384 124]
                                    <break/>[76 328]]</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
                <fig fig-type="figure" id="f10" orientation="portrait" position="float">
                    <label>
Figure 10. </label>
                    <caption>
                        <title>Performance metrics by considering feature set of 0.75 correlated values.</title>
                    </caption>
                    <graphic id="gr10" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/196399/22dbab05-52a0-45e0-8046-6c44c7522007_figure10.gif"/>
                </fig>
            </sec>
            <sec id="sec15">
                <title>3.4. Calculation of Performance metrics for correlation 0.5</title>
                <p>The performance metrics [
                    <xref ref-type="table" rid="T8">
Table 8</xref>] were obtained with the features selected at a correlation value of 0.50: sex, chest pain, fasting blood sugar, exercise_angin, and st_slope. The analysis in this section therefore did not use the full set of 13 features, only this correlation-selected subset. Eight machine-learning models were considered for prediction, and their performance was measured using accuracy, precision, recall, F1 score, and the confusion matrix. Of these models, random forest performed best, with an accuracy of 80.37% and precision, recall, and F1 score of 80%. As the number of features decreases, the model has less information to learn from, so the accuracy and the other metrics are lower than with the full feature set. The confusion matrix for random forest shows TP and TN values of 405 and 328, respectively, which are good but not sufficient for adopting the model in real-world patient analysis. The false positives and false negatives are 103 and 76, respectively; such incorrect predictions would place an unnecessary burden on patients and their families. A graphical representation of the performance metrics for this feature set [
                    <xref ref-type="fig" rid="f11">
Figure 11</xref>] provides an at-a-glance comparison of the models as line graphs.</p>
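                <p>The article does not spell out how the correlation value is applied during selection. One common reading, assumed here, is to keep every feature whose absolute Pearson correlation with the target meets the chosen value. The following minimal sketch illustrates that reading on made-up data; the function names, the feature names, and the toy values are hypothetical and are not taken from the study.</p>
                <code language="python">
```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def select_by_correlation(features, target, threshold):
    """Keep features whose absolute correlation with the target meets the threshold."""
    return [name for name, values in features.items()
            if abs(pearson(values, target)) >= threshold]


# Made-up toy data (assumption): eight patients, three binary features.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "feature_a": [0, 0, 0, 1, 1, 1, 1, 1],  # positively related to the target
    "feature_b": [0, 1, 0, 1, 0, 1, 1, 1],  # weakly related
    "feature_c": [1, 1, 1, 1, 0, 1, 0, 0],  # negatively related
}

selected = select_by_correlation(features, target, threshold=0.5)
print(selected)
```
                </code>
                <p>Using the absolute value keeps strongly negative predictors as well as strongly positive ones; whether the study did the same is not stated.</p>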
                <table-wrap id="T8" orientation="portrait" position="float">
                    <label>
Table 8. </label>
                    <caption>
                        <title>Performance metrics by considering features based on correlation 0.50.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">Name of the model</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Accuracy</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Precision</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Recall</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">F1 score</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Confusion matrix</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Linear regression</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">80</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">79</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">80</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">79</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[189 51]
                                    <break/>[28 123]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Lasso regression</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">79</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">77</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">77</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">77</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[197 43]
                                    <break/>[41 110]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Ridge regression</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">80</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">79</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">80</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">79</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[190 50]
                                    <break/>[29 122]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Na&#x00ef;ve bayes</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">77.3</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">77</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">77</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">77</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[391 117]
                                    <break/>[90 314]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Random forest</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">80.37</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">80</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">80</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">80</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[405 103]
                                    <break/>[76 328]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Logistic Regression</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">77.49</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">76</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">76</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">76</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[200 40]
                                    <break/>[48 103]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">XGBoost</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">80</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">80</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">80</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">80</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[408 100]
                                    <break/>[84 320]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">LightGBM</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">80</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">79</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">79</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">79</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[409 99]
                                    <break/>[87 317]]</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
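                <p>As a cross-check, the summary metrics can be recomputed from a confusion matrix. The sketch below does this for the random forest row of Table 8. It assumes that rows are actual classes and columns are predicted classes, and that precision, recall, and F1 score are macro-averaged over the two classes; the article states neither, so both are assumptions.</p>
                <code language="python">
```python
# Random forest confusion matrix from Table 8. Rows are assumed to be actual
# classes and columns predicted classes; the article does not state the layout.
cm = [[405, 103],
      [76, 328]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

precision, recall, f1 = [], [], []
for k in range(2):
    predicted_k = cm[0][k] + cm[1][k]  # column sum: samples predicted as class k
    actual_k = sum(cm[k])              # row sum: samples that truly are class k
    p = cm[k][k] / predicted_k
    r = cm[k][k] / actual_k
    precision.append(p)
    recall.append(r)
    f1.append(2 * p * r / (p + r))

# Macro average (assumption): unweighted mean over the two classes.
macro = {name: sum(vals) / 2 for name, vals in
         [("precision", precision), ("recall", recall), ("f1", f1)]}

print(f"accuracy = {100 * accuracy:.2f}%")
for name, value in macro.items():
    print(f"macro {name} = {value:.2f}")
```
                </code>
                <p>The accuracy works out to 733/912, about 80.37%, which agrees with the value reported for random forest in Table 8. If the matrix layout were transposed, per-class precision and recall would swap, while the accuracy would be unchanged.</p>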
                <fig fig-type="figure" id="f11" orientation="portrait" position="float">
                    <label>
Figure 11. </label>
                    <caption>
                        <title>Performance metrics for 0.50 correlated values.</title>
                    </caption>
                    <graphic id="gr11" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/196399/22dbab05-52a0-45e0-8046-6c44c7522007_figure11.gif"/>
                </fig>
            </sec>
            <sec id="sec16">
                <title>3.5. Calculation of Performance metrics for chi-square</title>
                <p>The performance metrics [
                    <xref ref-type="table" rid="T9">
Table 9</xref>] were obtained with the features selected by the chi-square test: age, chest pain, resting_bp, fasting blood sugar, max_heart_beat, exercise_angin, oldpeak, st_slope, and num_vessls_fluro. The analysis in this section did not use the full set of features; nine of the 13 features were selected using chi-square to predict the health of the heart. Eight machine-learning models were considered for prediction, and their performance was measured using accuracy, precision, recall, F1 score, and the confusion matrix. Of these models, random forest performed best, with accuracy, precision, recall, and F1 score all at 98%. The selected features are robust and support efficient detection. The confusion matrix for random forest shows TP and TN values of 495 and 396, respectively, and false positives and false negatives of only 13 and 8, respectively. This small number of wrong predictions indicates that the model detects the health of a patient&#x2019;s heart reliably. A graphical representation of the performance metrics for the chi-square feature set [
                    <xref ref-type="fig" rid="f12">
Figure 12</xref>] gives a clearer view than the table and makes it easier to see which model performs better than the other seven.</p>
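                <p>To illustrate how chi-square ranking of features can work, the sketch below scores each non-negative feature by comparing its per-class totals with the totals expected if the feature were independent of the class, and then keeps the top-scoring features. It is a simplified illustration on made-up data, not the study's pipeline; the function names, feature names, values, and the choice of <italic>k</italic> = 2 are all hypothetical.</p>
                <code language="python">
```python
def chi2_score(values, labels):
    """Chi-square statistic between one non-negative feature and the class labels."""
    total = sum(values)
    score = 0.0
    for c in sorted(set(labels)):
        # Observed: feature total within class c.
        observed = sum(v for v, y in zip(values, labels) if y == c)
        # Expected under independence: class share times overall feature total.
        expected = total * labels.count(c) / len(labels)
        score += (observed - expected) ** 2 / expected
    return score


def select_k_best(features, labels, k):
    """Return the k feature names with the highest chi-square scores."""
    scores = {name: chi2_score(vals, labels) for name, vals in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Made-up toy data (assumption): eight patients, three count-like features.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "informative": [1, 0, 1, 0, 5, 6, 5, 6],
    "noise": [3, 4, 3, 4, 4, 3, 4, 3],
    "moderate": [2, 2, 2, 2, 4, 4, 4, 4],
}

selected, scores = select_k_best(features, labels, k=2)
print(selected)
```
                </code>
                <p>In this toy example, the feature whose totals are identical in both classes scores zero and is dropped. The chi-square statistic in this form requires non-negative feature values, so continuous features with negative values would need rescaling or binning first.</p>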
                <table-wrap id="T9" orientation="portrait" position="float">
                    <label>
Table 9. </label>
                    <caption>
                        <title>Performance metrics by considering features based on chi-square.</title>
                    </caption>
                    <table content-type="article-table" frame="hsides">
                        <thead>
                            <tr>
                                <th align="left" colspan="1" rowspan="1" valign="top">Name of the model</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Accuracy</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Precision</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Recall</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">F1 score</th>
                                <th align="left" colspan="1" rowspan="1" valign="top">Confusion matrix</th>
                            </tr>
                        </thead>
                        <tbody>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Linear regression</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">82</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">81</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">81</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">81</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[206 34]
                                    <break/>[36 115]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Lasso regression</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">78</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">77</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">77</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">77</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[200 40]
                                    <break/>[45 106]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Ridge regression</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">82</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">81</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">81</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">81</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[206 34]
                                    <break/>[37 114]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Na&#x00ef;ve bayes</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">80</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">80</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">80</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">80</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[404 104]
                                    <break/>[77 327]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Random forest</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">98</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">98</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">98</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">98</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[495 13]
                                    <break/>[8 396]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">Logistic Regression</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">80</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">80</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">80</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">80</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[415 93]
                                    <break/>[91 313]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">XGBoost</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">87</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">87</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">87</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">87</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[440 68]
                                    <break/>[52 352]]</td>
                            </tr>
                            <tr>
                                <td align="left" colspan="1" rowspan="1" valign="top">LightGBM</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">88</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">88</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">88</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">88</td>
                                <td align="left" colspan="1" rowspan="1" valign="top">[[442 66]
                                    <break/>[44 360]]</td>
                            </tr>
                        </tbody>
                    </table>
                </table-wrap>
                <fig fig-type="figure" id="f12" orientation="portrait" position="float">
                    <label>
Figure 12. </label>
                    <caption>
                        <title>Performance analysis for chi-square selected feature set.</title>
                    </caption>
                    <graphic id="gr12" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/196399/22dbab05-52a0-45e0-8046-6c44c7522007_figure12.gif"/>
                </fig>
                <p>The current study utilized two datasets, which were combined, after leaving out a feature from the Cleveland dataset that is absent from the CVD dataset, to give 1303 patient records for predicting CVD risk. With the full feature set, the maximum accuracy achieved was 97.77%, by the random forest algorithm. Accuracy was not the only criterion: other performance metrics, such as F1 score (98%), recall (98%), and precision (98%), were also considered to identify models that are robust in detection. The feature selection processes considered were correlation-based and chi-square-based. The results showed that the chi-square-selected feature list achieved high accuracy, precision, F1 score, and recall. With chi-square feature selection, the highest accuracy was 98%, slightly above that of the full feature set.</p>
            </sec>
        </sec>
        <sec id="sec17" sec-type="conclusion">
            <title>4. Conclusion</title>
            <p>The current study focuses on feature selection and the evaluation of machine learning models to find robust models for predicting CVD risk early. The datasets employed were the Cleveland, Hungary, Switzerland, and VA Long Beach dataset
                <sup>
                    <xref ref-type="bibr" rid="ref21">21</xref>
                </sup> with data from 303 patients and the cardiovascular disease dataset
                <sup>
                    <xref ref-type="bibr" rid="ref20">20</xref>
                </sup> with data from 1000 patients. These two datasets were combined into data from 1303 patients by dropping the Cleveland feature that is not present in the cardiovascular disease dataset. Three feature selection settings were adopted: correlation-based selection with correlation values of 0.75 and 0.5, and chi-square-based selection. Of these, the chi-square-based feature list achieved the best accuracy. The random forest classifier performed best in predicting CVD using the features selected by chi-square, reaching the highest accuracy of 98%. The current research covers six machine learning models and two regression models, eight in total, each trained and evaluated on the features selected by the three feature selection methods. In the future, deep learning models can be applied to increase accuracy, and additional feature selection methods and models can be adopted to predict CVD at earlier stages.</p>
        </sec>
        <sec id="sec18">
            <title>Ethical Approval</title>
            <p>Not Applicable.</p>
        </sec>
        <sec id="sec19">
            <title>Consent to participate</title>
            <p>Not Applicable.</p>
        </sec>
        <sec id="sec20">
            <title>Consent to Publish</title>
            <p>Not Applicable.</p>
        </sec>
    </body>
    <back>
        <sec id="sec24" sec-type="data-availability">
            <title>Data availability</title>
            <p>The datasets generated and/or analyzed during the current study are available in the Cardiovascular Disease Dataset, Mendeley Data, and the UCI Machine Learning Repository.</p>
            <p>

                <bold>Repository Name:</bold> Combined Heart Patient Data, Mendeley Data, V1,</p>
            <p>doi:
                <ext-link ext-link-type="uri" xlink:href="http://dx.doi.org/10.17632/v54h5d5pvt.1">10.17632/v54h5d5pvt.1</ext-link>
            </p>
            <p>The project contains the following underlying data: combined_heart_patient_data.csv.
                <sup>
                    <xref ref-type="bibr" rid="ref30">30</xref>
                </sup>
            </p>
            <p>Data are available under the terms of the 
                <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication)</ext-link>.</p>
        </sec>
        <ref-list>
            <title>References</title>
            <ref id="ref1">
                <label>1</label>
                <mixed-citation publication-type="other">
                    <ext-link ext-link-type="uri" xlink:href="https://world-heart-federation.org/wp-content/uploads/World_Heart_Report_2025_Online-Version.pdf">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref2">
                <label>2</label>
                <mixed-citation publication-type="other">
                    <ext-link ext-link-type="uri" xlink:href="https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref3">
                <label>3</label>
                <mixed-citation publication-type="other">
                    <collab>Healthline</collab>:
 (accessed on 20 February 2021).
                    <ext-link ext-link-type="uri" xlink:href="https://www.healthline.com/health/stroke-vs-heart-attack#treatment">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref4">
                <label>4</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Chicco</surname>
                            <given-names>D</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Jurman</surname>
                            <given-names>G</given-names>
                        </name>
</person-group>:
                    <article-title>Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone.</article-title>
                    <source>

                        <italic toggle="yes">BMC Med. Inform. Decis. Mak.</italic>
</source>
                    <year>2020</year>;<volume>20</volume>:<fpage>1</fpage>&#x2013;<lpage>16</lpage>.
                    <pub-id pub-id-type="doi">10.1186/s12911-020-1023-5</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref5">
                <label>5</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Obasi</surname>
                            <given-names>T</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Shafiq</surname>
                            <given-names>MO</given-names>
                        </name>
</person-group>:
                    <chapter-title>Towards comparing and using Machine Learning techniques for detecting and predicting Heart Attack and Diseases.</chapter-title>
                    <source>

                        <italic toggle="yes">Proceedings of the 2019 IEEE International Conference on Big Data, Big Data 2019, Los Angeles, CA, USA, 9&#x2013;12 December 2019.</italic>
</source>pp.<fpage>2393</fpage>&#x2013;<lpage>2402</lpage>.</mixed-citation>
            </ref>
            <ref id="ref6">
                <label>6</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Sharma</surname>
                            <given-names>H</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Rizvi</surname>
                            <given-names>MA</given-names>
                        </name>
</person-group>:
                    <article-title>Prediction of Heart Disease using Machine Learning Algorithms: A Survey.</article-title>
                    <source>

                        <italic toggle="yes">Int J Recent Innov Trends Comput Commun.</italic>
</source>
                    <year>2017</year>;<volume>5</volume>:<fpage>99</fpage>&#x2013;<lpage>104</lpage>.</mixed-citation>
            </ref>
            <ref id="ref7">
                <label>7</label>
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Amogha</surname>
                            <given-names>B</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Deshpande</surname>
                            <given-names>R</given-names>
                        </name>
</person-group>:
                    <chapter-title>A Review on Behavioural Biometric GAIT Recognition.</chapter-title>
                    <person-group person-group-type="editor">

                        <name name-style="western">
                            <surname>Gunjan</surname>
                            <given-names>VK</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Zurada</surname>
                            <given-names>JM</given-names>
                        </name>
</person-group>, editors.
                    <source>

                        <italic toggle="yes">Proceedings of 3rd International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications. Lecture Notes in Networks and Systems.</italic>
</source>
                    <publisher-loc>Singapore</publisher-loc>:
                    <publisher-name>Springer</publisher-name>;<year>2023</year>; vol<volume>540</volume>.
                    <pub-id pub-id-type="doi">10.1007/978-981-19-6088-8_9</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref8">
                <label>8</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Goel</surname>
                            <given-names>S</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Deep</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Srivastava</surname>
                            <given-names>S</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <chapter-title>Comparative Analysis of various Techniques for Heart Disease Prediction.</chapter-title>
                    <source>

                        <italic toggle="yes">Proceedings of the 2019 4th International Conference on Information Systems and Computer Networks, ISCON 2019, Mathura, India, 21&#x2013;22 November 2019.</italic>
</source>pp.<fpage>88</fpage>&#x2013;<lpage>94</lpage>.</mixed-citation>
            </ref>
            <ref id="ref9">
                <label>9</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Chen</surname>
                            <given-names>L</given-names>
                        </name>
</person-group>:
                    <article-title>Heart disease prediction utilizing machine learning techniques.</article-title>
                    <source>

                        <italic toggle="yes">Transactions on Materials, Biotechnology and Life Sciences.</italic>
</source>
                    <year>2024</year>;<volume>3</volume>:<fpage>35</fpage>&#x2013;<lpage>50</lpage>.
                    <pub-id pub-id-type="doi">10.62051/e054hq43</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref10">
                <label>10</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Bolanle</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Elizabeth</surname>
                        </name>

                        <name name-style="western">
                            <surname>Ali</surname>
                            <given-names>M</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Chi-Square and Cuckoo Search Based Feature Selection for Heart Disease Prediction.</article-title>
                    <year>2024</year>.
                    <pub-id pub-id-type="doi">10.1109/seb4sdg60871.2024.10630086</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref11">
                <label>11</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Sowmiya</surname>
                            <given-names>M</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <chapter-title>Classification of Cardiovascular Diseases from Magnetic Resonance Imaging using Classifiers.</chapter-title>
                    <source>

                        <italic toggle="yes">2024 International Conference on Smart Systems for Electrical, Electronics, Communication and Computer Engineering (ICSSEECC).</italic>
</source>
                    <year>2024</year>;<fpage>528</fpage>&#x2013;<lpage>533</lpage>.</mixed-citation>
            </ref>
            <ref id="ref12">
                <label>12</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Chen</surname>
                            <given-names>T</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Guestrin</surname>
                            <given-names>C</given-names>
                        </name>
</person-group>:
                    <chapter-title>XGBoost: A scalable tree boosting system.</chapter-title>
                    <source>

                        <italic toggle="yes">Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.</italic>
</source>
                    <year>2016</year>.</mixed-citation>
            </ref>
            <ref id="ref13">
                <label>13</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Islam</surname>
                            <given-names>MR</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <chapter-title>Data-Driven Heart Disease Prediction by Ensemble Feature Selection and Machine Learning Techniques.</chapter-title>
                    <source>

                        <italic toggle="yes">2022 25th International Conference on Computer and Information Technology (ICCIT).</italic>
</source>
                    <year>2022</year>;<fpage>575</fpage>&#x2013;<lpage>580</lpage>.</mixed-citation>
            </ref>
            <ref id="ref14">
                <label>14</label>
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Patro</surname>
                            <given-names>SP</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Padhy</surname>
                            <given-names>N</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Sah</surname>
                            <given-names>RD</given-names>
                        </name>
</person-group>:
                    <chapter-title>A Road Map for Classification of Heart Disease Using Machine Learning Classifier.</chapter-title>
                    <source>

                        <italic toggle="yes">Next Generation of Internet of Things: Proceedings of ICNGIoT 2022.</italic>
</source>
                    <publisher-loc>Singapore</publisher-loc>:
                    <publisher-name>Springer Nature Singapore</publisher-name>;<year>2022</year>;<fpage>687</fpage>&#x2013;<lpage>702</lpage>.</mixed-citation>
            </ref>
            <ref id="ref15">
                <label>15</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Oleiwi</surname>
                            <given-names>ZC</given-names>
                        </name>

                        <name name-style="western">
                            <surname>AlShemmary</surname>
                            <given-names>EN</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Al-augby</surname>
                            <given-names>S</given-names>
                        </name>
</person-group>:
                    <article-title>Adaptive features selection technique for efficient heart disease prediction.</article-title>
                    <source>

                        <italic toggle="yes">J Al-Qadisiyah Comput Sci Math.</italic>
</source>
                    <year>2023</year>;<volume>15</volume>(<issue>1</issue>):<fpage>1</fpage>.
                    <pub-id pub-id-type="doi">10.29304/jqcm.2023.15.1.1137</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref16">
                <label>16</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Murugesan</surname>
                            <given-names>G</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Kavitha</surname>
                            <given-names>CT</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Jabakumar</surname>
                            <given-names>GG</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Prediction of Heart Disease using Machine Learning Algorithms with Feature Selection Techniques.</article-title>
                    <source>

                        <italic toggle="yes">Cardiometry.</italic>
</source>
                    <year>2023</year>;<volume>26</volume>:<fpage>778</fpage>&#x2013;<lpage>786</lpage>.
                    <pub-id pub-id-type="doi">10.18137/cardiometry.2023.26.778786</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref17">
                <label>17</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Dissanayake</surname>
                            <given-names>K</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Johar</surname>
                            <given-names>MGM</given-names>
                        </name>
</person-group>:
                    <article-title>Heart Disease Diagnostics Using Meta-Learning-Based Hybrid Feature Selection.</article-title>
                    <source>

                        <italic toggle="yes">Appl Comput Intell Soft Comput.</italic>
</source>
                    <year>2024</year>;<volume>2024</volume>(<issue>1</issue>):<fpage>8800497</fpage>.
                    <pub-id pub-id-type="doi">10.1155/2024/8800497</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref18">
                <label>18</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Akyol</surname>
                            <given-names>K</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Atila</surname>
                            <given-names>&#x00dc;</given-names>
                        </name>
</person-group>:
                    <article-title>A study on performance improvement of heart disease prediction by attribute selection methods.</article-title>
                    <source>

                        <italic toggle="yes">Acad Platform-J Eng Sci.</italic>
</source>
                    <year>2019</year>;<volume>7</volume>(<issue>2</issue>):<fpage>174</fpage>&#x2013;<lpage>179</lpage>.</mixed-citation>
            </ref>
            <ref id="ref19">
                <label>19</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Chaurasia</surname>
                            <given-names>V</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Chaurasia</surname>
                            <given-names>A</given-names>
                        </name>
</person-group>:
                    <article-title>Novel method of characterization of heart disease prediction using sequential feature selection-based ensemble technique.</article-title>
                    <source>

                        <italic toggle="yes">Biomed. Mater. Devices.</italic>
</source>
                    <year>2023</year>;<volume>1</volume>(<issue>2</issue>):<fpage>932</fpage>&#x2013;<lpage>941</lpage>.
                    <pub-id pub-id-type="doi">10.1007/s44174-022-00060-x</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref20">
                <label>20</label>
                <mixed-citation publication-type="data">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Doppala</surname>
                            <given-names>BP</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Bhattacharyya</surname>
                            <given-names>D</given-names>
                        </name>
</person-group>:
                    <data-title>Cardiovascular_Disease_Dataset.</data-title>
                    <source>

                        <italic toggle="yes">Mendeley Data.</italic>
</source>
                    <year>2021</year>;<volume>V1</volume>.
                    <pub-id pub-id-type="doi">10.17632/dzz48mvjht.1</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref21">
                <label>21</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Janosi</surname>
                            <given-names>A</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <chapter-title>Heart Disease.</chapter-title>
                    <source>

                        <italic toggle="yes">UCI Machine Learning Repository.</italic>
</source>
                    <year>1989</year>.
                    <pub-id pub-id-type="doi">10.24432/C52P4X</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref22">
                <label>22</label>
                <mixed-citation publication-type="book">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Gelman</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Carlin</surname>
                            <given-names>JB</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Stern</surname>
                            <given-names>HS</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <source>

                        <italic toggle="yes">Bayesian Data Analysis.</italic>
</source>
                    <edition>3rd ed</edition>.
                    <publisher-name>Chapman and Hall/CRC</publisher-name>;<year>2013</year>.
                    <pub-id pub-id-type="doi">10.1201/b16018</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref23">
                <label>23</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Hastie</surname>
                            <given-names>T</given-names>
                        </name>
</person-group>:
                    <article-title>The elements of statistical learning: data mining, inference, and prediction.</article-title>
                    <year>2009</year>.</mixed-citation>
            </ref>
            <ref id="ref24">
                <label>24</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Breiman</surname>
                            <given-names>L</given-names>
                        </name>
</person-group>:
                    <article-title>Random forests.</article-title>
                    <source>

                        <italic toggle="yes">Mach. Learn.</italic>
</source>
                    <year>2001</year>;<volume>45</volume>:<fpage>5</fpage>&#x2013;<lpage>32</lpage>.
                    <pub-id pub-id-type="doi">10.1023/A:1010933404324</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref25">
                <label>25</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Ke</surname>
                            <given-names>G</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>LightGBM: A highly efficient gradient boosting decision tree.</article-title>
                    <source>

                        <italic toggle="yes">Adv. Neural Inf. Proces. Syst.</italic>
</source>
                    <year>2017</year>;<volume>30</volume>.</mixed-citation>
            </ref>
            <ref id="ref26">
                <label>26</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Wang</surname>
                            <given-names>L</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Hu</surname>
                            <given-names>P</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Zheng</surname>
                            <given-names>H</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Integrative modeling of heterogeneous soil salinity using sparse ground samples and remote sensing images.</article-title>
                    <source>

                        <italic toggle="yes">Geoderma.</italic>
</source>
                    <year>2023</year>;<volume>430</volume>:<fpage>116321</fpage>.
                    <pub-id pub-id-type="doi">10.1016/j.geoderma.2022.116321</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref27">
                <label>27</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Cox</surname>
                            <given-names>DR</given-names>
                        </name>
</person-group>:
                    <article-title>The regression analysis of binary sequences.</article-title>
                    <source>

                        <italic toggle="yes">Journal of the Royal Statistical Society Series B: Statistical Methodology.</italic>
</source>
                    <year>1958</year>;<volume>20</volume>(<issue>2</issue>):<fpage>215</fpage>&#x2013;<lpage>232</lpage>.
                    <pub-id pub-id-type="doi">10.1111/j.2517-6161.1958.tb00292.x</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref28">
                <label>28</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Guyon</surname>
                            <given-names>I</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Elisseeff</surname>
                            <given-names>A</given-names>
                        </name>
</person-group>:
                    <article-title>An introduction to variable and feature selection.</article-title>
                    <source>

                        <italic toggle="yes">J. Mach. Learn. Res.</italic>
</source>
                    <year>2003</year>;<volume>3</volume>:<fpage>1157</fpage>&#x2013;<lpage>1182</lpage>.</mixed-citation>
            </ref>
            <ref id="ref29">
                <label>29</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Pedregosa</surname>
                            <given-names>F</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Scikit-learn: Machine learning in Python.</article-title>
                    <source>

                        <italic toggle="yes">J. Mach. Learn. Res.</italic>
</source>
                    <year>2011</year>;<volume>12</volume>:<fpage>2825</fpage>&#x2013;<lpage>2830</lpage>.
                </mixed-citation>
            </ref>
            <ref id="ref30">
                <label>30</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Chavan</surname>
                            <given-names>SS</given-names>
                        </name>
</person-group>:
                    <article-title>Combined heart patient data.</article-title>
                    <source>

                        <italic toggle="yes">Mendeley Data.</italic>
</source>
                    <year>2026</year>;<volume>V1</volume>.
                    <pub-id pub-id-type="doi">10.17632/v54h5d5pvt.1</pub-id>
                </mixed-citation>
            </ref>
        </ref-list>
    </back>
    <sub-article article-type="reviewer-report" id="report490357">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.196399.r490357</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Chauhan</surname>
                        <given-names>Alok Singh</given-names>
                    </name>
                    <xref ref-type="aff" rid="r490357a1">1</xref>
                    <role>Referee</role>
                </contrib>
                <aff id="r490357a1">
                    <label>1</label>Galgotias University, Greater Noida, Uttar Pradesh, India</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>15</day>
                <month>6</month>
                <year>2026</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2026 Chauhan AS</copyright-statement>
                <copyright-year>2026</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport490357" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.178061.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve-with-reservations</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>The article presents a machine-learning-based approach for early cardiovascular disease prediction using a combined dataset of 1303 patient records. The study compares several models, including regression-based models, Na&#x00ef;ve Bayes, Random Forest, Logistic Regression, XGBoost, and LightGBM, with feature selection based on correlation thresholds and chi-square testing. The reported best performance is achieved by Random Forest using chi-square-selected features, with approximately 98% accuracy.</p>
            <p>The work is generally understandable and addresses an important healthcare problem. However, the presentation requires improvement. There are repeated statements, grammatical issues, and some inconsistent reporting of the number of models used. The literature review includes some recent studies, but the references are not consistently formatted, and some citations appear incomplete or unclear. The authors should revise the manuscript for language quality, remove repetition, and ensure that all references are accurate, current, and properly cited.</p>
            <p>The study design is partly appropriate, but several technical concerns must be addressed. The authors combined two datasets with different origins, but the harmonization process is not explained in sufficient detail. It is unclear how differences in feature definitions, missing variables, and target-label mapping were handled. The authors also report very high accuracy, which may indicate possible overfitting or data leakage. The paper should clearly describe the train-test split, random seed, cross-validation strategy, preprocessing pipeline, and whether feature selection was performed only on the training set.</p>
            <p>The methods and analysis are not yet sufficiently detailed for full replication. Important details such as hyperparameter settings for all models, software versions, class distribution, preprocessing steps, encoding of categorical variables, imputation strategy, and validation procedure should be added. The authors should also provide the exact dataset preparation workflow and clarify how missing values were handled, since the current description is not sufficiently rigorous.</p>
            <p>The statistical analysis and interpretation are partly appropriate. Accuracy, precision, recall, and F1-score are useful metrics, but for a medical prediction problem the authors should also report ROC-AUC, sensitivity, specificity, confidence intervals, and preferably cross-validation results. Since the dataset is described as imbalanced, relying mainly on accuracy may be misleading. Statistical comparison between models would strengthen the results.</p>
            <p>The source data appear to be available through Mendeley Data and public repositories, which supports reproducibility. However, the authors should ensure that the exact combined dataset, preprocessing code, model-training scripts, and feature-selection code are openly available.</p>
            <p>The conclusions are partly supported by the reported results, but they should be stated more cautiously. The authors should avoid implying direct clinical applicability unless external validation is performed. To make the article scientifically sound, the authors must address the dataset-merging procedure, prevent possible data leakage, add stronger validation such as k-fold cross-validation or external validation, provide complete methodological details, and improve reporting quality. After these revisions, the study may provide a useful comparative analysis of machine-learning models for cardiovascular disease prediction.</p>
            <p>Is the work clearly and accurately presented and does it cite the current literature?</p>
            <p>Partly</p>
            <p>If applicable, is the statistical analysis and its interpretation appropriate?</p>
            <p>Partly</p>
            <p>Are all the source data underlying the results available to ensure full reproducibility?</p>
            <p>Yes</p>
            <p>Is the study design appropriate and is the work technically sound?</p>
            <p>Partly</p>
            <p>Are the conclusions drawn adequately supported by the results?</p>
            <p>Partly</p>
            <p>Are sufficient details of methods and analysis provided to allow replication by others?</p>
            <p>Partly</p>
            <p>Reviewer Expertise:</p>
            <p>Artificial Intelligence; Machine Learning; Data Mining; Medical Informatics / Healthcare Analytics; Predictive Modeling; Computational Intelligence</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard; however, I have significant reservations, as outlined above.</p>
        </body>
    </sub-article>
</article>
