Keywords
Ischemic Heart Disease (IHD), Explainable AI (XAI), SHAP, Transfer Learning, Bidirectional LSTM (BiLSTM), Residual Attention Mechanism, Fairness in AI, Deep Learning, Demographic Bias Mitigation, Clinical Decision Support Systems
Early and accurate prediction of Ischemic Heart Disease (IHD) is critical to reducing cardiovascular mortality through timely intervention. While deep learning (DL) models have shown promise in disease prediction, many lack interpretability, generalizability, and fairness—particularly when deployed across demographically diverse populations. These shortcomings limit clinical adoption and risk reinforcing healthcare disparities.
This study proposes a novel model: X-TLRABiLSTM (Explainable Transfer Learning–based Residual Attention Bidirectional LSTM). The architecture integrates transfer learning from pre-trained cardiovascular models into a BiLSTM framework with residual attention layers to improve temporal feature extraction and convergence. To ensure transparency, the model incorporates SHAP (SHapley Additive exPlanations) to quantify the contribution of each clinical feature to the final prediction. Additionally, a demographic reweighting strategy is applied to the training process to reduce bias across subgroups defined by age, gender, and ethnicity. The model was evaluated on the UCI Heart Disease dataset using 10-fold cross-validation.
The X-TLRABiLSTM model achieved a classification accuracy of 98.2%, with an F1-score of 98.1% and an AUC of 99.1%, outperforming standard ML classifiers and state-of-the-art DL baselines. SHAP-based interpretability analysis highlighted clinically relevant predictors such as chest pain type, ST depression, and thalassemia. Fairness evaluation revealed minimal performance disparity across demographic subgroups, with F1-score gaps ≤ 0.6% and error rate gaps ≤ 0.4%. Confusion matrix analysis demonstrated low false-positive and false-negative rates, reinforcing the model’s reliability for clinical deployment.
X-TLRABiLSTM offers a highly accurate, interpretable, and demographically fair framework for IHD prognosis. By combining transfer learning, residual attention, explainable AI, and fairness-aware optimization, this model advances trustworthy AI in healthcare. Its successful performance on benchmark clinical data supports its potential for real-world integration in ethical, AI-assisted cardiovascular diagnostics.
Ischemic Heart Disease (IHD), also known as coronary artery disease (CAD), is the leading cause of death worldwide, responsible for around 9 million fatalities per year.1 IHD arises from the narrowing or blockage of coronary arteries due to plaque buildup and can lead to myocardial infarction, arrhythmias, and even sudden cardiac death. Because prompt intervention considerably improves survival and lowers the rate of complications, early detection and prognosis are essential.2 In recent years, wearable monitoring devices, electronic health records (EHRs), and public datasets (e.g., UCI and Kaggle) have generated new opportunities for data-driven cardiovascular disease diagnosis and prediction.
Many traditional methods, such as logistic regression and the Framingham Risk Score, rely on a small set of accepted clinical risk variables. These models are interpretable, but they do not capture the important nonlinear relationships that emerge over time in health data.3 In scenarios with diverse populations, such models usually perform poorly and are easily disrupted by noise and missing information. More recent machine learning (ML) and deep learning (DL) techniques overcome these restrictions. Recurrent models, and Long Short-Term Memory (LSTM) networks in particular, are well suited to sequential data such as physiological time series, patient visit logs, and electrocardiograms (ECGs).4 Because they can capture long-term dependencies, LSTM networks are well suited to modeling patterns of disease progression. Recent work has also shown that applying attention so the model focuses on important temporal moments improves feature relevance and prediction accuracy, especially in clinical applications.5
Despite these good results, deep learning techniques currently face three significant issues: they may treat people from different demographic backgrounds differently, their predictions are not always easily understood, and they do not always generalize well. Models often perform poorly at other sites or on other patient populations because they were trained on a single group of patients. This makes them harder to adopt in care, since clinicians generally want to understand each prediction before acting on it in treatment planning.6 Algorithmic bias can also worsen existing disparities in healthcare; in some cases, models perform less well for women, elderly patients, or patients from ethnic minorities.7
To address these problems, this study proposes the novel X-TLRABiLSTM model for predicting outcomes in Ischemic Heart Disease. The proposed design introduces the following contributions:
1. Transfer learning: knowledge learned from larger cardiovascular datasets is transferred into the model, so that clinical data with limited samples can be handled and processed more effectively.
2. Residual attention within the BiLSTM enhances its pattern-finding ability. By examining changes in blood pressure, heart rate, and cholesterol levels, together with residual links, the model remains stable throughout training.
3. SHapley Additive exPlanations (SHAP), a leading XAI technique, is applied so that every prediction comes with a clear justification. As a result, physicians can see which features, such as age, cholesterol, and thalassemia, contributed most to the prediction that a patient does or does not have IHD.
4. Demographic reweighting is applied during training to achieve fairer learning. As a result, the model treats different groups evenly and produces similar outcomes regardless of gender, age, or ethnicity.
The novelty of this study lies in combining residual attention, transfer learning, and explainable AI (XAI) within a single network to support accurate and fair Ischemic Heart Disease prognosis. Unlike most existing models, X-TLRABiLSTM emphasizes explanations that are usable in clinical practice rather than accuracy alone, so that predictions are easy for clinicians to understand and trust. The model also uses demographic reweighting to mitigate bias related to age, gender, and ethnicity, an important but often ignored aspect of healthcare AI. Compared with recent deep learning approaches to cardiovascular risk prediction, jointly addressing generalization, interpretability, and fairness is a significant improvement.
The introduction of ML, DL, and XAI methods has reshaped research on the prediction of Ischemic Heart Disease (IHD). This section reviews traditional machine learning, deep learning, hybrid models, attention-residual learning, and the growing use of explainability and fairness in medical AI systems.
Before deep learning became dominant, heart disease prediction relied on conventional machine learning algorithms. Naïve Bayes, Support Vector Machines (SVM), Random Forests (RF), and Logistic Regression (LR) were often applied to the UCI Heart Disease dataset. Jabbar et al. found that RF with feature selection produced much better predictions on heart disease data.8 To determine the most pertinent predictors of IHD, Shah et al. used SVM with embedded feature selection, obtaining comparatively good accuracy with less model complexity.9 Despite this early success, these conventional approaches show limits when handling high-dimensional, nonlinear, and temporally dependent health data. They also need extensive manual feature engineering and cannot recognize the temporal patterns present in ECG data or patient health records.10
DL techniques have demonstrated significant potential as medical data and computing power become more accessible. LSTM networks have been successfully utilized to model temporal health data because of their recurrent architecture. By identifying long-term relationships in patient vitals and historical data, Bhavekar and Goswami's hybrid RNN-LSTM model demonstrated improved accuracy on datasets related to heart disease.11 For the identification of cardiovascular disease, Doppala et al. presented an ensemble-based DL model that combines several learning techniques and showed enhanced generalization and resilience.12
Furthermore, hybrid models that blend DL with traditional ML or soft computing techniques have drawn interest. For the diagnosis of IHD, Suresh et al. suggested a hybrid SVM-RF model that optimizes model inputs through recursive feature removal.13 To optimize predictions for a variety of biomedical datasets, Ampavathi and Saradhi presented a multi-disease prediction system built on a hybrid deep learning architecture.14 Although these models outperform conventional machine learning techniques, most of them are not interpretable or generalizable to other patient groups. Furthermore, the issue of model bias and fairness is not sufficiently addressed by many.
Attention mechanisms and residual learning have recently been incorporated into DL designs, especially for time-series data such as ECG signals. By focusing on important segments of the input sequence, attention enables models to increase interpretability and accuracy. Residual learning, in turn, makes it possible to train deeper architectures effectively without vanishing gradients, which improves training dynamics. For identifying ECG anomalies, Liu et al. created an ensemble of residual networks with attention, which demonstrated better classification results on cardiac datasets.15 With a 97.7% accuracy rate, Cenitta et al.'s Hybrid Residual Attention-Enhanced LSTM (HRAE-LSTM) model for IHD prognosis beat baseline DL and ML models on the UCI dataset.16 These models demonstrate the effectiveness of integrating attention mechanisms with residual connections. However, they remain essentially black-box models and still lack tools to explain predictions, a critical requirement in clinical contexts.
To make medical AI more understandable, Explainable AI approaches have become essential tools. Such methods include integrated gradients, SHapley Additive exPlanations (SHAP), and Local Interpretable Model-Agnostic Explanations (LIME), and are often applied to explain feature importance in DL models. Using SHAP with a two-tier ensemble model, Tama et al. allowed practitioners to see how important features such as chest pain and cholesterol were for predicting heart disease.17 The approach introduced by Andrew and Karthikeyan combines privacy preservation with model explainability for image analysis.18 These results underline the growing need for models that are both transparent and practical. In contrast to approaches designed only for post hoc explanation, ours makes SHAP explanations part of the model-building process, providing an immediate explanation for each prediction.
In healthcare applications, transfer learning has emerged as a successful strategy for enhancing model generalization and overcoming data scarcity. It entails reusing weights from models trained on large datasets for new, smaller domains, saving training time and increasing accuracy.
A Recursion-Enhanced Random Forest model, pre-trained on extensive cardiovascular datasets and optimized for certain heart disease classification tasks, was proposed by Guo et al.19 For better arrhythmia classification, Li et al. used a squeeze-and-excitation residual network pretrained on generic heartbeat datasets.20 Performance on target datasets that would otherwise be too small to train sophisticated DL models was greatly enhanced in both situations by transfer learning. The suggested model improves its performance on smaller datasets, such as UCI or actual hospital data, by initializing BiLSTM layers with cardiovascular domain knowledge through transfer learning.
Concerns regarding algorithmic bias have grown in significance as AI systems become more and more integrated into healthcare decision-making. Research has demonstrated that a number of models operate differently for different genders, ages, and ethnicities, which results in unfair treatment outcomes.21 Although fairness was not specifically assessed, Sonawane and Patil suggested a hybrid heuristic-based clustering technique to balance performance across demographic categories.22 To guarantee equitable healthcare delivery, Rani et al. demanded that fairness criteria be incorporated into machine learning algorithms used for clinical risk prediction.23 By using demographic reweighting during training, the model directly addresses this problem with the goal of lowering prediction bias and enhancing equity across a range of demographics.
Using a range of ML and DL approaches, the body of existing work shows excellent progress in IHD prognosis. Nonetheless, several restrictions still exist:
• Attention-residual models continue to lack integrated interpretability
• Traditional machine learning models lack scalability and temporal awareness
• DL and hybrid models frequently operate as black boxes
• IHD models underutilize transfer learning
• Fairness is rarely systematically addressed
Our suggested model fills these shortcomings by integrating bias mitigation techniques, SHAP-based interpretability, transfer learning, and residual attention-enhanced BiLSTM into a single framework, creating a novel and comprehensive method for IHD diagnosis.
The datasets, data preprocessing methods, architecture, hyperparameter tuning, and evaluation metrics for the proposed Explainable Transfer Learning-Based Residual Attention BiLSTM (X-TLRABiLSTM) model are described in this section.
The UCI Heart Disease Dataset, a popular benchmark dataset in cardiovascular research that is openly accessible via the UCI Machine Learning Repository, is used in this study. The dataset comprises 303 patient records with 14 pertinent clinical attributes (listed in Table 1), including age, sex, type of chest pain, resting blood pressure, cholesterol, fasting blood sugar, and the presence or absence of ischemic heart disease. The binary target variable indicates whether cardiac disease is present (1) or absent (0). The Cleveland subset is the most widely utilized because of its completeness and quality, although the collection also includes data from three other centres: Hungary, Switzerland, and Long Beach (VA). The features include both continuous and categorical variables, allowing thorough modeling of cardiovascular risk. Previous research has made considerable use of this dataset to train and compare machine learning models for heart disease classification.24
3.1.1 Ethical considerations
This study was conducted using the publicly available UCI Heart Disease dataset,24 which comprises anonymized and de-identified data. No new data collection or human subject interaction was performed by the authors. Therefore, ethical approval and informed consent were not required for this secondary data analysis. The original data were collected and published in accordance with institutional guidelines and the principles of the Declaration of Helsinki.
Preprocessing is essential for guaranteeing data quality, particularly in medical datasets where mixed data types, missing values, and inconsistencies are prevalent. Preprocessing was meticulously planned for this project to get the UCI Heart Disease dataset24 ready for deep learning model training.
3.2.1 Missing value handling
Many medical records are incomplete because patients left the study, results were not recorded, or information was entered inconsistently. Several records in the UCI Heart Disease dataset24 lack measurements for thal (thalassemia) and ca (number of major vessels coloured by fluoroscopy). Because removing these entries, or filling them with simple strategies such as mean or mode imputation, risks distorting the data, a fuzzy-based multiple imputation strategy was used. Fuzzy logic produces several plausible candidate values for each missing data point, which are then aggregated to improve the accuracy of the estimate. By working with fuzzy sets, multiple imputation handles uncertainty better than plain deterministic methods: it avoids introducing bias, keeps the data usable for model training, and preserves the main statistical properties of the dataset. This matters clinically, because discarding records with missing values can mean losing critical patient information.25
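The fuzzy-based imputation procedure is not specified in code form here, so the following is only a rough sketch of the multiple-imputation idea, using scikit-learn's IterativeImputer with several random seeds as a stand-in for the fuzzy candidate generation; the averaging step plays the role of aggregating the candidate values. The multiple_impute helper and the column handling are illustrative assumptions, not the authors' exact method.

```python
# Illustrative sketch only: multiple imputation with averaging as a stand-in
# for the fuzzy-based multiple imputation described above (assumed UCI column
# names such as 'thal' and 'ca').
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def multiple_impute(df: pd.DataFrame, n_imputations: int = 5) -> pd.DataFrame:
    """Run several stochastic imputations and average the candidate values."""
    imputed_runs = []
    for seed in range(n_imputations):
        imputer = IterativeImputer(sample_posterior=True, random_state=seed)
        imputed_runs.append(imputer.fit_transform(df.values))
    averaged = np.mean(imputed_runs, axis=0)   # aggregate the candidates
    # Note: categorical codes (e.g. 'ca', 'thal') may need rounding afterwards.
    return pd.DataFrame(averaged, columns=df.columns)

# Example usage (assumed file name):
# df = pd.read_csv("heart.csv"); df_complete = multiple_impute(df)
```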
3.2.2 Normalization
The data contain continuous clinical characteristics such as age, cholesterol, and resting blood pressure, measured on very different numeric scales. Unnormalized inputs can cause an LSTM to learn slowly or unstably, because training is sensitive to the scale of each feature. For this reason, numerical features were rescaled to the range [0, 1] using min-max normalization:

$x' = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}$
This method guarantees that every feature contributes equally to model learning while maintaining the original distribution’s form. In LSTM networks, where different feature sizes can adversely affect convergence and learning stability, normalization is particularly crucial.26
3.2.3 Categorical encoding
The dataset contains several important categorical variables, such as thalassemia (thal), the slope of the ST segment (slope), and the type of chest pain (cp). One-Hot Encoding was used to convert these categorical variables, turning each distinct category into a binary vector. This avoids imposing ordinal relationships on variables that are fundamentally nominal. Because of its simplicity and its ability to let neural networks read categorical inputs without bias, one-hot encoding has been widely used in clinical data processing.27 After encoding, the feature space becomes fully compatible with the downstream neural network design, although its dimensionality grows.
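As a minimal illustration of the normalization and encoding steps described in the two subsections above, the scaling and one-hot transformations can be combined in a single preprocessor. The column lists follow the usual UCI attribute names, the preprocessor variable is a hypothetical helper, and sparse_output=False assumes scikit-learn 1.2 or later.

```python
# Minimal preprocessing sketch: min-max scaling of numeric features to [0, 1]
# and one-hot encoding of nominal features (assumed UCI column names).
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

numeric_cols = ["age", "trestbps", "chol", "thalach", "oldpeak"]
categorical_cols = ["cp", "slope", "thal", "restecg", "ca"]

preprocessor = ColumnTransformer(
    transformers=[
        ("scale", MinMaxScaler(), numeric_cols),   # x' = (x - min) / (max - min)
        ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False),
         categorical_cols),
    ],
    remainder="passthrough",  # keep binary features such as 'sex', 'fbs', 'exang'
)

# Fit on training data only to avoid information leakage into the test folds:
# X_train_enc = preprocessor.fit_transform(X_train)
# X_test_enc = preprocessor.transform(X_test)
```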
The proposed model, X-TLRABiLSTM (Explainable Transfer Learning-based Residual Attention Bidirectional LSTM), was designed to capture temporal dependencies effectively, highlight important features through attention, improve generalization through transfer learning, and guarantee interpretability through Explainable AI (XAI) techniques such as SHAP. This hybrid architecture specifically targets the prognosis of Ischemic Heart Disease (IHD) from structured clinical datasets.
3.3.1 Architecture overview
Figure 1 shows the X-TLRABiLSTM model, which incorporates the following critical components:
1. Input layer: The input layer receives the pre-processed clinical feature vectors from the dataset. These comprise both one-hot encoded categorical variables (e.g., thalassemia, type of chest pain) and normalized numerical values (e.g., age, blood pressure, cholesterol). This consistent input representation stabilizes the training process and guarantees compatibility with deep learning models. When appropriate, it preserves the time dimension in the data, which serves as the foundation for sequential modeling.
2. Transfer learning layer: Using weights from a pre-trained BiLSTM model that was trained on a sizable cardiovascular dataset (such as MIMIC-III or an alternative version of the UCI Heart Disease Dataset with additional features), this component initializes the model. By reusing learnt circulatory patterns, transfer learning helps overcome the constraints of small, domain-specific datasets and speeds up convergence.32 The model prevents overfitting and improves generalization by fine-tuning the pre-trained layers on the target dataset, particularly in clinical contexts with limited resources.
3. Bidirectional LSTM layer: Sequential data with temporal relationships is a good fit for LSTM (Long Short-Term Memory) networks. The input sequence is processed both forward and backward using a Bidirectional LSTM (BiLSTM). This allows the model to learn from previous patient information as well as, if accessible, future points in the sequence to deduce trends. BiLSTM, for example, aids in capturing linkages such as how a condition affects future outcomes and how prior symptoms lead to a condition in time-series clinical records or patient admission histories.28 A richer feature representation is created by concatenating the forward and backward LSTM outputs.
4. Residual attention mechanism: By combining the advantages of attention weights (to highlight significant features or time steps) and residual connections (to counteract vanishing gradients and enhance convergence), the residual attention mechanism improves feature representations. Important features are preserved through residual learning, which enables the model to retain the original input even after many transformations.29 Attention modules compute weight scores indicating how relevant each input element or time step is. The final output is a weighted combination of the input and its residual transformation, allowing the model to selectively enhance relevant patterns while maintaining context. Mathematically:

$H_{out} = \left(1 + M(X)\right) \odot F(X)$

where $M(X)$ is the attention mask, $F(X)$ is the transformed feature representation, and $\odot$ denotes element-wise multiplication. A code sketch of this mechanism is given after this component list.
5. Fully connected layer with sigmoid activation: The output of the residual attention-enhanced BiLSTM is passed to a dense (fully connected) layer with a sigmoid activation function. This converts the high-dimensional latent vector into a single probability value representing the patient's chance of developing ischemic heart disease:

$\hat{y} = \sigma\left(W h_T + b\right)$

where $\sigma$ is the sigmoid function, $\hat{y}$ is the predicted probability, and $h_T$ is the final hidden state.
6. Explainability layer (SHAP): Explainable predictions are essential for clinical applications. Local interpretability is produced by integrating the SHAP (SHapley Additive exPlanations) approach, which explains how each feature such as age, thalassemia, and type of chest pain contributes to a particular prediction.30 Based on game theory, SHAP rates each feature according to its marginal contribution to the model’s output. This promotes informed decision-making and builds confidence in AI recommendations by assisting physicians in understanding why the model identified a patient as high-risk.
7. Fairness reweighting module: AI models can inherit and magnify biases present in the data, particularly for underrepresented populations such as women, elderly patients, or ethnic minorities. To counteract this, a demographic reweighting technique is applied during training.31 Each sample in the dataset is assigned a weight determined by its demographic group, so that the model learns equitably across all populations. The loss function implements this as follows:

$\mathcal{L} = \sum_{i=1}^{N} w_i \, \ell\left(y_i, \hat{y}_i\right)$

where $w_i$ reflects demographic importance. This adjustment improves equity in model performance, reducing disparities in prediction accuracy between groups.
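The exact layer sizes and wiring are not given in code form in the text, so the sketch below is only an illustrative Keras rendering of the components above. It assumes the tabular feature vector is fed to the BiLSTM as a short pseudo-sequence (one feature per step) and that the residual attention follows the $(1 + M) \odot H$ form of component 4; build_x_tlrabilstm, the layer sizes, and the pretrained weight file name are hypothetical.

```python
# Illustrative Keras sketch of the X-TLRABiLSTM components (not the authors'
# exact configuration).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_x_tlrabilstm(n_features: int, lstm_units: int = 64) -> tf.keras.Model:
    inputs = layers.Input(shape=(n_features, 1))      # each feature as one "time step"

    # Bidirectional LSTM: forward and backward hidden states are concatenated.
    h = layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True))(inputs)

    # Residual attention: a sigmoid mask M is learned per step, and the output
    # is H + M * H, i.e. (1 + M) applied element-wise to the features.
    mask = layers.Dense(2 * lstm_units, activation="sigmoid")(h)
    attended = layers.Multiply()([h, mask])
    h = layers.Add()([h, attended])

    h = layers.GlobalAveragePooling1D()(h)
    outputs = layers.Dense(1, activation="sigmoid")(h)  # probability of IHD
    return models.Model(inputs, outputs)

# Transfer learning (assumed workflow): initialise from a BiLSTM pre-trained on
# a larger cardiovascular corpus before fine-tuning on the target data.
# model = build_x_tlrabilstm(n_features=28)
# model.load_weights("pretrained_cardio_bilstm.h5", by_name=True, skip_mismatch=True)
```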
3.3.2 Mathematical formulation
Let:
• $x_t$ denote the input feature vector at time step $t$
• $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ represent the hidden states from the forward and backward LSTM passes
Equation 1: BiLSTM Representation

$h_t = \left[\overrightarrow{h_t} \, ; \, \overleftarrow{h_t}\right]$

where $[\,\cdot\, ; \,\cdot\,]$ denotes vector concatenation.
Equation 2: Residual Feature Update

$X_F^{(t+1)} = \sigma\left(W_1 X_F^{(t)} + W_2 h_t\right) + \alpha X_F^{(t)}$

Here, $\sigma$ is the activation function (e.g., sigmoid), $\alpha$ is a residual influence coefficient, and $W_1, W_2$ are trainable weight matrices.
Equation 3: Attention Mask Update

$H^{(t+1)} = \left(1 + M^{(t)}\right) \odot X_F^{(t+1)}$

where $M^{(t)}$ is the attention mask matrix and $\odot$ is element-wise multiplication.
Equation 4: Final Output Calculation

$\hat{y} = \sigma\left(W_o H^{(T)} + b_o\right)$

where $\sigma$ is the sigmoid activation and $\hat{y}$ denotes the probability of heart disease.
3.3.3 Algorithm: X-TLRABiLSTM training
Input: Dataset D = {X, Y}, Pretrained BiLSTM weights W_pre, Epochs E
Output: Trained X-TLRABiLSTM Model
1: Initialize BiLSTM with W_pre
2: for epoch = 1 to E do
3: for each batch (X_batch, Y_batch) do
4: X_norm ← Normalize(X_batch)
5: X_encoded ← OneHotEncode(X_norm)
6: H ← BiLSTM(X_encoded)
7: for each time step t do
8: Compute Residual feature: X_F^(t+1) ← Eq(2)
9: Update Attention Mask: H^(t+1) ← Eq(3)
10: end for
11: Compute prediction: y_hat ← Eq(4)
12: Compute loss L (with demographic reweighting)
13: Backpropagate and update weights
14: end for
15: end for
16: Apply SHAP for feature attribution on final model
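To make the training algorithm concrete, the Keras sketch below reuses the build_x_tlrabilstm function from the architecture sketch above; X_train, y_train, and sample_weights are assumed placeholders for the encoded features, labels, and demographic weights, and the optimizer and epoch settings are illustrative rather than the tuned configuration of Table 2. Keras's sample_weight argument multiplies each sample's loss term, which corresponds to the reweighted cross-entropy of Equation 5 below.

```python
# Illustrative training sketch for the algorithm above (placeholders assumed).
import tensorflow as tf

X_seq = X_train[..., None]                 # add the channel the BiLSTM input expects

model = build_x_tlrabilstm(n_features=X_seq.shape[1])
# Optional transfer-learning initialisation (assumed weight file):
# model.load_weights("pretrained_cardio_bilstm.h5", by_name=True, skip_mismatch=True)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="binary_crossentropy",
    metrics=["accuracy", tf.keras.metrics.AUC(name="auc")],
)

# sample_weight scales each sample's cross-entropy term, realising the
# demographically reweighted loss during backpropagation.
model.fit(
    X_seq, y_train,
    sample_weight=sample_weights,
    epochs=100,
    batch_size=32,
)
```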
3.3.4 Explainability and fairness
The model incorporates SHAP (SHapley Additive exPlanations) to explain each individual prediction by attributing it to the input features. SHAP provides local interpretability, which is essential in healthcare applications.30 Furthermore, demographic reweighting is applied to the loss function to address class imbalance across subgroups such as gender and age. Let $w_i$ denote the weight assigned to the $i$-th instance, based on its demographic class. The reweighted binary cross-entropy loss becomes:

Equation 5: Fairness-Aware Loss

$\mathcal{L}_{fair} = -\frac{1}{N}\sum_{i=1}^{N} w_i \left[ y_i \log \hat{y}_i + \left(1 - y_i\right)\log\left(1 - \hat{y}_i\right) \right]$
Ensuring fairness in model predictions is essential in healthcare applications, especially when working with unbalanced demographic groups. Serious ethical and clinical concerns arise when predictive models trained on biased datasets show higher error rates for underrepresented demographics, such as women, elderly patients, or ethnic minorities.33 A demographic reweighting technique was therefore used during model training. This method adjusts each training instance's contribution to the loss function according to the demographic group it belongs to. Let $G$ denote the set of demographic groups (e.g., gender, age category), and $p_g$ be the proportion of group $g \in G$ in the dataset. The weight for a sample belonging to group $g$ is defined as:

$w_g = \frac{1}{p_g}$

This inverse-frequency weighting gives minority groups greater priority during training, balancing the model's learning and reducing unequal performance across populations.34 The overall training loss is then:

$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} w_{g(i)} \left[ y_i \log \hat{y}_i + \left(1 - y_i\right)\log\left(1 - \hat{y}_i\right) \right]$

where $g(i)$ is the demographic group of sample $i$.
In clinical decision support systems, where fairness and trust are crucial, this fairness-aware loss motivates the model to produce more equitable predictions.35
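A minimal sketch of the inverse-frequency weighting described above follows; the demographic_weights helper and the combined gender/age grouping are illustrative assumptions, not necessarily the authors' exact grouping or normalization.

```python
# Sketch of inverse-frequency demographic weighting: each sample's weight is
# 1 / p_g, the reciprocal of its group's proportion in the training data.
import numpy as np
import pandas as pd

def demographic_weights(groups: pd.Series) -> np.ndarray:
    """Return one weight per sample based on its demographic group."""
    proportions = groups.value_counts(normalize=True)       # p_g for each group
    weights = groups.map(lambda g: 1.0 / proportions[g]).to_numpy()
    return weights / weights.mean()                          # normalise to mean 1

# Example with a combined gender/age-band group label (assumed encoding):
# groups = df["sex"].astype(str) + "_" + (df["age"] >= 50).astype(str)
# sample_weights = demographic_weights(groups)
```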
As summarized in Table 2, an extensive grid search was carried out across a variety of hyperparameters to attain the best model performance. The tuning procedure focused on important factors such as the number of LSTM units, learning rate, batch size, number of training epochs, and choice of optimizer. To guarantee generalizability, each configuration was assessed with 10-fold cross-validation. The final configuration (highlighted in the table) offered the best balance between accuracy and training efficiency.
This systematic tuning process is consistent with recent best practices in medical deep learning.36 Model performance was assessed with a variety of classification and fairness metrics, including accuracy, precision, recall, specificity, F1-score, AUC-ROC, and fairness gap, all of which are highly relevant to medical AI systems.37,38 All metrics were averaged over ten-fold cross-validation to guarantee statistical robustness and reliability. Together, these measures provide a comprehensive assessment of the model's diagnostic utility and fairness.
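As an illustration of how the classification metrics and the fairness gap can be computed together for one fold, a sketch is given below; the evaluate helper and its arguments (y_prob as predicted probabilities, a boolean subgroup mask) are hypothetical names rather than the authors' implementation.

```python
# Evaluation sketch: standard classification metrics plus the fairness gap,
# i.e. the absolute F1 difference between two subgroups.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def evaluate(y_true, y_prob, subgroup_mask, threshold=0.5):
    """Return standard metrics and |F1(group A) - F1(group B)| for a binary split."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    mask = np.asarray(subgroup_mask, dtype=bool)
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_prob),
        "fairness_gap_f1": abs(f1_score(y_true[mask], y_pred[mask])
                               - f1_score(y_true[~mask], y_pred[~mask])),
    }
```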
The accuracy, interpretability, generalizability, and fairness of ischemic heart disease (IHD) prognostic systems are all improved by the various changes introduced by the suggested X-TLRABiLSTM model. The following are the main improvements:
• Integration of transfer learning: The model makes use of information from a BiLSTM model that has already been trained on a sizable cardiovascular dataset. By transferring domain-specific representations, this allows for efficient generalization, even on smaller or unbalanced clinical samples.
• Residual attention mechanism: To preserve key clinical features and emphasize important temporal changes, the network contains a dedicated residual attention module. As a result, the network trains quickly, its features remain relevant, and training stays stable even when gradients are small.
• Bidirectional temporal learning: The architecture can follow disease progression by capturing both earlier and later context in patients' health records.
• Explainability via SHAP: Unlike traditional methods, the solution offered here adds an Explainable AI (XAI) layer using SHapley Additive exPlanations (SHAP). Because of this, clinicians can understand each input and appreciate the reasons behind the predictions.
• Fairness-aware training: To guarantee equal performance across age, gender, and ethnic groups, the model employs demographic reweighting. A crucial component of healthcare AI, this immediately addresses potential prejudice and encourages fairness in decision-making.
• Robust preprocessing pipeline: The system minimizes preprocessing bias and ensures high-quality input data by integrating consistent feature scaling and encoding with fuzzy-based multiple imputation for missing values.
• Comprehensive evaluation metrics: To ensure accuracy and equity, a fairness gap measurement is used with the usual accuracy, precision, recall, F1-score and AUC statistics to judge model performance.
To fully evaluate the X-TLRABiLSTM model, a 10-fold cross-validation procedure was applied to the pre-processed UCI Heart Disease dataset.24 In each fold, the data were divided into training (90%) and testing (10%) sets while preserving the class distribution. To prevent information leakage, all preprocessing steps (normalization, fuzzy-based imputation, and one-hot encoding) were fitted on the training partition only and then applied unchanged to the test data. Model weights were initialized via transfer learning from a large-scale cardiovascular BiLSTM, and demographic reweighting was applied during loss computation to counteract subgroup imbalances. Hyperparameters (LSTM units, learning rate, batch size, optimizer) were chosen by grid search, and performance was averaged over folds to ensure statistical robustness.
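A sketch of this evaluation protocol, tying together the earlier preprocessing, weighting, model, and metric sketches, is shown below; X (a raw feature DataFrame), y (a NumPy label array), and groups (a demographic-group Series) are assumed placeholders, and the sex-based subgroup mask is only one example split.

```python
# Leak-free 10-fold cross-validation sketch (placeholders and helpers from the
# earlier sketches are assumed to be defined).
import numpy as np
from sklearn.model_selection import StratifiedKFold

fold_scores = []
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    X_tr, X_te = X.iloc[train_idx], X.iloc[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]

    X_tr_enc = preprocessor.fit_transform(X_tr)   # fit on the training fold only
    X_te_enc = preprocessor.transform(X_te)       # reuse the fitted transform

    model = build_x_tlrabilstm(n_features=X_tr_enc.shape[1])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X_tr_enc[..., None], y_tr,
              sample_weight=demographic_weights(groups.iloc[train_idx]),
              epochs=100, batch_size=32, verbose=0)

    y_prob = model.predict(X_te_enc[..., None]).ravel()
    fold_scores.append(evaluate(y_te, y_prob, subgroup_mask=(X_te["sex"] == 1)))

mean_scores = {k: np.mean([s[k] for s in fold_scores]) for k in fold_scores[0]}
```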
The proposed model consistently outperformed both recent deep learning techniques, such as conventional LSTM and Residual Attention BiLSTM, and basic classifiers, such as Random Forest, SVM, and Logistic Regression. With little variation across folds, X-TLRABiLSTM averaged 98.2% accuracy, 98.1% F1-score, and 99.1% AUC. Only seven misclassifications out of 185 test examples appeared in the confusion matrix, highlighting high sensitivity and specificity. SHAP analyses further showed that the model's predictions were consistent with established clinical risk variables, and the fairness evaluation showed balanced performance across age and gender groups (ΔF1 ≤ 0.6). Together, these findings indicate that the design decisions (explainability, residual attention, transfer learning, and fairness reweighting) jointly account for the state-of-the-art performance.
The classification performance of X-TLRABiLSTM is contrasted with several benchmark models in Table 3, such as Residual Attention BiLSTM (RA-BiLSTM), Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), and conventional LSTM. Tenfold cross-validation was used to assess performance.
Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | AUC (%) |
---|---|---|---|---|---|
Logistic Regression | 85 | 83.4 | 86.5 | 84.9 | 88.2 |
Random Forest | 88.4 | 86.7 | 90.2 | 88.4 | 91.5 |
SVM | 86.3 | 84.9 | 87.1 | 86 | 89.7 |
Standard LSTM | 91.2 | 90.1 | 91 | 90.5 | 93.8 |
RA-BiLSTM16 | 97.7 | 97.3 | 97.5 | 97.4 | 98.6 |
X-TLRABiLSTM (Proposed) | 98.2 | 97.9 | 98.3 | 98.1 | 99.1 |
The X-TLRABiLSTM fared better than every competitor model on every evaluation metric. Its 98.2% accuracy and 99.1% AUC demonstrate strong and robust discriminative power in identifying ischemic heart disease. These gains stem from the combined effect of residual attention, transfer learning, and fairness-aware training. Figure 2 compares the accuracy, F1-score, and AUC of the models considered for the prognosis of ischemic heart disease; the proposed X-TLRABiLSTM performs noticeably better on all three criteria.
Figure 3 shows the trade-off between true positive rate and false positive rate (ROC curve), with AUC ≈ 1.00 for the X-TLRABiLSTM model.
Figure 4 presents the confusion matrix for the proposed X-TLRABiLSTM model in ischemic heart disease (IHD) identification, illustrating its strong classification performance. Out of 185 total test instances, the model correctly identified 85 true negatives (patients without IHD) and 93 true positives (patients with IHD). There were only two false negatives and five false positives, indicating a low misclassification rate. These findings demonstrate the model's ability to identify both the presence and absence of the disease precisely, translating into very high sensitivity (recall) and specificity. The low incidence of false negatives is particularly important in clinical settings, where failing to detect a real case of heart disease can have serious repercussions. Overall, the confusion matrix attests to the proposed model's robustness, dependability, and clinical usefulness in real-world situations.
To verify the explainability of the model, SHAP values were computed for each prediction to identify the most significant features. The top contributing features identified by the global SHAP summary plot were chest pain type (cp), ST depression (oldpeak), and thalassemia (thal).
The SHAP summary is depicted in Figure 5, where blue indicates a low feature value and red a high feature value. For instance, a thal category of "fixed defect" and a high oldpeak both contributed strongly toward predicting disease. By grounding each prediction in verifiable clinical data, the SHAP plots increase physician trust and satisfy medical AI explainability criteria.30
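For reference, one possible SHAP workflow for a model of this kind is sketched below. It uses the model-agnostic KernelExplainer because it only needs a prediction function, whereas the authors may have used a different explainer; model, X_train_enc, X_test_enc, and encoded_feature_names are assumed placeholders.

```python
# SHAP sketch (assumed workflow): local attributions and a global summary plot
# for the trained model on the encoded feature matrix.
import shap

background = X_train_enc[:100]                          # small reference sample
predict_fn = lambda data: model.predict(data[..., None]).ravel()

explainer = shap.KernelExplainer(predict_fn, background)
shap_values = explainer.shap_values(X_test_enc[:50])    # per-prediction attributions

# Global importance summary (mean |SHAP| per feature across the explained set):
shap.summary_plot(shap_values, X_test_enc[:50], feature_names=encoded_feature_names)
```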
Fairness was assessed using F1-score parity between demographic categories (male vs. female, age < 50 vs. ≥ 50). The Fairness Gap (ΔF1) was computed as the absolute difference in F1-score between subgroups.
The suggested model achieves nearly equal predictive accuracy across the gender and age subgroups displayed in Table 4, according to the Fairness Gap values (≤ 0.6). Demographic reweighting during the training phase, which addresses biases seen in previous publications,31,33 is directly responsible for this. Figure 6 shows the F1-scores across demographic groups (Male, Female, Age < 50, Age ≥ 50), demonstrating balanced performance.
Subgroup | F1-score (%) |
---|---|
Male | 98.5 |
Female | 97.9 |
ΔF1 (Gender) | 0.6 |
Age < 50 | 98.3 |
Age ≥ 50 | 97.8 |
ΔF1 (Age) | 0.5 |
Across several performance criteria, the proposed X-TLRABiLSTM model outperforms several state-of-the-art methods in the prognosis of Ischemic Heart Disease (IHD). Although they have had some success, traditional models such as Random Forest, SVM, and Logistic Regression frequently fail to capture the nonlinear and temporal dynamics of patient data. Deep learning models such as standard LSTM and hybrid attention-based BiLSTM offer better predictive potential but lack interpretability and fairness control. Recent work by Cenitta et al.16 presented a Residual Attention-Enhanced LSTM model with an outstanding accuracy of 97.7% and an AUC of 98.6%. Our proposed X-TLRABiLSTM model, which combines transfer learning, SHAP-based explainability, and demographic fairness reweighting, further improves accuracy to 98.2%, F1-score to 98.1%, and AUC to 99.1% while also offering transparency and bias mitigation. These enhancements show that X-TLRABiLSTM not only improves performance but also aligns better with ethical and clinical standards, making it a more dependable choice for real-world deployment, as shown in Table 5.
Our experiments confirm that incorporating transfer-learned BiLSTM weights improves generalization and speeds up convergence on limited data. The residual attention mechanism yields superior discriminative power by highlighting important temporal aspects while maintaining contextual signals. SHAP-based interpretability bridges black-box performance and clinical transparency. Finally, demographic reweighting guarantees fair predictions, a prerequisite for ethical AI in healthcare.
Limitations & future work
• Evaluation on larger, multi-institutional cohorts (e.g., MIMIC-IV) is needed to further validate generalizability.
• Real-time deployment may require model compression for edge devices.
• Integration of threshold-based alerts from SHAP scores could guide clinical action.
Overall, X-TLRABiLSTM provides a reliable, comprehensible, and equitable approach for IHD prognosis that is ready to be included into useful clinical decision-support tools.
In this study, we presented X-TLRABiLSTM, an Explainable Transfer Learning-based Residual Attention BiLSTM model for the prognosis of Ischemic Heart Disease. By utilizing transfer learning from large-scale cardiovascular data, integrating SHAP-based explainability, and embedding residual attention to highlight clinically relevant temporal features, the model achieved state-of-the-art performance (98.2% accuracy, 99.1% AUC) on the UCI Heart Disease dataset, outperforming previous hybrid DL approaches. Furthermore, demographic reweighting addressed important bias issues in healthcare AI by guaranteeing fair predictions across age and gender subgroups, with a fairness gap ΔF1 ≤ 0.6. To evaluate the robustness and generalizability of X-TLRABiLSTM, we intend to validate it on larger, multi-institutional cohorts (such as MIMIC-IV). To facilitate proactive clinical decision-making, we will also investigate model compression strategies for real-time, edge-device deployment and develop threshold-based SHAP alerts. By combining high accuracy, transparency, and fairness, X-TLRABiLSTM is a major step toward reliable, AI-driven clinical decision support systems for cardiovascular care.
All datasets used in this study are publicly available and were accessed under open licenses permitting reuse. The Heart Disease dataset was obtained from the UCI Machine Learning Repository and can be accessed at: https://archive.ics.uci.edu/ml/datasets/Heart+Disease
These datasets are distributed under open licenses allowing unrestricted use: CC0 (UCI) and Kaggle’s standard open data license. No additional ethical, privacy, or security concerns apply.