
F1000Research

2046-1402

F1000 Research Limited

London, UK

10.12688/f1000research.166350.1

Research Article

Articles

A Hybrid Anomaly Detection Framework Combining Supervised and Unsupervised Learning for Credit Card Fraud Detection

[version 1; peer review: 1 approved with reservations, 1 not approved]

Mohammad Shanaa (a, 1)

Roles: Conceptualization, Data Curation, Formal Analysis, Methodology, Resources, Software, Visualization, Writing – Original Draft Preparation

ORCID: https://orcid.org/0000-0001-9787-1408

Sherief Abdallah (1)

Roles: Supervision, Validation, Writing – Review & Editing

1 Faculty of Engineering and IT, The British University in Dubai, Dubai, Dubai, United Arab Emirates

a mohammadsshanaa@gmail.com

No competing interests were disclosed.

7 7 2025

2025

664

26 6 2025

2025

This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Credit card fraud detection remains a major challenge because of the highly imbalanced nature of transaction data. Conventional supervised models often suffer from low recall or high false positive rates, whereas unsupervised methods lack precision.

Methods

In this study, we propose a hybrid anomaly detection framework that combines an unsupervised autoencoder trained on normal transactions to capture reconstruction error patterns with a supervised XGBoost classifier trained on the same dataset. The hybrid system integrates both scores via an optimized thresholding mechanism to balance sensitivity and specificity. We evaluated the model on the publicly available Kaggle creditcard.csv dataset comprising 284,807 transactions, with only 492 labelled fraudulent.

Results

The proposed model achieved superior performance, with a recall of 0.9250, precision of 0.9569, F1-score of 0.9407, and Matthews Correlation Coefficient (MCC) of 0.9407, with an accuracy of 0.9998, surpassing the results of similar published models using the same dataset.

Conclusions

This framework provides a practical, reproducible, high-performance solution for detecting financial fraud. The code, model configuration, and data-processing pipeline were made available to support transparency and future research.

Keywords: Fraud Detection, Autoencoder, Isolation Forest, XGBoost, Random Forest, Hybrid Model, Anomaly Detection, Imbalanced Dataset.

The author(s) declared that no grants were involved in supporting this work.

Introduction

Credit card fraud remains one of the most persistent and damaging threats to the digital financial ecosystem. As the volume of online transactions continues to grow, so too does the complexity of fraudulent activities. Global losses are projected to exceed $40 billion annually by 2025, driven by the increasing digitalization of financial services and the constant evolution of fraud tactics. ¹ The core challenge in this domain lies in accurately detecting fraudulent transactions that are rare (less than 1% of all transactions), adaptive, and often indistinguishable from legitimate user behavior. This imbalance between legitimate and fraudulent transactions significantly impairs the performance of both conventional and machine learning-based detection systems, often leading to biased predictions and poor generalizability across datasets. ^{2,3} Traditional fraud detection methods struggle to scale effectively in such dynamic and imbalanced environments, frequently resulting in missed fraud cases or excessive false positives. Detection systems frequently encounter difficulties in balancing sensitivity and specificity; enhancing fraud detection (true positives) often leads to an increase in false positives, thereby disrupting the customer experience and straining resources. Conversely, conservative models may fail to identify fraudulent activities, leading to financial losses and reputational harm.

Recent research has highlighted the potential of hybrid models that combine supervised classification techniques with unsupervised anomaly detection to enhance both the precision and robustness of fraud detection. For instance, studies integrating techniques, such as autoencoders, isolation-based methods, and gradient boosting classifiers, have demonstrated improved performance in identifying complex and evolving fraud patterns. ⁴ However, many of these models still lack generalizability or require substantial computational resources, which limits their practical application in real-time financial environments.

The aim of this study is to develop and evaluate a hybrid anomaly detection framework that integrates both supervised and unsupervised learning techniques to improve the accuracy, robustness, and generalizability of credit card fraud-detection systems. This study specifically targets the challenges posed by imbalanced data, evolving fraud patterns, and limitations of single-model detection strategies.

Our approach is empirically validated using the publicly available European credit card fraud dataset, which presents realistic challenges including severe class imbalance. We conducted comprehensive experiments to measure the performance of the model across standard evaluation metrics and benchmarked its results against state-of-the-art techniques. Using this approach, this study aims to demonstrate the practical value and academic contribution of hybrid learning models in improving credit card fraud detection.

This study makes the following contributions:

1. A novel hybrid anomaly detection framework that integrates supervised (XGBoost, Random Forest) and unsupervised (Autoencoder, Isolation Forest) models is proposed to address the challenges of data imbalance and concept drift in credit card fraud detection.

2. Comparative analysis of the hybrid model against state-of-the-art models using the publicly available and widely adopted Kaggle creditcard.csv dataset.

3. A reproducible pipeline suitable for adaptation in real-world applications that balances detection accuracy with computational efficiency.

Related work

Credit card fraud detection has become increasingly critical with the rapid expansion of online transactions and growing sophistication of fraudulent activities. Contemporary trends underscore the adoption of advanced machine learning (ML) techniques, which have shown considerable promise in enhancing both the accuracy and efficiency of fraud-detection systems. Nevertheless, these advancements have introduced several challenges, particularly the limitations of traditional anomaly detection methods and the constraints inherent in current ML-based models.

Traditional approaches to anomaly detection, including rule-based systems and statistical models, have long served as the foundation for fraud detection. However, these techniques frequently struggle to address the dynamic and adaptive nature of fraudulent behavior, which often mimics legitimate transaction patterns. Consequently, they tend to exhibit high false positive rates. ^{5,6} Moreover, such approaches generally fail to scale effectively with the vast and continuously growing volume of transaction data, rendering them less viable in real-time fraud detection scenarios. ^{7,8} Consequently, there has been an increasing shift toward machine learning models that are better equipped to manage large datasets and adapt to evolving fraud strategies. ^{9,10}

Despite their advantages, existing ML models are not without limitations. A primary concern is the class imbalance inherent in credit card transaction datasets, where legitimate transactions overwhelmingly outnumber fraudulent transactions. This imbalance often leads to skewed model performance, resulting in a high rate of false negatives in which fraudulent transactions remain undetected. ^{2,3} Additionally, many ML models demand extensive feature engineering and frequently struggle to generalize across datasets because of variations in consumer behavior and transaction patterns. ^{11,12} The scarcity of accurately labelled fraudulent transactions further complicates the training process, as acquiring such labels is challenging in real-world settings. ¹³

Hybrid approaches have emerged as promising solutions for mitigating these issues. By combining different methodologies, researchers have been able to enhance detection accuracy and reduce false positives. ^{14,15} For example, hybrid models that integrate convolutional neural networks with support vector machines have demonstrated improved performance in identifying anomalies in financial datasets. ¹⁵ These methods exploit the strengths of diverse algorithms and contribute more robust and generalizable detection capabilities. Moreover, similar hybrid strategies have shown effectiveness in other domains facing anomaly detection challenges, including healthcare and cybersecurity. ¹⁴

In the context of fraud detection research, several benchmark datasets are frequently used, notably the European Credit Card Transactions dataset and the Kaggle Credit Card Fraud Detection dataset. These datasets are distinguished by their high dimensionality and extreme class imbalance, with fraudulent instances often comprising less than 1% of the total records. ^{2,3} In particular, the European dataset includes anonymized transaction features derived from Principal Component Analysis (PCA) to ensure user privacy, making it suitable for academic use. ^{12,16} Such datasets are instrumental in training and evaluating fraud-detection models because they closely reflect the complexities encountered in real-world applications.

In summary, although traditional anomaly detection techniques have laid the foundational framework for credit card fraud detection, the adoption of machine learning and hybrid methodologies opens new possibilities for improving the detection efficacy. Nonetheless, persistent challenges necessitate ongoing research in this field. The advancement of more sophisticated hybrid models and the utilization of comprehensive real-world datasets will be essential to overcome these hurdles and further progress in this critical area.

In the domain of credit card fraud detection, unsupervised learning methods have garnered increasing attention owing to their capacity to identify anomalies without relying on labelled data. Among these, clustering algorithms such as DBSCAN and HDBSCAN have demonstrated considerable potential. For instance, one study ¹ reported that combining HDBSCAN with UMAP and SMOTE enables the identification of previously unseen fraud patterns while significantly reducing false positives. Similarly, deep-learning-based anomaly detection frameworks, such as the attentional anomaly detection network proposed in a related study, ¹⁶ show promise for capturing behavioral transaction anomalies without the need for predefined class labels. These approaches are particularly advantageous in real-world contexts where labelled fraudulent data are limited, allowing the detection of novel fraud patterns that traditional supervised models may overlook. ¹⁷

Conversely, supervised learning techniques, particularly gradient boosting methods such as XGBoost, have been widely adopted owing to their robustness and interpretability. One study ² highlighted the effectiveness of XGBoost when paired with data augmentation strategies such as SMOTE-ENN, achieving high accuracy with low false-positive rates. Further evidence ¹⁸ demonstrated that integrating XGBoost with resampling methods enhanced the overall performance across a range of machine learning models. Notably, the inherent capability of XGBoost to handle imbalanced datasets makes it particularly well-suited for credit card fraud detection, where fraudulent transactions comprise only a small fraction of the total dataset. ¹⁰

Hybrid approaches integrating supervised and unsupervised learning have emerged as promising strategies. One study, ¹⁴ for example, presented a deep learning model combined with SMOTE oversampling, which effectively addressed the class imbalance issue while improving the detection accuracy. Similarly, another study ¹⁹ illustrated the benefits of combining neural networks with traditional machine learning techniques to enhance the overall detection efficacy. These hybrid models exploit the complementary strengths of each learning paradigm, thereby resulting in adaptive and accurate systems.

Despite these advancements, several persistent challenges continue to hinder optimal fraud detection performance. A primary issue is class imbalance, wherein the overwhelming dominance of legitimate transactions can bias models and reduce their sensitivity to fraudulent instances. ¹¹ Additionally, the constantly evolving tactics of fraudsters necessitate frequent model retraining and updates, which can be both computationally and operationally demanding. ¹¹ Scalability is also a concern, as many models exhibit performance degradation when deployed in large-scale or real-time transaction streams. ²⁰

The performance metrics across existing models vary significantly in terms of scalability, accuracy, and operational efficiency. Research indicates that ensemble techniques that combine multiple classifiers tend to outperform individual models in terms of their robustness and accuracy. ²¹ However, the increased computational requirements of ensemble models may limit their applicability in time-sensitive scenarios. ²⁰ In contrast, XGBoost has often been identified as a suitable compromise, offering a favorable balance between predictive performance and computational efficiency, which makes it attractive for real-world fraud detection systems. ^{2,22}

Research into hybrid anomaly detection models typically seeks to fulfil several key objectives, including enhancing detection accuracy, improving robustness against emerging fraud patterns, and integrating both supervised and unsupervised learning techniques to capitalize on the strengths of each approach. Hybrid models are particularly advantageous in scenarios where labelled data are limited because they enable the use of unsupervised methods to identify anomalies, whereas supervised models refine and validate these detections. ^{23–25} For example, integrating supervised models that learn from historical transaction data with unsupervised models capable of detecting novel anomalies facilitates a more comprehensive detection framework, addressing the limitations of methods that rely solely on a single learning paradigm. ^{23,24}

The literature highlights notable gaps in existing anomaly detection frameworks, particularly their limited adaptability to evolving fraud patterns and poor generalizability across diverse datasets. Hybrid models offer a promising solution to these issues by leveraging various data sources and learning strategies, thereby increasing their effectiveness in real-world deployment. ^{26,27} For instance, studies incorporating Generative Adversarial Networks (GANs) into traditional machine learning workflows have demonstrated improved detection of complex fraud patterns that may elude conventional models. ⁴ Moreover, the flexibility of hybrid models supports continuous learning and adaptation, which are essential features of the constantly evolving fraud landscape. ^{23,24}

Success in fraud detection research is typically measured using performance metrics such as accuracy, precision, recall, and F1-score, which collectively evaluate a model’s capability to correctly identify fraudulent transactions while maintaining operational efficiency. ^{28,29} Minimizing false positives and effectively identifying previously unseen fraud cases are also critical indicators of success. ^{23,24} Models that strike a balance between high accuracy and low false positive rates are particularly valued, as they reduce the burden of manual transaction reviews and minimize disruption to legitimate users. ^{23,24,29}

Both supervised and unsupervised learning play an integral role in addressing the research challenges in fraud detection. Supervised learning is particularly effective when sufficient labelled data are available, enabling the model to learn the distinctions between fraudulent and non-fraudulent transactions. ^{30,31} By contrast, unsupervised learning excels in scenarios where labels are unavailable, identifying novel or emerging fraud patterns without prior examples. ^{23–25} The integration of both techniques enhances not only the model’s detection capacity but also the interpretability and adaptability of the fraud detection framework, as evidenced by research that underscores their complementary nature. ^{24,25}

In the literature, “success” in fraud detection is frequently defined in terms of balancing detection performance with operational efficiency. This includes the ability to accurately detect fraudulent transactions with minimal false positives, thereby ensuring that genuine users are not adversely affected. ^{23–25} Furthermore, a model’s adaptability to new fraud typologies and its performance across various datasets are equally important for assessing its practical applicability and overall robustness. ^{23–25}

Unsupervised methods

Autoencoders

Autoencoders have emerged as powerful tools for feature extraction in anomaly detection, particularly in fraud detection. By leveraging their ability to learn compressed representations of data, autoencoders can effectively identify anomalies by reconstructing the input data and measuring the reconstruction error. This process allows for the extraction of relevant features that distinguish normal data from anomalies, as the model learns to ignore noise and irrelevant information during training. ^{32–34} The architecture of autoencoders, which typically consists of an encoder and a decoder, facilitates dimensionality reduction, making them suitable for the high-dimensional datasets often encountered in fraud detection scenarios. ^{35,36}

Despite their advantages, autoencoders have limitations when applied to unsupervised-learning tasks. A significant challenge is determining an appropriate reconstruction error threshold, which is crucial for distinguishing between normal and anomalous instances. This threshold can be influenced by the distribution of reconstruction errors, and improper selection may lead to high false positive rates or missed detections. ^{33,37,38} Moreover, autoencoders can struggle with class imbalances because they are typically trained on predominantly normal data, making it difficult to generalize to rare fraudulent instances. ^{37,39} Additionally, the complexity of the model can lead to overfitting, particularly when the training dataset is small or lacks diversity. ^{40,41}
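The threshold-selection problem can be illustrated with a minimal sketch. It assumes that an autoencoder has already produced reconstructions and that reconstruction errors are available for a held-out set of normal transactions. The 95th-percentile rule, the function names, and the sample values are hypothetical choices for illustration and are not taken from the pipeline used in this study.

```python
import statistics


def reconstruction_error(original, reconstructed):
    """Mean squared error between an input vector and its reconstruction."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)


def percentile_threshold(normal_errors, percentile=95):
    """Return the requested percentile of errors seen on normal transactions."""
    # With n=100, statistics.quantiles returns the 99 cut points P1..P99.
    cut_points = statistics.quantiles(normal_errors, n=100, method="inclusive")
    return cut_points[percentile - 1]


def flag_anomalies(errors, threshold):
    """Flag a transaction when its reconstruction error exceeds the threshold."""
    return [error > threshold for error in errors]


# Hypothetical errors measured on 100 held-out normal transactions.
normal_errors = [0.01 * i for i in range(1, 101)]
threshold = percentile_threshold(normal_errors, percentile=95)

# Hypothetical errors for three new transactions; only the last one is flagged.
flags = flag_anomalies([0.20, 0.90, 3.50], threshold)
```

Lowering the percentile flags more transactions (higher recall, more false positives), whereas raising it does the opposite, which is the trade-off the paragraph above describes.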

Compared with other unsupervised methods used in fraud detection, such as clustering and traditional statistical methods, autoencoders often demonstrate superior performance because of their ability to learn complex, non-linear relationships in the data. ^{35,39,42} For example, while clustering methods may struggle with high-dimensional data, autoencoders can effectively reduce dimensionality and capture intricate patterns that signify fraudulent behavior. ^{35,39,42} Furthermore, ensemble methods that combine autoencoders with other algorithms, such as Random Forests or Gradient Boosting, have shown promising results in improving detection accuracy and robustness against class imbalance. ^{40,41}

In summary, autoencoders are effective for feature extraction in anomaly detection, particularly in fraud detection. Variants such as VAEs and LSTM autoencoders make them suitable for various data types. However, issues such as threshold determination and class imbalance require further investigation. In this study, we combined autoencoders with other models to enhance the results and address these challenges.

Isolation forest

The Isolation Forest algorithm is a powerful tool for anomaly detection, particularly in financial datasets. It operates on the principle of isolating anomalies rather than profiling normal data points. This is achieved by constructing an ensemble (forest) of isolation trees, where each tree is built by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of that feature. Anomalies are identified as instances that require fewer splits to be isolated because they are often located far from the majority of the data points in the feature space. ^{43,44} This characteristic makes isolation forests particularly effective in high-dimensional datasets, where traditional methods may struggle owing to the curse of dimensionality. Studies have shown that isolation forests maintain robust performance in high-dimensional settings, effectively identifying outliers even when dimensionality increases significantly. ^{43,45}
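The isolation principle can be illustrated with a small standard-library sketch. It is a deliberately simplified stand-in, not the full Isolation Forest algorithm: it follows only the path of one query point, uses no subsampling, and reports raw path lengths instead of a normalized anomaly score. The data, feature meanings, and function names are hypothetical.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=12):
    """Count the random splits needed to isolate `point` from `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))          # pick a random feature
    values = [row[feature] for row in data]
    low, high = min(values), max(values)
    if low == high:                              # nothing left to split on
        return depth
    split = rng.uniform(low, high)               # pick a random split value
    # Follow the branch that contains `point`, as a single tree path would.
    if point[feature] < split:
        branch = [row for row in data if row[feature] < split]
    else:
        branch = [row for row in data if row[feature] >= split]
    return path_length(point, branch, rng, depth + 1, max_depth)


def mean_path_length(point, data, n_trees=100, seed=0):
    """Average path length over `n_trees` independent random trees."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees


# Hypothetical two-feature data: (amount, one PCA-style component).
gen = random.Random(42)
cluster = [(gen.gauss(50.0, 5.0), gen.gauss(0.0, 1.0)) for _ in range(200)]
outlier = (5000.0, 25.0)
data = cluster + [outlier]

outlier_depth = mean_path_length(outlier, data)
inlier_depth = mean_path_length(cluster[0], data)
```

The point lying far from the cluster is separated after far fewer random splits than a typical cluster member, which is the signal an isolation forest converts into an anomaly score.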

Parameter tuning is crucial for optimizing the performance of the Isolation Forest algorithm. Common techniques include adjusting the number of trees in the forest and the subsampling size, both of which influence the sensitivity of the model to anomalies. For instance, increasing the number of trees generally improves the robustness of the model, while the subsampling size can be tuned to balance computational efficiency and detection accuracy. ^{45,46} In terms of computational advantages, the Isolation Forest algorithm is highly efficient, with time complexity that is linear in the number of data points, making it scalable to large datasets. ^{44,47}

Isolation forests can also be integrated into hybrid models to enhance their anomaly detection capabilities. For example, they can be combined with supervised learning techniques to refine the detection process by leveraging labelled data for training. This integration allows for improved feature selection and anomaly characterization, leading to better overall performance in detecting complex patterns in financial datasets. ^{48,49} Such hybrid approaches can utilize the strengths of multiple algorithms, thereby improving the robustness and accuracy of anomaly detection frameworks in various applications, including fraud detection in banking and finance. ⁴⁹

In summary, the Isolation Forest algorithm is a robust method for detecting anomalies in financial datasets and is particularly effective in high-dimensional spaces. Parameter tuning plays a critical role in optimizing its performance, and its computational efficiency makes it suitable for large datasets. Integrating isolation forests with other methods in hybrid models can further enhance their anomaly detection capabilities.

Supervised methods

XGBoost

Extreme gradient boosting (XGBoost) has emerged as a powerful tool for fraud detection, particularly in the context of imbalanced datasets. The algorithm’s inherent ability to handle imbalanced data stems from its gradient boosting framework, which optimizes the model by focusing on misclassified instances, thereby enhancing its sensitivity to minority classes, such as fraudulent transactions. This characteristic is crucial in fraud detection, where legitimate cases significantly outnumber fraudulent ones. ^{50,51} Furthermore, XGBoost incorporates regularization techniques that help mitigate overfitting, which is a common challenge in machine learning models trained on imbalanced datasets. ^{50,51}
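One common way to make a gradient-boosted classifier more sensitive to the minority class is to up-weight fraudulent examples. The sketch below derives such a weight from the class counts reported for the dataset. `objective` and `scale_pos_weight` are XGBoost parameter names, but whether this study configured them in this way is an assumption, and in practice the counts would be taken from the training split only.

```python
# Class counts reported for the creditcard.csv dataset.
TOTAL_TRANSACTIONS = 284_807
FRAUD_TRANSACTIONS = 492

legitimate = TOTAL_TRANSACTIONS - FRAUD_TRANSACTIONS

# Share of fraudulent transactions in the whole dataset.
fraud_rate = FRAUD_TRANSACTIONS / TOTAL_TRANSACTIONS

# Ratio of negative to positive examples, a typical starting value
# for weighting the positive (fraud) class.
scale_pos_weight = legitimate / FRAUD_TRANSACTIONS

# Hypothetical parameter dictionary that could be passed to an XGBoost model.
params = {
    "objective": "binary:logistic",
    "scale_pos_weight": scale_pos_weight,
}
```

With these counts, fraud accounts for roughly 0.17% of the records, so each fraudulent example would be weighted about 578 times as heavily as a legitimate one.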

Hyperparameter tuning is essential for optimizing the performance of XGBoost in fraud detection tasks. Techniques such as grid search, random search, and more advanced methods such as Bayesian optimization have been employed to identify the most effective hyperparameters. For instance, the use of Bayesian optimization has been shown to enhance the model’s ability to balance training weights for asymmetric examples, which is particularly beneficial in fraud-detection scenarios. ^{52,53}

Compared with other supervised learning methods, XGBoost consistently demonstrates superior performance in fraud-detection tasks. Studies have shown that XGBoost outperforms traditional models such as logistic regression and decision trees as well as other ensemble methods such as random forests. This superiority is attributed to its ability to capture complex non-linear relationships and interactions between features, which are often present in fraud detection datasets. ^{54,55} Moreover, XGBoost’s feature importance capabilities allow practitioners to gain insights into the most influential predictors of fraud, further enhancing model interpretability and decision-making processes. ^{19,56}

Researchers have also explored the integration of XGBoost with hybrid anomaly-detection models. For instance, combining XGBoost with unsupervised learning techniques allows for the extraction of patterns from data that can be used as new features, thereby improving the robustness of the model against noise and outliers. ⁵⁷

In conclusion, XGBoost’s optimization for fraud detection in imbalanced datasets is facilitated by its robust handling of misclassifications, effective hyperparameter tuning techniques, and superior performance compared to other supervised learning methods. The role of feature importance is critical in refining model performance, while hybrid approaches continue to expand the capabilities of XGBoost in anomaly detection scenarios.

Random forest

Random Forest (RF) is a versatile ensemble technique that has been broadly applied in anomaly detection for both supervised and semi-supervised learning tasks. In fully supervised settings, RF algorithms are trained with labelled examples covering both normal and anomalous classes, thereby enabling the model to learn complex non-linear decision boundaries that can reliably separate rare and abnormal events. ⁵⁸ In contrast, semi-supervised applications typically exploit RF’s ability to capture underlying data distributions by training exclusively on normal (or “positive”) samples and subsequently flagging deviations as anomalies. ⁵⁹

The performance of RF is particularly noteworthy in high-dimensional and large-scale datasets such as those encountered in credit card fraud detection. RF can naturally handle large numbers of features owing to its random feature subspace selection at each split, which mitigates overfitting and improves generalization. ⁶⁰ Empirical studies have demonstrated that RF-based methods perform competitively in scenarios characterized by rare events, such as fraud detection, by effectively identifying subtle patterns that differentiate fraudulent from legitimate behaviors. ⁶⁰ Nevertheless, the class imbalance inherent in such applications often calls for hybrid or improved approaches, for example, through combination with feature selection procedures or integration with unsupervised algorithms, to further boost detection accuracy.

Hybrid integration

Hybrid models, which combine unsupervised and supervised learning techniques, have gained traction in various fields owing to their ability to leverage the strengths of both approaches. The integration of unsupervised outputs with supervised methods can enhance the predictive performance, particularly in scenarios where labelled data are scarce. This synthesis typically involves several strategies, including feature extraction, ensemble methods, and model stacking, which can significantly improve the overall performance of the hybrid models.

One effective integration strategy is the use of unsupervised learning for feature extraction, which can reduce dimensionality and capture underlying patterns in the data. For instance, autoencoders or clustering algorithms can preprocess data before they are fed into a supervised learning model, thereby enhancing its predictive capabilities. ^{61,62} In addition, ensemble methods that combine predictions from unsupervised and supervised models can lead to more robust outcomes. For example, a hybrid model that integrates predictions from a clustering algorithm with those from a regression model can yield better accuracy than either model alone. ⁶³

Handling conflicting outputs from unsupervised and supervised models is a critical challenge in hybrid modelling. Researchers often employ conflict resolution strategies such as voting mechanisms, where the final decision is based on the majority output, or weighted averaging, where outputs are combined based on their reliability or performance metrics. ^{64,65} This approach allows for more nuanced integration of the models, ensuring that the final output reflects the strengths of both methodologies. In this study, we utilized a weighting method to combine the outputs of supervised and unsupervised algorithms.
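The two conflict-resolution strategies named above, weighted averaging and majority voting, can be sketched as follows. This is a minimal illustration under stated assumptions: the weight of 0.7, the decision threshold of 0.5, the min-max scaling step, and all function names are hypothetical and do not reproduce the exact weighting scheme used in this study.

```python
def min_max_scale(values):
    """Rescale raw anomaly scores to the [0, 1] range."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


def weighted_hybrid_score(supervised_probs, anomaly_scores, weight=0.7):
    """Blend classifier probabilities with scaled anomaly scores."""
    scaled = min_max_scale(anomaly_scores)
    return [weight * p + (1 - weight) * a
            for p, a in zip(supervised_probs, scaled)]


def majority_vote(*model_flags):
    """Flag a transaction when more than half of the models flag it."""
    return [sum(votes) * 2 > len(votes) for votes in zip(*model_flags)]


# Hypothetical outputs for three transactions.
supervised_probs = [0.05, 0.40, 0.95]     # e.g. classifier fraud probabilities
anomaly_scores = [0.0, 2.0, 4.0]          # e.g. reconstruction errors

hybrid = weighted_hybrid_score(supervised_probs, anomaly_scores, weight=0.7)
flags = [score >= 0.5 for score in hybrid]

votes = majority_vote(
    [True, False, True],
    [True, True, False],
    [False, False, True],
)
```

Weighted averaging keeps a continuous score that can be thresholded later, whereas voting discards score magnitudes, so the former leaves more room for tuning the balance between sensitivity and specificity.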

In summary, hybrid models that integrate unsupervised and supervised methods offer significant advantages in terms of predictive performance and robustness. By employing effective integration strategies, resolving conflicts between outputs, and utilizing appropriate benchmarks for evaluation, researchers can harness the strengths of both methodologies to address complex challenges across various domains.

Evaluation metrics

In fraud detection studies, various evaluation metrics are employed to assess the performance of the models. Commonly used metrics include accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC-ROC). Each metric provides unique insights into the effectiveness of the model in identifying fraudulent activity.

Precision, recall, and F1-score are particularly significant in the context of anomaly detection. Precision measures the proportion of true positive predictions among all positive predictions, indicating how many of the flagged instances were actually fraudulent. Recall, by contrast, assesses the proportion of true positives among all actual positives, reflecting the model’s ability to identify all relevant instances. The F1-score is the harmonic mean of precision and recall, providing a single metric that balances both concerns. In fraud detection, where false positives can lead to unnecessary investigations and false negatives can result in undetected fraud, these metrics are crucial for evaluating model performance. ^{66,67}

$$\text{Precision} = \frac{TP}{TP + FP}$$

Precision: of all predicted positive cases, how many were actually positive.

$$\text{Recall} = \frac{TP}{TP + FN}$$

Recall: of all actual positive cases, how many were correctly predicted.

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

F1-score: the harmonic mean of precision and recall, a balance between the two.

Where:

TP = True Positives

TN = True Negatives

FP = False Positives

FN = False Negatives
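The definitions above translate directly into code. The following minimal sketch uses hypothetical confusion-matrix counts for illustration and also checks that the precision and recall reported in the abstract are consistent with the reported F1-score; the function names are illustrative.

```python
def precision(tp, fp):
    """Share of flagged transactions that are truly fraudulent."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of fraudulent transactions that were flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: 90 frauds caught, 10 false alarms, 30 frauds missed.
example_precision = precision(tp=90, fp=10)      # 0.9
example_recall = recall(tp=90, fn=30)            # 0.75
example_f1 = f1_score(example_precision, example_recall)

# Consistency check against the values reported in the abstract.
reported_f1 = f1_score(0.9569, 0.9250)
```

Rounded to four decimals, `reported_f1` equals 0.9407, matching the F1-score stated in the abstract.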

The trade-off between accuracy and computational efficiency is critical for fraud detection. While accuracy provides a straightforward measure of overall correctness, it can be misleading in imbalanced datasets, which are common in fraud-detection scenarios where fraudulent cases are rare compared with legitimate ones. Computational efficiency, on the other hand, refers to the time and resources required to train and deploy models. Models that achieve high accuracy may require extensive computational resources, making them less practical for real-time fraud detection applications. Therefore, a balance must be struck between achieving high accuracy and maintaining computational efficiency to ensure that the models can operate effectively in real-world environments. ^{66,67}
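A short worked example shows why accuracy alone can mislead on imbalanced data. It evaluates a trivial, hypothetical classifier that labels every transaction as legitimate, using the class counts reported for the creditcard.csv dataset.

```python
total = 284_807      # transactions in the dataset
fraud = 492          # transactions labelled fraudulent

# A trivial classifier that never predicts fraud.
tp, fp = 0, 0
fn, tn = fraud, total - fraud

accuracy = (tp + tn) / total          # share of correct predictions
fraud_recall = tp / (tp + fn)         # share of frauds actually caught
```

This classifier is correct on about 99.83% of transactions yet detects no fraud at all, which is why recall, precision, F1-score, and MCC are more informative than accuracy in this setting.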

AUC-ROC curves are instrumental in assessing model performance, particularly in binary classification tasks such as fraud detection. The ROC (Receiver Operating Characteristic) curve plots the true positive rate against the false positive rate at various threshold settings, allowing for visualization of the trade-off between sensitivity and specificity. The AUC (Area Under the Curve) quantifies the overall ability of the model to discriminate between the positive and negative classes, with values closer to 1 indicating better performance. AUC-ROC is particularly useful in fraud detection because it provides a comprehensive view of the model’s performance across different decision thresholds, aiding in the selection of an optimal threshold for deployment. ^{68–
70}
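To make the AUC concrete, the sketch below uses its rank interpretation: the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The function name and the toy labels and scores are assumptions for illustration; the pairwise loop is quadratic and is meant for explanation, not for a dataset of this size.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = fraud) and anomaly scores.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

In this toy case three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75.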

Several datasets and competitions serve as benchmarks for comparing fraud-detection models. For instance, Kaggle competitions often provide standardized datasets for benchmarking machine-learning models. Additionally, the UCI Machine Learning Repository includes various datasets relevant to fraud detection, allowing researchers to compare their models with established baselines. These benchmarks facilitate the evaluation of new methods against existing approaches and promote advancements in the field. ^{66,
67}

In summary, the evaluation metrics commonly used in fraud-detection studies include precision, recall, F1-score, and AUC-ROC. Each metric offers valuable insights into the model performance, particularly in the context of imbalanced datasets. The trade-off between accuracy and computational efficiency highlights the need for practical solutions for real-time applications. AUC-ROC curves serve as vital tools for assessing model discrimination capabilities, whereas established benchmarks provide a framework for comparative analysis in the field. The researcher used precision, recall, and F1-score to evaluate the performance of the hybrid model. Additionally, AUC-ROC and MCC (Matthews Correlation Coefficient) values were calculated to obtain insights into the model results.
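The MCC is mentioned above without a formula. As a hedged sketch under its standard definition, it can be computed from the four confusion-matrix counts as follows; the function name and the example counts are hypothetical.

```python
import math


def mcc_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient, ranging from -1 to 1."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts: a perfect classifier and one that never predicts fraud.
print(mcc_from_counts(tp=5, tn=95, fp=0, fn=0))
print(mcc_from_counts(tp=0, tn=95, fp=0, fn=5))
```

Unlike accuracy, the second case scores 0 rather than 0.95, which is why the MCC is often preferred for imbalanced problems.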

Methods

Dataset description

The creditcard.csv dataset, which is widely utilized in fraud-detection research, consists of anonymized credit card transaction records from European cardholders. It contains the transaction time, the transaction amount, and anonymized features derived through PCA (Principal Component Analysis) to protect user privacy. A notable aspect of this dataset is its significant class imbalance: fraudulent transactions are vastly outnumbered by legitimate ones, which presents a challenge for machine learning models. ^{30,
71,
72} The dataset consists of 284,807 transactions, with only 492 labelled as fraudulent, highlighting the difficulty of detecting fraud owing to the rarity of positive instances. ^{9,
73}

The quality of datasets significantly affects the performance of hybrid models in fraud detection. High-quality datasets enable more accurate feature extraction and model training, leading to improved detection rates and reduced false positives. ^{20,
74} Conversely, poor-quality datasets can result in overfitting, where models perform well on training data but fail to generalize to unseen data, ultimately undermining their effectiveness in real-world applications. ^{9,
75} Therefore, ensuring high-quality data is essential for developing reliable and efficient fraud-detection systems.

This study employs the publicly available creditcard.csv dataset from Kaggle, which contains anonymized credit card transaction data from European cardholders. The dataset consists of 284,807 transactions, of which 492 are labelled as fraudulent, representing approximately 0.17% of the total data. The features include 28 principal components (V1–V28) derived through Principal Component Analysis (PCA) to preserve privacy, along with the Amount, Time, and Class attributes. The Class variable serves as the binary target label, where 1 indicates fraud, and 0 represents a legitimate transaction.
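The arithmetic behind the imbalance figure can be checked in a few lines. The sketch also shows the accuracy of a trivial classifier that labels every transaction as legitimate, computed over the full dataset as an illustration (not over the study's test split); the variable names are assumptions.

```python
TOTAL_TRANSACTIONS = 284_807  # from the dataset description
FRAUD_TRANSACTIONS = 492      # from the dataset description

fraud_rate = FRAUD_TRANSACTIONS / TOTAL_TRANSACTIONS
# Accuracy of a classifier that never predicts fraud.
majority_baseline_accuracy = 1 - fraud_rate

print(f"fraud rate: {fraud_rate:.4%}")
print(f"all-legitimate baseline accuracy: {majority_baseline_accuracy:.4%}")
```

Because such a baseline already exceeds 99.8% accuracy while detecting no fraud at all, accuracy alone says little about minority-class performance.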

Data preprocessing and class imbalance

The preprocessing challenges in real-world financial datasets are prevalent and multifaceted. Common issues include handling missing values, addressing class imbalances, and ensuring data privacy and security. ^{2,
6,
76}

As a preprocessing step, the MinMaxScaler technique was used. MinMaxScaler is a widely used data preprocessing technique that transforms numerical features by rescaling them to a specified range, typically between 0 and 1. This scaling method preserves the relationships between the original data values while ensuring that all features contribute proportionately to model training. It is particularly effective for distance-based algorithms and neural networks, which are sensitive to differences in feature magnitude. Scaling standardizes features such as transaction amounts or time-related attributes, enabling models such as autoencoders to converge more quickly and effectively.
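The rescaling itself is a one-line formula, x' = (x - min) / (max - min). The following is a minimal pure-Python sketch of that formula, not the library implementation used in the study; the function name and the sample amounts are hypothetical.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Linearly rescale a list of numbers into feature_range."""
    low, high = feature_range
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # Constant feature: map every value to the lower bound.
        return [low for _ in values]
    return [low + (v - v_min) / span * (high - low) for v in values]


# Hypothetical transaction amounts.
amounts = [0.0, 25.0, 50.0, 100.0]
print(min_max_scale(amounts))
```

In a train/test workflow the minimum and maximum would normally be taken from the training data only and then reused for the test data, so that no information leaks from the test split.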

Additionally, the researchers employed Principal Component Analysis (PCA) as a preprocessing tool for dimensionality reduction, a technique widely used in anomaly detection. By transforming high-dimensional data into a lower-dimensional space, PCA helps identify patterns and anomalies more efficiently. This is achieved by projecting the data onto the directions of maximum variance, which filters out noise and irrelevant features that can obscure anomalies. ^{77,
78}
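To illustrate what "projecting onto the directions of maximum variance" means, the sketch below estimates only the first principal component by power iteration on the sample covariance matrix. It is a teaching example in plain Python under stated assumptions (hypothetical function names, toy 2-D points), not the PCA routine used in the study.

```python
import math


def first_principal_component(rows, iterations=100):
    """Estimate the direction of maximum variance by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    direction = [1.0] * d  # arbitrary start vector
    for _ in range(iterations):
        product = [sum(cov[i][j] * direction[j] for j in range(d))
                   for i in range(d)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0:
            break  # the data has no variance
        direction = [x / norm for x in product]
    return means, direction


def project(rows, means, direction):
    """Coordinate of each centered row along the given direction."""
    return [sum((row[j] - means[j]) * direction[j] for j in range(len(direction)))
            for row in rows]


# Hypothetical points lying on the line y = 2x.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
means, direction = first_principal_component(points)
scores = project(points, means, direction)
print(direction, scores)
```

For these points the recovered direction is proportional to (1, 2), so a single coordinate per point retains all of the variance.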

Furthermore, class imbalance, where legitimate transactions far outnumber fraudulent transactions, complicates the training of machine-learning models, often leading to biased predictions that favor the majority class. ^{72,
79} To address this imbalance in the creditcard.csv dataset, the researchers employed the BorderlineSMOTE method. However, this method was applied exclusively during the training of the supervised methods because it adversely affects the unsupervised algorithms.
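The idea behind Borderline-SMOTE can be sketched in plain Python: only minority points whose nearest neighbours are mostly, but not entirely, from the majority class are used as seeds, and synthetic points are interpolated between a seed and one of its minority neighbours. This is a simplified illustration, not the library implementation used in the study; the function name, `k`, `n_new`, and the toy data are all assumptions.

```python
import math
import random


def borderline_smote(X, y, n_new, k=3, seed=0):
    """Simplified Borderline-SMOTE sketch (minority label = 1, majority = 0)."""
    rng = random.Random(seed)
    labelled = list(zip(X, y))
    minority = [x for x, label in labelled if label == 1]

    # "Danger" points: at least half, but not all, of the k nearest
    # neighbours belong to the majority class.
    danger = []
    for x in minority:
        neighbours = sorted(
            (pair for pair in labelled if pair[0] is not x),
            key=lambda pair: math.dist(x, pair[0]),
        )[:k]
        majority_count = sum(1 for _, label in neighbours if label == 0)
        if k / 2 <= majority_count < k:
            danger.append(x)

    synthetic = []
    while danger and len(synthetic) < n_new:
        seed_point = rng.choice(danger)
        candidates = sorted(
            (m for m in minority if m is not seed_point),
            key=lambda m: math.dist(seed_point, m),
        )[:k]
        if not candidates:
            break
        neighbour = rng.choice(candidates)
        gap = rng.random()  # position along the segment seed_point -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(seed_point, neighbour))
        )
    return synthetic


# Hypothetical 2-D training split; a test split would never be oversampled.
X_train = [(0, 0), (0, 1), (1, 0), (1, 1), (1.5, 1.5), (2, 2), (5, 5), (5, 6)]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]
new_fraud_points = borderline_smote(X_train, y_train, n_new=4)
print(new_fraud_points)
```

Here only the two fraud points next to the legitimate cluster act as seeds; the two isolated fraud points are treated as safe and are never used as starting points.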

Results and Discussion

In this study, XGBoost and Random Forest were employed as supervised learning algorithms, whereas the Autoencoder and Isolation Forest were utilized as unsupervised methods to detect anomalies. The data preprocessing pipeline includes MinMax normalization to standardize the feature scales and the removal of statistical outliers to reduce noise and improve model stability. To address the high dimensionality of the dataset, Principal Component Analysis (PCA) was applied as a dimensionality reduction technique, preserving the most significant variance components.

In addition, BorderlineSMOTE was incorporated into the training process of the supervised models to address class imbalance and improve minority class learning. This technique was particularly beneficial in enhancing the sensitivity of classifiers to fraudulent transactions while also reducing the risk of overfitting to rare fraud instances. Moreover, BorderlineSMOTE contributes to increased robustness against boundary-region vulnerabilities and potential data-poisoning attacks, thereby strengthening the overall generalization capability of supervised components.

As an initial step, we analyzed the performance of each method. Table 1 presents the precision, recall, F1-score, and accuracy of each method for both normal cases (0) and fraud cases (1).

Table 1. Performance results for XGBoost, RandomForest, Autoencoder, and IsolationForest.

Method	Precision(0)	Precision(1)	Recall(0)	Recall(1)	F1-score(0)	F1-score(1)	Accuracy
XGBoost	0.9999	0.9407	0.9999	0.9250	0.9999	0.9328	0.9998
RandomForest	0.9998	0.9459	0.9999	0.8750	0.9999	0.9091	0.9997
Autoencoder	0.9993	0.5847	0.9994	0.5750	0.9993	0.5798	0.9987
IsolationForest	0.9999	0.0192	0.9230	0.9500	0.9599	0.0376	0.9230

Among the evaluated methods, XGBoost exhibited the best overall performance. It achieved near-perfect results for the majority class (Class 0), with precision (0) = 0.9999 and recall (0) = 0.9999, and maintained a high level of performance on the minority class (Class 1, i.e., fraud cases), with precision (1) = 0.9407, recall (1) = 0.9250, and F1-score (1) = 0.9328. This balance between precision and recall is crucial for fraud detection, indicating that XGBoost not only detects most fraudulent transactions but also minimizes false alarms. The overall accuracy of 0.9998 further confirms its robustness, although in imbalanced datasets accuracy alone is not a sufficient indicator. In conclusion, XGBoost is a top-performing supervised method that effectively manages both false positives and false negatives.

Moreover, Random Forest also demonstrates strong performance in the majority class, similar to XGBoost, with Precision(0) = 0.9998 and Recall(0) = 0.9999. However, it performed slightly lower on the minority class, with recall (1) = 0.8750 and F1-score (1) = 0.9091. This suggests that while Random Forest is highly effective, it may miss a small number of fraud cases compared to XGBoost. Nevertheless, its accuracy of 0.9997 confirms its high reliability. In conclusion, Random Forest is an effective and reliable ensemble method, but slightly less optimal than XGBoost for fraud detection.

In contrast, the Autoencoder, an unsupervised learning method trained on normal data (Class 0), performed exceptionally well on the majority class, with precision (0) = 0.9993 and recall (0) = 0.9994. However, its fraud detection performance was significantly lower, with precision (1) = 0.5847, recall (1) = 0.5750, and F1-score (1) = 0.5798. Although it still detects some anomalies, the model generates a large number of false positives and fails to detect many frauds. In conclusion, the autoencoder is moderately effective as a baseline anomaly detector but, used in isolation, lacks the precision and recall needed for minority class identification.
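An autoencoder trained on normal data flags a transaction when its reconstruction error exceeds a threshold. The sketch below shows only that decision step, assuming the reconstructions are already available from some trained model; the percentile rule, the 95th-percentile choice, and all values are hypothetical and are not reported in this study.

```python
def reconstruction_error(original, reconstructed):
    """Mean squared error between an input vector and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def percentile(values, q):
    """Linear-interpolation percentile, with q in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty list is undefined")
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def flag_anomalies(errors, threshold):
    """Return 1 for errors above the threshold (suspected fraud), else 0."""
    return [1 if e > threshold else 0 for e in errors]


# Hypothetical reconstruction errors measured on legitimate validation data.
normal_errors = [0.01, 0.02, 0.015, 0.03, 0.025]
threshold = percentile(normal_errors, 95)
flags = flag_anomalies([0.02, 0.5], threshold)
print(threshold, flags)
```

Raising the percentile lowers the false-positive rate at the cost of recall, which is the sensitivity to thresholds that makes a standalone autoencoder hard to tune.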

The Isolation Forest produced poor results for fraud detection, with precision (1) = 0.0192 and F1-score (1) = 0.0376, despite a relatively high recall (1) = 0.9500. This suggests that while it flags nearly all frauds (high recall), it generates an extremely high number of false positives (very low precision), making it impractical for real-world fraud detection, where every alert carries a cost. The overall accuracy of 0.9230 is misleadingly high, inflated by the overwhelming presence of normal transactions. In conclusion, the Isolation Forest method is overly sensitive and lacks practical usefulness for fraud detection in imbalanced datasets.

In high-stakes domains, such as credit card fraud detection, the cost of false positives (customer complaints) and false negatives (missed fraud) must be minimized. Among the models tested, XGBoost provided the best trade-off between fraud detection and noise minimization. Hybrid approaches that combine the sensitivity of unsupervised methods (such as autoencoders) with the precision of supervised learners (such as XGBoost or RF) may offer better results when properly tuned.

Hence, in this study, we tested a hybrid model that combines these four methods (XGBoost, RandomForest, Autoencoder, and IsolationForest). A weighting scheme assigns a different importance level (weight) to the output of each model when their anomaly scores are combined into a single decision score.
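A weighted combination of this kind can be sketched as follows. The weights, the decision threshold, the per-batch min-max normalisation, and the toy scores are all assumptions made for illustration; the values actually used in the study are not stated in this section.

```python
def normalise(scores):
    """Rescale one model's scores to [0, 1] so that models are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def weighted_score(model_scores, weights):
    """Combine per-model anomaly scores into one decision score per transaction."""
    total_weight = sum(weights[name] for name in model_scores)
    n = len(next(iter(model_scores.values())))
    combined = [0.0] * n
    for name, scores in model_scores.items():
        share = weights[name] / total_weight
        for i, s in enumerate(normalise(scores)):
            combined[i] += share * s
    return combined


def predict(combined, threshold=0.5):
    """Flag a transaction as fraud (1) when its combined score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in combined]


# Hypothetical scores for three transactions (higher = more suspicious).
model_scores = {
    "xgboost": [0.05, 0.90, 0.20],
    "random_forest": [0.10, 0.80, 0.30],
    "autoencoder": [0.01, 0.60, 0.02],
    "isolation_forest": [0.20, 0.70, 0.60],
}
# Hypothetical weights favouring the supervised models.
weights = {"xgboost": 0.4, "random_forest": 0.3,
           "autoencoder": 0.2, "isolation_forest": 0.1}

combined = weighted_score(model_scores, weights)
predictions = predict(combined)
print(combined, predictions)
```

In this toy case the third transaction looks suspicious to the Isolation Forest alone, but its low weight keeps the combined score below the threshold, which illustrates how weighting can suppress the false alarms of an over-sensitive detector.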

Table 2 presents the final performance results after combining the methods and applying the weights. We named the model XRAI, after the first letter of each method (XGBoost, RandomForest, Autoencoder, and IsolationForest).

Table 2. Performance comparison between XRAI and other models.

Method	Precision(0)	Precision(1)	Recall(0)	Recall(1)	F1-score(0)	F1-score(1)	Accuracy
XRAI	0.9999	0.9569	0.9999	0.9250	0.9999	0.9407	0.9998
XGBoost	0.9999	0.9407	0.9999	0.9250	0.9999	0.9328	0.9998
RandomForest	0.9998	0.9459	0.9999	0.8750	0.9999	0.9091	0.9997
Autoencoder	0.9993	0.5847	0.9994	0.5750	0.9993	0.5798	0.9987
IsolationForest	0.9999	0.0192	0.9230	0.9500	0.9599	0.0376	0.9230

XRAI is the newly proposed model; its name comes from the first letter of each selected method (XGBoost, RandomForest, Autoencoder, and IsolationForest).

The hybrid XRAI model, which integrates the strengths of XGBoost, Random Forest, Autoencoder, and Isolation Forest using a weighted score, demonstrates outstanding anomaly detection capability. It effectively combines supervised and unsupervised methods to balance precision, recall, and generalization, which are crucial in high-stakes fraud detection scenarios.

Performance on the majority class (Normal - Class 0)

• Precision (0) = 0.9999 and recall (0) = 0.9999 indicate near-perfect classification of legitimate transactions. This means that the model is extremely reliable for minimizing false positives, which is critical for avoiding the disruption of normal customer activity.

• The F1-score (0) = 0.9999 confirms that there is no trade-off between precision and recall for normal transactions.

Performance on the minority class (Fraud - Class 1)

• Precision (1) = 0.9569 indicates that when the model flags a transaction as fraudulent, it is correct approximately 96% of the time, which is vital to avoid wasting resources on false alarms.

• Recall (1) = 0.9250 shows that the model can capture over 92% of all fraudulent transactions, which is an impressive detection rate given the class imbalance and subtlety of the fraud patterns.

• The F1-score (1) = 0.9407 demonstrates a strong harmonic balance between precision and recall, making the model highly effective for real-world deployment.

Figure 1 illustrates the Receiver Operating Characteristic (ROC) curve for the proposed hybrid anomaly detection model, XRAI (XGBoost, Random Forest, Autoencoder, Isolation Forest). The ROC curve plots the True Positive Rate (recall) against the False Positive Rate (1 - Specificity) across a range of classification thresholds.

Figure 1. Receiver Operating Characteristic (ROC) curve for the proposed model XRAI.

(XRAI: First letters of XGBoost, Random Forest, Autoencoder, Isolation Forest).

The curve shows a steep rise toward the upper-left corner of the plot, which is indicative of a high-performing classifier. The area under the ROC curve (AUC) is 0.9885, suggesting that the model had excellent discriminative capability. An AUC value closer to 1 indicates that the classifier is highly capable of distinguishing between the positive class (fraudulent transactions) and negative class (legitimate transactions).

In summary, the ROC curve and its corresponding AUC of 0.9885 provide strong empirical evidence of XRAI’s ability to effectively separate fraud from non-fraud, even under class imbalance conditions, a critical requirement for robust fraud-detection systems in the financial domain.

The proposed XRAI model, an ensemble combining XGBoost, Random Forest, Autoencoder, and Isolation Forest, achieved a Matthews Correlation Coefficient (MCC) of 94.07%, indicating a strong and balanced predictive performance, particularly in the context of imbalanced classification tasks such as credit card fraud detection.
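As a back-of-the-envelope reading of this value (a consistency check under the standard MCC definition, not the authors' own calculation): when true negatives vastly outnumber the other cells of the confusion matrix, the MCC approaches the geometric mean of precision and recall.

```latex
\mathrm{MCC}
  = \frac{TP \cdot TN - FP \cdot FN}
         {\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
  \;\approx\; \frac{TP}{\sqrt{(TP+FP)(TP+FN)}}
  = \sqrt{\text{Precision} \times \text{Recall}}
  \qquad (TN \gg TP,\ FP,\ FN)

\sqrt{0.9569 \times 0.9250} = \sqrt{0.8851} \approx 0.9408
```

This approximation is close to the MCC of 0.9407 listed for XRAI in Table 3, and it explains why the MCC and the F1-score nearly coincide here: with similar precision and recall, their geometric and harmonic means are almost equal.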

The XRAI model demonstrates a highly optimized hybrid ensemble for credit card fraud detection. It achieves excellent detection of rare fraudulent cases, while maintaining ultralow false-positive rates. The combination of supervised precision and unsupervised anomaly sensitivity is managed through a weighted mechanism that positions XRAI as a practically deployable solution in real-time financial anomaly detection systems.

Comparison to other similar studies

To contextualize the performance of the proposed hybrid anomaly detection framework, a comparative analysis was conducted with recent studies on credit card fraud detection that utilized similar datasets and evaluation metrics. The objective of this comparison is to demonstrate the relative effectiveness of the proposed model in terms of precision, recall, F1-score, and MCC.

Several studies have explored both the single-model and hybrid approaches using the Kaggle credit card fraud dataset. These models include supervised methods such as Logistic Regression, Random Forest, and XGBoost as well as unsupervised techniques such as Isolation Forest and Autoencoder-based anomaly detectors. In more recent works, hybrid models combining deep learning and ensemble techniques have been proposed to address the limitations of detection accuracy and generalizability.

Table 3 summarizes a selection of comparable studies, outlining the key models used and their reported results. The evaluation metrics used in each study are also included to enable a standardized comparison. Where applicable, the performance of the proposed hybrid model is highlighted to illustrate the improvements over the existing approaches.

Table 3. Comparative performance of proposed model vs. existing studies.

Method	Accuracy	Precision	Recall (TPR)	F1-score	MCC	TNR
Our Proposed Method (XRAI)	0.9998	0.9569	0.9250	0.9407	0.9407	0.9999
Ding et al. (2024) ⁸⁰ - AE + LightGBM (AEELG)	0.921	0.8875	0.3451	0.4722	0.4739
Du et al. (2024) ⁸¹ - AE-XGB-SMOTE-CGAN	0.9993		0.7839		0.8845	0.9997
Alshameri & Xia (2024) ⁸² – VAE		0.93	0.92	0.92
Wu & Wang (2022) ⁸³ - Autoencoder + Adversarial Net	0.9061	0.9216	0.8878	0.9044	0.8128
Lok et al. (2022) ²³ - Hybrid Kmeans -KNN		0.9579	0.7215	0.8231
Ishak et al. (2022) ⁸⁴ - Enhanced Stacking Classifier System	0.9837		0.8841
Benchaji et al. (2021) ⁸⁵ - Attention + LSTM	0.9672	0.9885	0.9191

As shown in Table 3, the proposed hybrid model achieved superior performance across multiple metrics, attaining the highest accuracy (0.9998), the highest recall (0.9250), the highest F1-score (0.9407), and the highest MCC (0.9407) among the compared studies, together with a precision of 0.9569 that ranks in the top three. These results reflect significant advancements over earlier models, particularly in balancing the trade-off between sensitivity and specificity.

This comparison substantiates the effectiveness of the proposed framework and supports its relevance as a practical, high-performance solution for financial fraud detection.

Real-world applications of the model in financial fraud detection

The findings of this study have significant implications for real-world financial fraud detection, particularly in environments where data are imbalanced, adversarial, and evolving. The proposed hybrid model, XRAI, demonstrated exceptional accuracy and robustness in detecting anomalies in the widely used creditcard.csv dataset. By leveraging the strengths of XGBoost, Random Forest, Autoencoder, and Isolation Forest through a weighted scoring mechanism, XRAI offers a holistic and practical approach for identifying fraudulent financial transactions in real time.

One of the most critical applications of this model is early detection of credit card fraud. Financial institutions are facing increasing threats from sophisticated fraud schemes that are often hidden within massive volumes of transactional data. Traditional models that rely solely on supervised learning struggle with previously unseen and rare types of fraud. By incorporating unsupervised models, such as autoencoders and isolation forests, XRAI can detect previously unclassified anomalies, enabling systems to capture zero-day fraud attacks that evade conventional classifiers.

In addition to fraud detection, this hybrid approach can be adapted for anti-money laundering (AML) systems, insurance fraud detection, and transaction monitoring in e-commerce. Given the adaptability of the model to high-dimensional and noisy data, it can also be used in environments beyond banking, such as healthcare claim validation or cyber intrusion detection, where anomalous patterns are often rare and context dependent.

The practical benefits of this hybrid system extend beyond academic experimentation. It offers a deployable, scalable, and intelligent solution for industries facing complex fraud challenges. As financial crime continues to grow in scale and complexity, systems such as XRAI provide a promising blueprint for building more secure, proactive, and trustworthy fraud detection frameworks.

Challenges in implementation and model limitations

Although the XRAI hybrid model presents a strong case for fraud-detection performance, several limitations emerged during the development and evaluation that must be addressed to fully understand its practical applicability. These limitations can be grouped into three primary categories: data, models, and operational constraints.

First, the study relies on the creditcard.csv dataset, which has certain constraints despite its popularity. It is highly imbalanced, anonymized, and preprocessed and does not fully reflect the diversity and noise found in real-world financial data. Features such as merchant category, transaction geolocation, and time-series behavior are not present in this dataset. This limits the generalizability of the model to broader financial environments. Moreover, the dataset lacks adversarial fraud samples that mimic legitimate behavior, which are increasingly common in real financial systems.

Second, the complexity of hybrid architecture introduces challenges in terms of interpretability, maintenance, and scalability. Although the ensemble combines multiple strengths, it also has its weaknesses. For example, autoencoders require careful tuning and are sensitive to reconstruction thresholds, whereas Isolation Forests tend to produce high false-positive rates unless precisely calibrated. Managing the balance of weights across all models adds an additional layer of complexity, particularly when adapting a system to new datasets or changing fraud patterns.

Another limitation is the requirement for labelled data for supervised components, such as XGBoost and Random Forest. Labeling fraud in real-world data is often delayed or incomplete, which can limit the speed of retraining and adaptation. In rapidly changing environments, supervised models become stale unless mechanisms are in place for online or incremental learning.

In summary, although XRAI provides strong fraud-detection performance in experimental settings, its real-world deployment requires careful consideration of data diversity, model manageability, latency, and compliance. Addressing these limitations can further enhance its reliability and adoption.

Conclusion and future work

This study introduced a novel hybrid model, XRAI, designed to enhance the performance and robustness of anomaly detection in credit card fraud-detection systems. By strategically integrating supervised learning algorithms such as XGBoost and Random Forest with unsupervised techniques such as autoencoders and isolation forests, the model effectively overcomes the limitations of single-classifier approaches in highly unbalanced and adversarial environments.

The XRAI model demonstrated strong predictive power across a range of performance metrics, achieving an accuracy of 99.98%, precision of 95.69%, recall of 92.50%, and F1-score of 94.07%. The Matthews Correlation Coefficient (MCC) of 94.07% and AUC of 0.9885 further indicate a high discriminative ability and balanced performance between the fraud and non-fraud classes. These results highlight the model’s potential for real-time deployment in financial institutions aimed at reducing operational risks and minimizing false alarms.

Despite these achievements, the study also acknowledged key limitations, including reliance on a single publicly available dataset (creditcard.csv), the computational cost of the hybrid architecture, and interpretability challenges. These limitations pave the way for further research in this area.

Building on the current findings, future research on the XRAI model can pursue several promising directions to enhance its applicability and robustness in real-world settings. A critical improvement involves incorporating temporal and contextual features, as fraudulent behaviors often manifest as sequential patterns over time. Leveraging techniques such as LSTM-based Autoencoders or Transformer-based architectures can enhance the detection of complex and evolving fraud strategies. Moreover, integrating contextual data, such as customer profiles, merchant categories, and geographic transaction information, can further improve classification accuracy and reduce false positives.

Future studies should focus on adaptive ensemble strategies, explainable AI techniques, and robustness against adversarial attacks. Testing the model across diverse datasets and domains is essential to validate its generalizability and scalability.

In conclusion, the XRAI model presents a scalable, intelligent, and highly accurate solution for credit-card fraud detection. With further refinements in temporal modelling, explainability, and robustness, hybrid models such as XRAI hold significant promise for building trustworthy and resilient fraud detection systems tailored to the ever-evolving landscape of financial crime.

Ethical considerations

Not applicable. This study does not involve human or animal subjects.

Contributions

The contributions of each author are described according to the CRediT (Contributor Roles Taxonomy) system:

• Mohammad Shanaa: Conceptualization; Methodology; Data Curation; Formal Analysis; Software; Validation; Visualization; Writing – Original Draft; Writing – Review & Editing; Project Administration. Mohammad Shanaa led the design and execution of the research, conducted the data analysis and model development, and prepared the initial and revised versions of the manuscript.

• Sherief Abdallah: Supervision; Conceptualization; Writing – Review & Editing. Sherief Abdallah supervised the research process, contributed to refining the methodology and framing the research direction, and provided critical revisions to the manuscript.

Data availability

Underlying data

This project uses the creditcard.csv dataset, which is available on the Kaggle website under the Database Contents License (DbCL) v1.0.

Users can download the dataset using the following steps:

• Visit: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud

• Click the Download option, and select the Download dataset option.
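Once downloaded, the class counts can be checked with the standard library alone. This is a hedged sketch: the helper name is hypothetical, the in-memory sample rows are made up for illustration, and the commented file path assumes that creditcard.csv sits in the working directory.

```python
import csv
import io
from collections import Counter


def count_classes(csv_file):
    """Count rows per value of the Class column (1 = fraud, 0 = legitimate)."""
    reader = csv.DictReader(csv_file)
    return Counter(row["Class"] for row in reader)


# Tiny made-up stand-in with a reduced header, for illustration only.
sample = io.StringIO(
    '"Time","V1","Amount","Class"\n'
    '0,-1.0,10.0,"0"\n'
    '1,0.5,20.0,"0"\n'
    '2,-2.0,0.0,"1"\n'
)
counts = count_classes(sample)
print(counts)

# With the real file (path is an assumption):
# with open("creditcard.csv", newline="") as f:
#     print(count_classes(f))
```

On the real file, the two counts should add up to the 284,807 transactions described in the dataset section.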

Extended data

The source code can be accessed from: https://github.com/mohshanaa/XRAI.git

Archived code at time of publication ⁸⁶: https://doi.org/10.5281/zenodo.15626193

License: Creative Commons Attribution 4.0 International

Acknowledgements

This manuscript utilized OpenAI’s ChatGPT (GPT-4) for drafting, linguistic refinement, and grammatical editing. Additionally, scite.ai was employed to identify and evaluate pertinent academic sources. The final content is the result of the authors’ original work and critical analysis.

References

1. Setiawan, Tjahjono, Firmansyah: Fraud Detection in Credit Card Transactions Using HDBSCAN, UMAP and SMOTE Methods. International Journal of Science, Technology & Management. 2023;4:1333–1339. 10.46729/ijstm.v4i5.929

2. Noviandy, Idroes, Maulana: Credit Card Fraud Detection for Contemporary Financial Management Using XGBoost-Driven Machine Learning and Data Augmentation Techniques. Indatu J Manag Account. 2023;1:29–35. 10.60084/ijma.v1i1.78

3. Shimu Khatun, Rabiul Alam, Taslim: Handling Class Imbalance in Credit Card Fraud Using Various Sampling Techniques. Am J Multidis Res Innov. 2022;1:160–168. 10.54536/ajmri.v1i4.633

4. Naidoo, Marivate: Unsupervised Anomaly Detection of Healthcare Providers Using Generative Adversarial Networks. Hattingh, Matthee, Smuts, editors. Responsible Design, Implementation and Use of Information and Communication Technology. Cham: Springer International Publishing;2020; vol.12066: pp.419–430. 10.1007/978-3-030-44999-5_35

5. Peng, Wang: Unbalanced Data Processing and Machine Learning in Credit Card Fraud Detection. 2022. 10.21203/rs.3.rs-2004320/v1

6. Gowda: Credit Card Fraud Detection using Supervised and Unsupervised Learning. Computer Science & Information Technology (CS & IT), AIRCC Publishing Corporation;2021;93–98. 10.5121/csit.2021.111107

7. Ganji, Chaparala, Sajja: Shuffled shepherd political optimization-based deep learning method for credit card fraud detection. Concurr. Comput. 2023;35:e7666. 10.1002/cpe.7666

8. Jain, Arora, Mehra: Anomaly Detection Algorithms in Financial Data. IJEAT. 2021;10:76–78. 10.35940/ijeat.E2598.0610521

9. Aslam: Advancing Credit Card Fraud Detection: A Review of Machine Learning Algorithms and the Power of Light Gradient Boosting. AJCST. 2024. 10.11648/ajcst.20240701.12

10. Pitsane, Mogale, Rensburg JJV: Improving Accuracy of Credit Card Fraud Detection Using Supervised Machine Learning Models and Dimension Reduction. ICONIC. 2022;2022:290–301. 10.59200/ICONIC.2022.032

11. Saad, Nadher, Hameed: Credit Card Fraud Detection Challenges and Solutions: A Review. Iraqi J. Sci. 2024;2287–2303. 10.24996/ijs.2024.65.4.42

12. Zhang Y-F, H-L, Lin H-F: The Optimized Anomaly Detection Models Based on an Approach of Dealing with Imbalanced Dataset for Credit Card Fraud Detection. Mob. Inf. Syst. 2022;2022:1–10. 10.1155/2022/8027903

13. Zheng, Yang, Xin: The Credit Card Anti-fraud Detection Model in the Context of Dynamic Integration Selection Algorithm. FCIS. 2024;6:119–122. 10.54097/a5jafgdv

14. Maheshwari, Osman, Aziz: A Hybrid Approach Adopted for Credit Card Fraud Detection Based on Deep Neural Networks and Attention Mechanism. ARASET. 2023;32:315–331. 10.37934/araset.32.1.315331

15. Berhane, Melese, Walelign: A Hybrid Convolutional Neural Network and Support Vector Machine-Based Credit Card Fraud Detection Model. Math. Probl. Eng. 2023;2023:8134627. 10.1155/2023/8134627

16. Jiang, Dong, Wang: Credit Card Fraud Detection Based on Unsupervised Attentional Anomaly Detection Network. Systems. 2023;11:305. 10.3390/systems11060305

17. Alharbi, Alshammari, Okon: A Novel text2IMG Mechanism of Credit Card Fraud Detection: A Deep Learning Approach. Electronics. 2022;11:756. 10.3390/electronics11050756

18. Hajek, Abedin, Sivarajah: Fraud Detection in Mobile Payment Systems using an XGBoost-based Framework. Inf. Syst. Front. 2023;25:1985–2003. 36258679 10.1007/s10796-022-10346-6 PMC9560719

19. Liu: Enhancing Credit Card Fraud Detection on Imbalanced Datasets. HBEM. 2023;21:765–773. 10.54097/hbem.v21i.14759

20. Airlangga: Evaluating the Efficacy of Machine Learning Models in Credit Card Fraud Detection. CNAHPC. 2024;6:829–837. 10.47709/cnahpc.v6i2.3814

21. Murat, Tursunmetova, Nadirov: MULTI-CLASSIFIERS SYSTEM FOR CREDIT CARD FRAUD DETECTION. BTOUPhMath. 2023;33–47. 10.48081/NMPU3955

22. Sujitha, Vanitha: Enhanced Technique for Credit Card Extortion Detection Using Extreme Gradient Boosting Algorithm. MEJAST. 2023;06:35–45. 10.46431/MEJAST.2023.6205

23. Lok, Abdul Hameed, Ehsan Rana: Hybrid machine learning approach for anomaly detection. IJEECS. 2022;27:1016. 10.11591/ijeecs.v27.i2.pp1016-1024

24. Debener, Heinke, Kriebel: Detecting insurance fraud using supervised and unsupervised machine learning. J. Risk Insur. 2023;90:743–768. 10.1111/jori.12427

25. Carcillo, Le Borgne Y-A, Caelen: Combining unsupervised and supervised learning in credit card fraud detection. Inf. Sci. 2021;557:317–331. 10.1016/j.ins.2019.05.042

26. Nassif, Talib, Nasir: Machine Learning for Anomaly Detection: A Systematic Review. IEEE Access. 2021;9:78658–78700. 10.1109/ACCESS.2021.3083060

27. Benedek, Ciumas, Nagy: Automobile insurance fraud detection in the age of big data – a systematic and comprehensive literature review. JFRC. 2022;30:503–523. 10.1108/JFRC-11-2021-0102

28. Fraud Guard: A Comprehensive Comparative Analysis of Machine Learning Approaches to Enhance Credit Card Fraud Detection. JIEA. 2024. 10.7176/JIEA/14-2-02

29. Sulaiman, Nadher, Hameed: Credit Card Fraud Detection Using Improved Deep Learning Models. CMC. 2024;78:1049–1069. 10.32604/cmc.2023.046051

30. Lai: Artificial Intelligence Techniques for Fraud Detection. 2023. 10.20944/preprints202312.1115.v1

31. Adelakun, Onwubuariri, Adeniran: Enhancing fraud detection in accounting through AI: Techniques and case studies. Financ. Account Res. J. 2024;6:978–999. 10.51594/farj.v6i6.1232

Esmaeili

Cassie

Nguyen

HPT

: Anomaly Detection for Sensor Signals Utilizing Deep Learning Autoencoder-Based Neural Networks. Bioengineering. 2023;10:405. 37106591

10.3390/bioengineering10040405

PMC10136265

Park

Adosoglou

Pardalos

: Interpreting Rate-Distortion of Variational Autoencoder and Using Model Uncertainty for Anomaly Detection. 2020. 10.48550/ARXIV.2005.01889

Fraser

Homiller

Mishra

: Challenges for Unsupervised Anomaly Detection in Particle Physics. 2021. 10.48550/ARXIV.2110.06948

Kim

Y-G

Park

T-H

: Anomaly Detection Using Autoencoder With Feature Vector Frequency Map. IEEE Access. 2021;9:73808–73817. 10.1109/ACCESS.2021.3080330

Zhu

Jiang

Liu

: Fault Detection and Diagnosis in Industrial Processes with Variational Autoencoder: A Comprehensive Study. Sensors. 2021;22:227. 35009769

10.3390/s22010227

PMC8749793

Ikeda

Ouazzane

: New Feature Engineering Framework for Deep Learning in Financial Fraud Detection. IJACSA. 2021;12. 10.14569/IJACSA.2021.0121202

Rosley, Tong G-K, K-H: Autoencoders with Reconstruction Error and Dimensionality Reduction for Credit Card Fraud Detection. Haw S-C, Sonai Muthu, editors. Proceedings of the International Conference on Computer, Information Technology and Intelligent Computing (CITIC 2022). Dordrecht: Atlantis Press International BV; 2022; pp.503–512. 10.2991/978-94-6463-094-7_40

Salekshahrezaee, Leevy, Khoshgoftaar: The effect of feature extraction and data sampling on credit card fraud detection. J. Big Data. 2023;10:6. 10.1186/s40537-023-00684-w

Lin T-H, Jiang J-R: Credit Card Fraud Detection with Autoencoder and Probabilistic Random Forest. Mathematics. 2021;9:2683. 10.3390/math9212683

Prabha, Priscilla: Probabilistic XGBoost Threshold Classification with Autoencoder for Credit Card Fraud Detection. IJRITCC. 2023;11:528–537. 10.17762/ijritcc.v11i8s.7234

Gomes, Jin, Yang: Insurance fraud detection with unsupervised deep learning. J. Risk Insur. 2021;88:591–624. 10.1111/jori.12359

Bulut, Gorgun: Unsupervised Anomaly Detection in Sequential Process Data: Insights From PIAAC Problem-Solving Tasks. Z. Psychol. 2024;232:74–94. 10.1027/2151-2604/a000558

Mohamed Elmahalwy, Mousa, Amin: New hybrid ensemble method for anomaly detection in data science. IJECE. 2023;13:3498. 10.11591/ijece.v13i3.pp3498-3508

Feng, Zhang: Optimizing the Isolation Forest Algorithm for Identifying Abnormal Behaviors of Students in Education Management Big Data. JAIT. 2023. 10.37965/jait.2023.0445

Prajesha TM, Veni S: An Efficient Outlier Detection Using Isolation Forest Based on Robust Scaling and Principal Component Analysis for the Prediction of Anxiety Disorder. IJST. 2023;16:2244–2251. 10.17485/IJST/v16i29.638

Fang: Anomalous Behavior Detection Based on the Isolation Forest Model with Multiple Perspective Business Processes. Electronics. 2022;11:3640. 10.3390/electronics11213640

Hadi, Tashi, Qureshi: A Survey on Large Language Models: Applications, Challenges, Limitations, and Practical Usage. 2023. 10.36227/techrxiv.23589741.v1

Meduri: Cybersecurity threats in banking: Unsupervised fraud detection analysis. Int. J. Sci. Res. Arch. 2024;11:915–925. 10.30574/ijsra.2024.11.2.0505

Liu, Wang, Sun: Preparation and Optimization of Mesoporous SnO₂ Quantum Dot Thin Film Gas Sensors for H₂S Detection Using XGBoost Parameter Importance Analysis. Chemosensors. 2023;11:525. 10.3390/chemosensors11100525

Shi: Modeling and Evaluation of the Permeate Flux in Forward Osmosis Process with Machine Learning. Ind. Eng. Chem. Res. 2022;61:18045–18056. 10.1021/acs.iecr.2c03064

Wang, Ding, Chen: Research on the Application of Bayesian-Optimized XGBoost in Minor Faults in Coalfields. Math. Probl. Eng. 2022;2022:1–13. 10.1155/2022/3409468

Nam, Peterson, Seo: Discovery of Depression-Associated Factors From a Nationwide Population-Based Survey: Epidemiological Study Using Machine Learning and Network Analysis. J. Med. Internet Res. 2021;23:e27344. 10.2196/27344. PMID: 34184998. PMCID: PMC8277318

Huang, Yan, Song: Combining autoencoder with clustering analysis for anomaly detection in radiotherapy plans. Quant. Imaging Med. Surg. 2023;13:2328–2338. 10.21037/qims-22-825. PMID: 37064364. PMCID: PMC10102771

Guo, Yuan, Janson: Older Pedestrian Traffic Crashes Severity Analysis Based on an Emerging Machine Learning XGBoost. Sustainability. 2021;13:926. 10.3390/su13020926

Patel, Singh, Zarbiv: Mortality Prediction Using SaO₂/FiO₂ Ratio Based on eICU Database Analysis. Crit. Care Res. Prac. 2021;2021:1–9. 10.1155/2021/6672603. PMID: 34790417. PMCID: PMC8592728

Kujawski, Lee Afanador: Predicting Measles Outbreaks in the United States: Evaluation of Machine Learning Approaches (Preprint). 2022. 10.2196/preprints.42832

Esmaeilzadeh, Salajegheh, Ziai: Abuse and Fraud Detection in Streaming Services Using Heuristic-Aware Machine Learning. 2022. 10.48550/ARXIV.2203.02124

Dong, Chen, Peng: Comparative Study on Supervised versus Semi-supervised Machine Learning for Anomaly Detection of In-vehicle CAN Network. 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC). Macau, China: IEEE; 2022; pp.2914–2919. 10.1109/ITSC55140.2022.9922235

Bakumenko, Elragal: Detecting Anomalies in Financial Data Using Machine Learning Algorithms. Systems. 2022;10:130. 10.3390/systems10050130

Aghaee, Krau, Tamer: Unsupervised Hybrid Models Integrating Deep Autoencoders and Process Controllers’ Models for Enhanced Process Monitoring and Fault Detection. Ind. Eng. Chem. Res. 2024;63:14748–14760. 10.1021/acs.iecr.4c01980

Zhang, Huangfu, Ziada: A Hybrid Fault Detection Method for Hairpin Windings Integrating Physics Model and Machine Learning. IEEE Access. 2024;12:70392–70404. 10.1109/ACCESS.2024.3402224

Giroh, Kumar, Singh: Improving the Performance of Hybrid Models Using Machine Learning and Optimization Techniques. Ijmst. 2023;10:3396–3409. 10.15379/ijmst.v10i2.3138

Albahlal: Emerging Technology-Driven Hybrid Models for Preventing and Monitoring Infectious Diseases: A Comprehensive Review and Conceptual Framework. Diagnostics. 2023;13:3047. 10.3390/diagnostics13193047. PMID: 37835793. PMCID: PMC10572974

Chiang, Down: A decision integration strategy for short-term demand forecasting and ordering for red blood cell components. 2020. 10.48550/ARXIV.2008.07486

Liao W-W, Hsieh Y-W, Lee T-H: Machine learning predicts clinically significant health related quality of life improvement after sensorimotor rehabilitation interventions in chronic stroke. Sci. Rep. 2022;12:11235. 10.1038/s41598-022-14986-1. PMID: 35787657. PMCID: PMC9253044

Ito, Yada, Wakamiya: Predictive Model for Extended-Spectrum β-Lactamase–Producing Bacterial Infections Using Natural Language Processing Technique and Open Data in Intensive Care Unit Environment: Retrospective Observational Study. JMIR Form Res. 2024;8:e54044. 10.2196/54044. PMID: 38986131. PMCID: PMC11269962

Tan, Sun: Prediction of the Growth Rate of Early-Stage Lung Adenocarcinoma by Radiomics. Front. Oncol. 2021;11:658138. 10.3389/fonc.2021.658138. PMID: 33937070. PMCID: PMC8082461

Sun, Han: Development And Validation Of Models To Predict Cesarean Delivery Among Low-Risk Nulliparous Women At Term: A Retrospective Study In China. 2020. 10.21203/rs.3.rs-44296/v1

K-C, Tau ENT, Chen N-C: Machine Learning Algorithm Predicts Mortality Risk in Intensive Care Unit for Patients with Traumatic Brain Injury. Diagnostics. 2023;13:3016. 10.3390/diagnostics13183016. PMID: 37761383. PMCID: PMC10528289

Manda, Kondapalli, Malla: Imbalanced Data Challenges and Their Resolution to Improve Fraud Detection in Credit Card Transactions. 2024. 10.21203/rs.3.rs-3962043/v1

Esenogho, Mienye, Swart: A Neural Network Ensemble With Feature Engineering for Improved Credit Card Fraud Detection. IEEE Access. 2022;10:16400–16407. 10.1109/ACCESS.2022.3148298

Sudhakar, Kaliyamurthie: A Novel Machine learning Algorithms used to Detect Credit Card Fraud Transactions. IJRITCC. 2023;11:163–168. 10.17762/ijritcc.v11i2.6141

Ileberi, Sun, Wang: Performance Evaluation of Machine Learning Methods for Credit Card Fraud Detection Using SMOTE and AdaBoost. IEEE Access. 2021;9:165286–165294. 10.1109/ACCESS.2021.3134330

Assessing the feasibility of machine learning-based modelling and prediction of credit fraud outcomes using hyperparameter tuning. ACSS. 2023;7. 10.23977/acss.2023.070212

Trisanto, Rismawati, Mulya: Effectiveness Undersampling Method and Feature Reduction in Credit Card Fraud Detection. IJIES. 2020;13:173–181. 10.22266/ijies2020.0430.17

Mohammed, A. Bazzi Y.: Implement an Intrusion Detection System Utilizing Machine Learning and Principal Component Analysis. IRJIET. 2024;08:01–07. 10.47001/IRJIET/2024.802001

Ezekiel, Alshehri, Pearlstein: IoT Anomaly Detection using Multivariate. IJITEE. 2020;9:1662–9. 10.35940/ijitee.D1323.029420

Zhu, Zhang, Gong: Enhancing Credit Card Fraud Detection: A Neural Network and SMOTE Integrated Approach. JTPES. 2024;4:23–30. 10.53469/jtpes.2024.04(02).04

Ding, Liu, Wang: An AutoEncoder enhanced light gradient boosting machine method for credit card fraud detection. PeerJ Comput. Sci. 2024;10:e2323. 10.7717/peerj-cs.2323. PMID: 39650410. PMCID: PMC11623290

Wang: A novel method for detecting credit card fraud problems. PLoS ONE. 2024;19:e0294537. 10.1371/journal.pone.0294537. PMID: 38446831. PMCID: PMC10917329

Alshameri, Xia: An Evaluation of Variational Autoencoder in Credit Card Anomaly Detection. Big Data Min. Anal. 2024;7:718–729. 10.26599/BDMA.2023.9020035

Wang: Locally Interpretable One-Class Anomaly Detection for Credit Card Fraud Detection. 2022. 10.48550/arXiv.2108.02501

Ishak, K-H, Tong G-K: Mitigating unbalanced and overlapped classes in credit card fraud data with enhanced stacking classifiers system. F1000Res. 2022;11:71. 10.12688/f1000research.73359.1

Benchaji, Douzi, El Ouahidi: Enhanced credit card fraud detection based on attention mechanism and LSTM deep model. J. Big Data. 2021;8:151. 10.1186/s40537-021-00541-8

Shanaa, Abdallah: XRAI: A Hybrid Anomaly Detection Framework for Credit Card Fraud Detection. 2025. 10.5281/ZENODO.15626193

10.5256/f1000research.183325.r432010

Reviewer response for version 1

Shamik Palit, Referee (https://orcid.org/0000-0002-2999-2408), University of Stirling RAK Campus, Ras Al Khaimah, United Arab Emirates

Competing interests: No competing interests were disclosed.

27 November 2025

This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Recommendation: approve with reservations

Exact hyperparameter configuration for each model.

· XGBoost: number of trees, max_depth, learning_rate, subsampling, regularization parameters, etc.

· Random Forest: n_estimators, max_features, max_depth, class_weight (if any).

· Autoencoder: architecture (layers, hidden sizes, activation functions), optimizer, learning rate, number of epochs, batch size, reconstruction threshold selection.

· Isolation Forest: n_estimators, max_samples, contamination, max_features.

Precise description of train/validation/test splitting.

· Is there a single held-out test set?

· Was cross-validation used? If yes, k-fold or repeated stratified?

· How is random seeding handled?

Details of how BorderlineSMOTE is applied.

· Confirm explicitly that SMOTE is applied only on the training folds and not on the test set (to avoid data leakage).

· Clarify whether SMOTE is applied before or inside cross-validation loops.

Mathematical or algorithmic description of the hybrid XRAI weighting scheme.

· How are the outputs of XGBoost, RF, AE, IF combined? Simple average? Weighted sum? Threshold on each then voting?

· How are the weights chosen? Manually tuned? Based on validation metrics? Grid search?

Conceptually and experimentally, the paper is strong, well-motivated, and technically sound. The main remaining gaps are in the detail level of the methods, particularly around hyperparameters, data splitting, resampling protocol, and ensemble weighting. Once these are clarified, the work will be scientifically solid and reproducible, and in my view, suitable for publication in an applied machine learning or fintech/security venue.

Is the work clearly and accurately presented and does it cite the current literature?

Yes

If applicable, is the statistical analysis and its interpretation appropriate?

Yes

Are all the source data underlying the results available to ensure full reproducibility?

Partly

Is the study design appropriate and is the work technically sound?

Yes

Are the conclusions drawn adequately supported by the results?

Yes

Are sufficient details of methods and analysis provided to allow replication by others?

Partly

Reviewer Expertise:

Machine Learning

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Mohammad Shanaa, Computer Science, The British University in Dubai, Dubai, United Arab Emirates

Competing interests: No competing interests were disclosed.

7 December 2025

1) Exact hyperparameter configuration for each model

Response:

The full and exact configurations are available in the public GitHub repository (https://github.com/mohshanaa/XRAI.git). Direct link: https://github.com/mohshanaa/XRAI/blob/main/Configurations

2) Precise description of train/validation/test splitting

Response:

The study uses a single stratified 70/30 train–test split of the creditcard.csv dataset.

70% of the data is used for model development (training + internal validation as needed).

30% is held out once as the final test set.

Stratification ensures the fraud ratio is preserved in both subsets.
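As an illustration only, a stratified 70/30 hold-out split with a fixed seed can be sketched in plain Python. The helper name `stratified_split` and the synthetic label vector are assumptions made for this example (only the class counts are taken from the dataset description); the repository code may well rely on a library routine instead.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.30, seed=42):
    """Return (train_idx, test_idx), splitting every class at the same ratio."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])   # held out once, never reused
        train_idx.extend(indices[n_test:])  # used for all model development
    return sorted(train_idx), sorted(test_idx)


# Synthetic labels with the class counts reported for creditcard.csv:
# 284,807 transactions, 492 of them fraudulent (label 1).
labels = [1] * 492 + [0] * (284_807 - 492)
train_idx, test_idx = stratified_split(labels)

train_fraud = sum(labels[i] for i in train_idx)
test_fraud = sum(labels[i] for i in test_idx)
print("train size:", len(train_idx), "fraud:", train_fraud)
print("test size:", len(test_idx), "fraud:", test_fraud)
```

Because each class is shuffled and cut separately, the fraud ratio in the two subsets can differ only by rounding.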

3) Is there a single held-out test set?

Response:

Yes. A single stratified 30% test set was held out and never used during training, resampling, weight tuning, or threshold selection. All reported performance metrics are computed on this untouched 30% test portion.

4) Was cross-validation used?

Response:

No. The experiments used a single 70/30 stratified hold-out split only.

No k-fold, repeated k-fold, or stratified cross-validation was used.

Any internal tuning was performed on the training portion of the split.

5) How is random seeding handled?

Response:

A consistent random_state = 42 was applied to:

the stratified 70/30 split,

BorderlineSMOTE,

stochastic models (Random Forest, XGBoost, Isolation Forest),

and Autoencoder initialization (where applicable).

This ensures full reproducibility. The seed usage is visible in the GitHub code.

6) Details of how BorderlineSMOTE is applied

Response:

The procedure is:

Perform a stratified 70/30 split.

Apply BorderlineSMOTE only to the training 70% subset (for supervised models).

Leave the 30% test set untouched.

Train unsupervised models (Autoencoder, Isolation Forest) on non-SMOTE data, preserving natural anomaly structure.

This avoids leakage and preserves anomaly boundaries for unsupervised models.
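The ordering of these steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository code: the toy data, the helper `interpolate_minority`, and the 1:1 balancing target are hypothetical, and the helper is a deliberately simplified interpolation oversampler standing in for BorderlineSMOTE (it performs no nearest-neighbour search and no borderline filtering). Whether the unsupervised models see all training rows or only legitimate ones is also an assumption here.

```python
import random


def interpolate_minority(rows, n_new, rng):
    """Create n_new synthetic rows by interpolating between random minority pairs.

    Simplified stand-in for BorderlineSMOTE, used only to show where
    oversampling sits in the pipeline.
    """
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(rows, 2)
        t = rng.random()
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic


rng = random.Random(42)

# Toy data: 1,000 legitimate rows and 20 fraud rows, two features each.
legit = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(1000)]
fraud = [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(20)]

# Step 1: stratified 70/30 split (each class is cut at the same ratio).
rng.shuffle(legit)
rng.shuffle(fraud)
legit_train, legit_test = legit[:700], legit[700:]
fraud_train, fraud_test = fraud[:14], fraud[14:]

# Step 2: oversample the minority class in the training portion only.
n_new = len(legit_train) - len(fraud_train)  # assumed 1:1 balancing target
synthetic = interpolate_minority(fraud_train, n_new, rng)
supervised_train_X = legit_train + fraud_train + synthetic
supervised_train_y = [0] * len(legit_train) + [1] * (len(fraud_train) + n_new)

# Step 3: the test set is left untouched and keeps its natural class ratio.
test_X = legit_test + fraud_test
test_y = [0] * len(legit_test) + [1] * len(fraud_test)

# Step 4: unsupervised models are fitted on the original training rows,
# without any synthetic samples.
unsupervised_train_X = legit_train + fraud_train
```

Since the synthetic rows are generated after the split and only from `fraud_train`, no information from the test rows can reach the oversampler.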

7) Confirm SMOTE is applied only on the training set

Response:

We confirm that BorderlineSMOTE is applied exclusively to the training 70% subset.

The 30% test set is never oversampled or modified.

No SMOTE-generated samples ever enter evaluation.

8) Clarify whether SMOTE is applied before or inside cross-validation loops

Response:

Since cross-validation was not used, BorderlineSMOTE was not applied inside any CV loop.

It is applied only once, after the 70/30 split, and only on the training set.

9) How are model outputs combined? Weighted sum? Voting?

Response:

XRAI uses a weighted sum of normalized scores, not voting or averaging.

XGBoost & RF: use predicted fraud probability

Autoencoder: uses normalized reconstruction error

Isolation Forest: uses normalized anomaly score

All four are scaled to [0, 1], weighted, summed, and thresholded.
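A minimal sketch of this scoring rule is given below. The weights, the threshold, and the example scores are hypothetical placeholders (the values actually used are documented in the repository), and min-max scaling is assumed as the normalization for the two unsupervised scores. The sketch also assumes every score is oriented so that higher means more suspicious; some libraries return anomaly scores with the opposite sign, which would need flipping first.

```python
def min_max(scores):
    """Rescale a list of scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_scores(model_scores, weights):
    """Weighted sum of per-model scores that are already in [0, 1]."""
    total = sum(weights.values())
    n = len(next(iter(model_scores.values())))
    return [
        sum(weights[name] * model_scores[name][i] for name in weights) / total
        for i in range(n)
    ]


# Hypothetical outputs for four transactions.
model_scores = {
    "xgboost": [0.02, 0.10, 0.95, 0.60],                     # fraud probability
    "random_forest": [0.05, 0.20, 0.90, 0.40],               # fraud probability
    "autoencoder": min_max([0.8, 1.2, 9.5, 3.0]),            # reconstruction error
    "isolation_forest": min_max([0.40, 0.45, 0.75, 0.62]),   # anomaly score
}

# Hypothetical weights: supervised models weighted higher than unsupervised ones.
weights = {
    "xgboost": 0.4,
    "random_forest": 0.3,
    "autoencoder": 0.2,
    "isolation_forest": 0.1,
}
threshold = 0.5  # hypothetical decision threshold

combined = hybrid_scores(model_scores, weights)
flags = [score >= threshold for score in combined]
print([round(score, 3) for score in combined], flags)
```

Dividing by the sum of the weights keeps the combined score in [0, 1] even if the weights do not add up to one, so a single threshold remains interpretable.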

10) How were the weights chosen?

Response:

Weights were chosen empirically based on the behavior of the individual models and the overall ensemble performance. Higher weights were assigned to XGBoost and Random Forest, which exhibited stronger precision and stability, while lower weights were assigned to the Autoencoder and Isolation Forest to preserve anomaly sensitivity without allowing noisy alerts to dominate the ensemble. The final chosen weights and their implementation are documented in the publicly available GitHub repository.

10.5256/f1000research.183325.r398688

Reviewer response for version 1

Gian Marco Paldino, Referee (https://orcid.org/0000-0002-8680-9403), Université Libre de Bruxelles, Brussels, Belgium

Competing interests: No competing interests were disclosed.

26 August 2025

Recommendation: reject

The manuscript presents a hybrid model, named XRAI, for credit card fraud detection. The model combines two supervised (XGBoost, Random Forest) and two unsupervised (Autoencoder, Isolation Forest) algorithms, reporting superior performance on the public Kaggle creditcard.csv dataset. The problem of fraud detection is of significant practical and academic importance, and the authors' effort to develop a high-performance solution is commendable.

However, the manuscript in its current form suffers from several major methodological and conceptual issues that must be addressed before it can be considered for indexing. The core concerns relate to the justification of the preprocessing pipeline, the rationale for the ensemble's architecture, and the practical significance of the model's contribution.

Major Concerns

Fundamental Flaw in Preprocessing Methodology: A significant methodological concern is the application of Principal Component Analysis (PCA) for dimensionality reduction. The creditcard.csv dataset's primary features (V1-V28) are already the result of a PCA transformation, a fact the authors acknowledge. Applying PCA again to these components is conceptually flawed, as it assumes the components are correlated in a way that allows for further linear dimensionality reduction, which is not guaranteed and highly unusual. This step demonstrates a misunderstanding of the dataset's nature and potentially distorts the data's inherent structure. The authors must either remove this step or provide a strong theoretical justification for this unconventional approach.

Unjustified Ensemble Architecture and Model Selection: The rationale for the specific composition of the XRAI model is unclear and seems arbitrary.

Inclusion of a Poorly Performing Model: The authors' own results (Table 1) show that the Isolation Forest model yields extremely low precision (0.0192) and F1-score (0.0376) for the fraud class, which the authors rightly identify as "impractical" and lacking "practical usefulness." Its inclusion in the final weighted ensemble is counterintuitive and requires justification. A clear explanation is needed of why a model known to be highly noisy and to generate excessive false positives contributes positively to the final ensemble.

Redundancy of Supervised Models: The framework includes both XGBoost and Random Forest, which are methodologically similar tree-based ensemble methods. The manuscript does not explain the benefit of using both in the final model rather than selecting the single best-performing supervised algorithm. This adds unnecessary complexity without a clear, stated advantage.

Marginal Performance Gain and Novelty of Contribution:

The manuscript claims to propose a "novel" framework. While the specific four-model combination might be new, the general concept of creating a hybrid model by combining supervised and unsupervised learning for fraud detection is well-established in the literature, as cited by the authors themselves (e.g., Carcillo et al., 2021).

Furthermore, the performance gain of the complex XRAI model over its best individual component (XGBoost) is marginal. The F1-score for the fraud class improves from 0.9328 to 0.9407, a gain of less than one percentage point, while the recall remains identical. The authors should clarify the practical significance of this small improvement in light of the model's increased complexity, maintenance burden, and computational overhead. A simpler ensemble, perhaps combining only XGBoost and the Autoencoder, should be tested and discussed as a more parsimonious alternative.

Insufficient Motivation for Imbalance Handling Technique: The choice of BorderlineSMOTE for handling class imbalance is stated but not motivated. The authors should briefly explain why this specific technique was selected over other common methods (e.g., ADASYN, SMOTE-ENN, random over/under-sampling) and how it is particularly suited for this dataset and model architecture.

Generalizability: The framework is not specific to fraud detection and is tested on a single, anonymized dataset. While this is a limitation of the study, the authors could strengthen the discussion by more clearly positioning the framework as a general anomaly detection pipeline, suggesting how it might be adapted with domain-specific features for other applications, and reporting performance metrics on other publicly available datasets.

Is the work clearly and accurately presented and does it cite the current literature?

Partly

If applicable, is the statistical analysis and its interpretation appropriate?

Yes

Are all the source data underlying the results available to ensure full reproducibility?

Yes

Is the study design appropriate and is the work technically sound?

Are the conclusions drawn adequately supported by the results?

Partly

Are sufficient details of methods and analysis provided to allow replication by others?

Yes

Reviewer Expertise:

Credit Card Fraud Detection, Time Series Forecasting, Anomaly Detection

I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.

Mohammad Shanaa, Computer Science, The British University in Dubai, Dubai, United Arab Emirates

Competing interests: No competing interests were disclosed.

7 December 2025

We sincerely thank the reviewer for their detailed, insightful, and constructive feedback. We have carefully considered each point and submitted a revised manuscript to clarify the theoretical justification, methodological rationale, and generalizability of the XRAI framework. Below, we provide a point-by-point response to each major concern.

1. Fundamental Flaw in Preprocessing Methodology (PCA Application)

Reviewer Comment:

The creditcard.csv dataset's primary features (V1–V28) are already the result of PCA transformation. Applying PCA again to these components is conceptually flawed… The authors must either remove this step or provide a strong theoretical justification for this unconventional approach.

Author Response:

We thank the reviewer for highlighting this important point. We acknowledge that the creditcard.csv dataset already includes PCA-derived features (V1–V28). In our framework, PCA was not used to perform additional feature extraction, but rather as a numerical conditioning and normalization step to ensure that all sub-models within the XRAI ensemble operate on a consistent, decorrelated feature space. This approach is supported by previous studies that emphasize the role of PCA and whitening transformations in improving feature orthogonality, numerical stability, and model convergence (Jolliffe & Cadima, 2016; Kessy, Lewin & Strimmer, 2018).

To clarify this, we have explicitly revised the manuscript to state:

“Although the creditcard.csv dataset already contains PCA-derived features, we applied an additional PCA/whitening step purely as a normalization and conditioning layer so that all XRAI sub-models operate on a consistent, decorrelated feature space; this use of PCA/whitening for orthogonalization and numerical stability is well-established in the literature (Jolliffe & Cadima, 2016; Kessy, Lewin & Strimmer, 2018).”

2. Unjustified Ensemble Architecture and Model Selection

Reviewer Comment:

The rationale for the specific composition of the XRAI model is unclear and seems arbitrary.

Author Response:

We appreciate this observation and have expanded the discussion to clarify the theoretical complementarity of the selected models. Each model contributes a distinct strength within the hybrid architecture. To clarify this, we have explicitly revised the manuscript to state:

“The XRAI framework integrates four complementary models, each contributing distinct capabilities that collectively enhance anomaly-detection performance. The supervised learners, XGBoost and Random Forest, establish strong foundational decision boundaries and improve model stability. XGBoost offers high precision and interpretability, providing calibrated scoring that anchors the ensemble’s primary classification behavior, while Random Forest enhances robustness by reducing variance and mitigating overfitting, thereby strengthening generalization. To detect anomalies that supervised models may miss, XRAI incorporates two unsupervised detectors: the Autoencoder, which is highly sensitive to structural irregularities and identifies latent deviations within the data, and the Isolation Forest, which excels at capturing rare or extreme outliers and ensuring broad boundary-level coverage. This structure was inspired by hybrid anomaly-detection principles found in recent research (Carcillo et al., Information Sciences, 2021; Liu et al., HBEM, 2023), which demonstrate that combining precision-oriented supervised models with sensitivity-oriented unsupervised models improves recall without inflating false positives; this justification has been added to the Methods section. Together, these components form a tightly integrated ensemble that delivers a more reliable and comprehensive detection mechanism.”

3. Inclusion of a Poorly Performing Model (Isolation Forest)

Reviewer Comment:

The Isolation Forest yields extremely low precision and F1-score. Its inclusion in the final ensemble is counterintuitive and requires justification.

Author Response:

We agree that Isolation Forest alone performs poorly on imbalanced datasets due to excessive false positives. However, its extremely high recall (0.95) makes it valuable in a weighted ensemble. Its low individual precision was penalized during ensemble weighting, but its sensitivity helped ensure that rare fraud instances, often missed by purely supervised models, were not overlooked.

This approach is consistent with ensemble learning literature emphasizing the inclusion of high-recall “weak detectors” to prevent false negatives in rare-event detection tasks (Debener et al., Journal of Risk and Insurance, 2023; Meduri, Int. J. Sci. Res. Arch., 2024).

We have clarified this in the revised manuscript:

“In summary, the Isolation Forest algorithm is a robust method for detecting anomalies in financial datasets, particularly effective in high-dimensional spaces, with parameter tuning playing a critical role in optimizing its performance. Its computational efficiency also makes it well-suited for large datasets, and although it can be individually noisy, its high sensitivity to rare and extreme anomalies remains a valuable asset. For this reason, within the hybrid XRAI framework, the Isolation Forest was assigned a relatively low ensemble weight but retained to ensure broad boundary coverage and strengthen robustness against unseen fraud patterns. Despite its limitations, integrating Isolation Forest with complementary methods in a hybrid ensemble significantly enhances overall anomaly-detection capability.”

4. Redundancy of Supervised Models (XGBoost and Random Forest)

Reviewer Comment:

Including both Random Forest and XGBoost adds unnecessary complexity without clear benefit.

Author Response:

We thank the reviewer for raising this valid concern. While both models are tree-based, they exhibit complementary learning biases:

XGBoost reduces bias via gradient boosting, excelling in fine-grained pattern detection;

Random Forest reduces variance via bagging, enhancing stability and resistance to overfitting.

Their combination thus enhances both precision and generalization — a strategy validated in comparative ensemble studies (Murat et al., BTOUPhMath, 2023; Liu, HBEM, 2023). We have revised the “Hybrid Integration” subsection to explain this rationale explicitly.

5. Marginal Performance Gain and Novelty of Contribution

Reviewer Comment:

The model’s performance gain is marginal (<1% F1 improvement) and the novelty claim is overstated.

Author Response:

We acknowledge that the numerical F1 gain appears modest; however, in highly imbalanced domains, even fractional improvements can translate into substantial real-world cost savings. For example, in large-scale financial operations, a 0.8% improvement in fraud detection precision can prevent hundreds of false alerts per million transactions, enhancing customer trust and reducing manual review overhead.

Furthermore, the novelty of the study lies not solely in the combination of four algorithms but in:

The weighted ensemble optimization mechanism that balances supervised and unsupervised outputs.

The open-source reproducibility framework (GitHub) ensuring transparency and reusability.

The focus on real-time operational applicability of hybrid detection pipelines.

These contributions align with F1000Research’s emphasis on reproducibility and practical impact. We have revised the “Conclusion and Future Work” section to emphasize these aspects more clearly.

6. Motivation for Using BorderlineSMOTE

Reviewer Comment:

The choice of BorderlineSMOTE is stated but not motivated.

Author Response:

We appreciate this important point and have added a detailed explanation.

BorderlineSMOTE was selected because fraudulent transactions in credit card data typically occur near class decision boundaries, making standard SMOTE or ADASYN less effective. BorderlineSMOTE focuses on minority samples close to the borderline region, generating more realistic synthetic examples and improving model sensitivity without introducing noise.

This choice follows findings by Han et al. (2005) and recent credit card fraud studies such as Noviandy et al. (2023) and Zhang et al. (2022), which demonstrated superior F1-scores using BorderlineSMOTE for imbalanced fraud data. The rationale and supporting citations have been added to the “Data Preprocessing and Class Imbalance” section.

7. Generalizability and Broader Applications

Reviewer Comment:

The framework was tested on a single dataset and may not generalize well.

Author Response:

We fully agree and appreciate this suggestion. We have revised the Conclusion section to position XRAI as a general anomaly detection framework adaptable to multiple domains beyond fraud detection.

We have added the following text:

“Although validated on the creditcard.csv dataset, XRAI’s architecture is domain-agnostic and can be readily adapted to other anomaly detection contexts such as cybersecurity intrusion detection, insurance fraud, and healthcare anomaly analysis. Future work will involve evaluating the model on diverse benchmark datasets, including IEEE-CIS Fraud Detection and UNSW-NB15, to establish its cross-domain generalizability.”

· Carcillo, F., Le Borgne, Y.-A., Caelen, O., Bontempi, G. (2021). Combining unsupervised and supervised learning in credit card fraud detection. Information Sciences, 557, 317–331.

· Han, H., Wang, W.-Y. & Mao, B.-H. (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Advances in Intelligent Computing (ICIC 2005), LNCS 3644, 878–887. Springer.

· Huang, Y., Wang, S., Hu, Y., et al. (2021). A robust anomaly detection algorithm based on principal component analysis. Intelligent Data Analysis, 25(6), 1331–1348. https://doi.org/10.3233/IDA-195054.

· Jolliffe, I.T. & Cadima, J. (2016). Principal component analysis: A review and recent developments. Philosophical Transactions of the Royal Society A, 374(2065), 20150202. https://doi.org/10.1098/rsta.2015.0202.

· Kessy, A., Lewin, A. & Strimmer, K. (2018). Optimal whitening and decorrelation. The American Statistician, 72(4), 309–314. https://doi.org/10.1080/00031305.2016.1277159.