Utilizing Machine Learning and causal graph approaches to Address Confounding Factors in Health Science Research: A Scoping Review

Ahmed Hossain

doi:10.12688/f1000research.159632.2

Systematic Review

Revised

[version 2; peer review: 2 approved, 1 approved with reservations]

Ahmed Hossain ^1,2

PUBLISHED 10 Sep 2025

¹ Healthcare Management, University of Sharjah, Sharjah, Sharjah, United Arab Emirates
² Public Health, North South University, Dhaka, Dhaka Division, 1229, Bangladesh

Ahmed Hossain
Roles: Conceptualization, Methodology, Project Administration, Resources, Software, Validation, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

This article is included in the Public Health and Environmental Health collection.

Abstract

Confounding can significantly distort the findings of studies examining cause-and-effect relationships, especially in etiological research. To mitigate this issue, researchers must carefully assess potential confounding variables that may relate to both the exposure and outcome but are not directly influenced by the exposure itself. It is essential that these variables truly impact the outcome rather than simply being correlated with the exposure to avoid false associations. Strengthening confidence in the actual relationship between exposure and outcome requires an understanding of biological mechanisms and the application of various methods to adjust for confounders. The oversight of confounding often arises from inappropriate statistical tests and the aggregation of data across multiple studies. This scoping review article discusses the challenges posed by confounding and presents machine learning approaches for effective control in health science research. Directed acyclic graphs (DAGs) serve as causal graph tools to identify potential confounding variables in health research. By mapping presumed relationships between variables, DAGs enable researchers to estimate causal effects more accurately. While traditional methods such as randomization, matching, and stratification remain effective for controlling confounding, newer techniques like latent variable modeling with negative controls and machine learning methods such as LASSO, Ridge regression, and random forests offer enhanced flexibility and adaptability.

Keywords

public health; bias; confounding; directed acyclic graphs; correlation; causal effects.

Corresponding author: Ahmed Hossain

Competing interests: No competing interests were disclosed.

Grant information: The author(s) declared that no grants were involved in supporting this work.

Copyright: © 2025 Hossain A. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Hossain A. Utilizing Machine Learning and causal graph approaches to Address Confounding Factors in Health Science Research: A Scoping Review [version 2; peer review: 2 approved, 1 approved with reservations]. F1000Research 2025, 14:129 (https://doi.org/10.12688/f1000research.159632.2) First published: 27 Jan 2025, 14:129 (https://doi.org/10.12688/f1000research.159632.1) Latest published: 10 Sep 2025, 14:129 (https://doi.org/10.12688/f1000research.159632.2)

Revised Amendments from Version 1

This revision significantly strengthens the manuscript by implementing key enhancements across three critical domains. To bolster methodological transparency, we have elaborated on the systematic review process, providing a comprehensive account of the search strategy, explicit inclusion/exclusion criteria, and a detailed description of the study selection procedure. Furthermore, the statistical rigor has been substantially improved through the incorporation of new comparative analyses and empirical evidence that evaluate machine learning methods against traditional confounder control approaches. Finally, the discussion and conclusions have been fortified to deliver clear, actionable recommendations for researchers and to pinpoint specific, valuable directions for future investigation.

See the author's detailed response to the review by Vipin Vageriya

Introduction

Variables are essential in health research, serving as building blocks for understanding factors influencing health outcomes. They encompass measurable characteristics, attributes, or events that impact health. In health research, distinguishing between dependent and independent variables is fundamental. Often, research involves multiple interacting factors, requiring careful consideration and control to isolate the specific effect of the independent variable on the outcome. For example, in studying the relationship between economic stressors and mental health, variables like economic hardship, financial threat, financial well-being, depression, anxiety, and stress can be considered.¹

Identifying and addressing confounding variables is crucial for ensuring the validity and reliability of research findings in health science. Confounding variables can distort the true relationship between the independent and dependent variables, leading to biased results and inaccurate conclusions.² Failing to account for confounding often occurs due to improper statistical analyses and data aggregation across multiple studies. This emphasizes the pressing need to develop and implement effective control strategies to mitigate the impact of confounding variables in health science research.

Establishing a causal relationship between independent and dependent variables requires careful research design and analysis. Correlation does not imply causation. For example, a study may find a strong association between employment opportunities and mental health symptoms, but this does not necessarily mean that employment causes mental health issues.³ Understanding the distinction between independent and dependent variables is crucial for interpreting research findings in health and other fields. Addressing these challenges ensures the validity and reliability of research findings, which can have significant implications for clinical practice and policy-making.

To minimize the influence of confounding variables, researchers employ various strategies throughout the study design and analysis phases. One key approach is randomization, particularly in clinical trials, where participants are randomly assigned to different intervention groups.⁴ This method helps distribute confounders evenly among the groups, thereby reducing their potential impact. Another strategy is stratification, which involves dividing the study population into subgroups based on the confounding variable and analyzing these subgroups separately.⁵ Matching is also a valuable technique where participants with similar confounding characteristics are paired in different study groups to control for those variables.⁶

Multivariate statistical methods, such as regression analysis, are extensively used to adjust for multiple confounders simultaneously.^7–10 These techniques allow researchers to isolate the effect of the independent variable on the dependent variable while accounting for the influence of confounders. Sensitivity analysis can also be conducted to assess the robustness of study findings to potential confounding. Outcome regression is an innovative way to control for confounding variables by building a statistical model that predicts the outcome variable while accounting for the influence of confounding variables. Standardization involves transforming the data to make the confounding variables comparable across groups.

Implementing control strategies requires careful planning and a thorough understanding of the research context. Researchers must identify potential confounders during the study design phase and select appropriate methods to control them. Transparent reporting of confounder control is crucial for replication and validation. Traditional methods like randomization, matching, and stratification are effective, but newer techniques like latent variable modeling and machine learning offer more nuanced approaches. Directed acyclic graphs can also depict relationships between variables, facilitating unbiased causal effect estimation.

Addressing confounding is vital for the integrity of health science research. By employing robust control strategies, researchers can enhance the accuracy of their findings, contributing to more effective health interventions and policies. As health science evolves, ongoing efforts to refine machine learning methodologies and causal graph approaches will be essential in advancing our understanding of complex health phenomena and improving public health outcomes. This scoping review discusses selected machine learning techniques and causal graph approaches for controlling confounding variables in multivariable data.

Methods

Search strategy

We conducted a scoping review of peer-reviewed articles published between January 1, 2010, and December 31, 2023, in the PubMed and Google Scholar databases. The search was limited to English-language articles and focused on identifying confounding variables with causal graph and machine learning approaches in health science research. The search combined controlled vocabulary terms (e.g., MeSH) and free-text keywords related to machine learning, causal graphs, confounding, and health sciences. The review adhered to PRISMA guidelines; the completed checklist is available at https://osf.io/krcxt/.

Selection criteria

This scoping review identified novel methods for addressing confounding variables in health science research: Directed Acyclic Graphs (DAGs) and machine learning techniques (LASSO, Ridge regression, and random forests). Studies were included if they employed these methods for confounder control. Studies were excluded if they did not explicitly address confounding or bias, focused exclusively on simulations without health science applications, or were non-peer-reviewed publications such as editorials, commentaries, or abstracts without full text. Any discrepancies in article selection were resolved by the sole arbiter, AH.

Study selection process

All identified citations were imported into EndNote for deduplication and management. The study selection process involved a two-phase screening: first of titles and abstracts, followed by a full-text assessment of potentially eligible studies. The entire process, including the reasons for exclusion, was documented using a PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flow diagram.

Confounding variables

A confounding variable, also known as a confounder, plays a pivotal role in epidemiological studies, influencing the association between the independent variable (the factor under investigation) and the dependent variable (the disease or outcome of interest).^2,11 This additional factor is correlated with both the disease and the independent variable, potentially distorting or masking the true effect of the primary variable on the disease being studied.¹² Confounding factors are not restricted to variables that directly affect both the exposure and the outcome; they can exhibit diverse relationships with the exposure and outcome. For example:

1. A confounder can influence the exposure and the outcome directly.
2. It can affect the exposure and be influenced by another factor that affects the outcome.
3. Alternatively, it can affect the outcome and be influenced by another factor that affects the exposure.

The article “On the definition of a confounder” by VanderWeele and Shpitser (2014) clarifies the concept of confounding and introduces the term “surrogate confounder”.¹³ For example, consider a study of caffeine intake and heart disease risk that failed to account for physical activity, a known confounder. Since individuals who exercise regularly tend to consume more caffeine and have a lower heart disease risk, physical activity serves as a surrogate confounder in the study. However, using physical activity level as a surrogate for the unmeasured confounder may not fully address the confounding effect, leading to potential bias in the study results.

To illustrate a confounding variable, as shown in Figure 1, consider a hypothetical scenario where a researcher investigates the association between coffee consumption and heart disease. The initial hypothesis suggests that coffee drinkers might have a higher prevalence of heart disease compared to coffee non-drinkers. However, a confounding variable, such as smoking, could distort the relationship. It is observed that coffee drinkers in the study also tend to smoke more cigarettes than coffee non-drinkers. Consequently, smoking becomes a confounding variable in this analysis, creating ambiguity about whether the increased heart disease risk is genuinely associated with coffee consumption or if it is attributable to the confounding variable of smoking.

Figure 1. Example of a confounding variable.

This situation underscores the complexity of untangling causation in epidemiological studies, especially in the absence of experimental designs. Due to various constraints like technical, ethical, or financial considerations, researchers often rely on observational studies in public health. Understanding and accounting for confounding variables are critical in these studies to draw accurate and meaningful conclusions about causal relationships. It is essential to conduct meticulous epidemiological studies, carefully considering potential confounders, to inform the development of effective preventive measures in public health.

Here are some more examples of confounding variables:

1. While smoking is strongly associated with ischemic heart disease, income level can be a confounder. High-income individuals might have better access to healthcare and healthier lifestyles, leading to lower heart disease rates, even if income level is positively associated with smoking.
2. Research exploring the impact of exercise on depression could be confounded by social support, as individuals with stronger social support are more likely to exercise and have lower depression rates. Social support provides encouragement, motivation, and positive reinforcement, which can help individuals initiate and maintain regular exercise routines.
3. Studies investigating the link between diet and lung cancer risk may face confounding from smoking, as smokers are more likely to have an unhealthy diet and a higher risk of lung cancer. Smoking is associated with other unhealthy lifestyle behaviors, such as poor diet, alcohol consumption, and a sedentary lifestyle, and these factors may confound the association between diet and lung cancer risk.

Effect of confounding variables in health research

Confounding variables pose a significant challenge in health research by influencing both the independent and dependent variables, potentially distorting the observed relationships, and leading to misleading results, either falsely positive or falsely negative.¹⁴ Upward and downward confounding refer to the direction of bias that occurs when an extraneous variable is not adequately controlled for in a study.⁸

Upward confounding: Upward confounding occurs when a positively associated confounding variable is left uncontrolled, causing the effect of the exposure on the outcome to be overestimated.¹⁵ In other words, the observed association between the exposure and the outcome appears stronger than it actually is because the confounder inflates the apparent effect size.

Downward confounding: Downward confounding arises when a negatively associated confounding variable is left uncontrolled, causing the impact of the exposure on the outcome to be underestimated and the observed association to appear weaker than it is.¹⁶ Failure to adequately address this confounder can obscure the true relationship between the exposure and the outcome.
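The direction of the bias can be made explicit in a simple linear setting. The expression below is a sketch under assumptions that are not part of the original text: a linear outcome model with a single confounder C, where the symbols \alpha, \beta, \gamma, and \varepsilon are illustrative. It shows what the unadjusted regression of Y on X alone converges to when C is omitted.

```latex
Y = \alpha + \beta X + \gamma C + \varepsilon,
\qquad
\hat{\beta}_{\text{unadjusted}} \;\xrightarrow{\;p\;}\;
\beta + \gamma \, \frac{\operatorname{Cov}(X, C)}{\operatorname{Var}(X)}
```

For a positive true effect \beta, the estimate is biased upward when \gamma \operatorname{Cov}(X, C) > 0 (the confounder moves with the exposure and pushes the outcome in the same direction) and downward when \gamma \operatorname{Cov}(X, C) < 0.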

In epidemiology, when discussing the relationship between exposure and outcome, some factors are considered “on the causal pathway.” This means that these factors are intermediate steps through which the exposure leads to the outcome. Adjusting for these factors in statistical analyses could essentially remove the effect of interest because they are part of the sequence of events linking the exposure and outcome.

To illustrate with an example, let’s consider smoking as an exposure and lung cancer as the outcome variable. If we view chronic cough as an intermediate step along the causal pathway between smoking and lung cancer, adjusting for chronic cough in the analysis could potentially obscure the genuine association between smoking and lung cancer. This is because chronic cough is influenced by smoking and, consequently, plays a role in the development of lung cancer.

In summary, when some factors are part of the causal pathway, adjusting for them in statistical analyses may not be appropriate as it can alter the interpretation of the relationship between exposure and outcome.
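The effect of adjusting for a factor on the causal pathway can be illustrated with a small simulation. This is a minimal sketch under assumed conditions, not an analysis from the review: the variables are generic (an exposure, a mediator, and an outcome), the coefficients 0.8 and 0.5 are arbitrary, and the exposure affects the outcome only through the mediator. The helper names are hypothetical.

```python
import random


def slope(x, y):
    """Least-squares slope of y on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    s_xx = sum((a - mx) ** 2 for a in x)
    return s_xy / s_xx


def slope_adjusted(x, m, y):
    """Coefficient of x from least squares of y on x and m (with an intercept)."""
    n = len(x)
    mx, mm, my = sum(x) / n, sum(m) / n, sum(y) / n
    s_xx = sum((a - mx) ** 2 for a in x)
    s_mm = sum((b - mm) ** 2 for b in m)
    s_xm = sum((a - mx) * (b - mm) for a, b in zip(x, m))
    s_xy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    s_my = sum((b - mm) * (c - my) for b, c in zip(m, y))
    return (s_mm * s_xy - s_xm * s_my) / (s_xx * s_mm - s_xm ** 2)


def simulate(n=20000, seed=0):
    """Assumed structure: exposure -> mediator -> outcome, no direct effect."""
    rng = random.Random(seed)
    exposure = [rng.gauss(0, 1) for _ in range(n)]
    mediator = [0.8 * e + rng.gauss(0, 1) for e in exposure]
    outcome = [0.5 * m + rng.gauss(0, 1) for m in mediator]
    return slope(exposure, outcome), slope_adjusted(exposure, mediator, outcome)


# The total effect of the exposure is 0.8 * 0.5 = 0.4 by construction.
total_effect, effect_given_mediator = simulate()
print(f"unadjusted estimate: {total_effect:.3f}")
print(f"estimate after adjusting for the mediator: {effect_given_mediator:.3f}")
```

In this setup the unadjusted estimate recovers the total effect (about 0.4), while the estimate that conditions on the mediator falls to roughly zero, which is the sense in which adjusting for an intermediate can remove the effect of interest.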

Machine learning approaches for confounding control

Newer techniques such as latent variable modeling with negative controls, inverse probability of treatment weighting (IPTW), and g-estimation offer more flexibility in applying standardization for confounding control.¹⁷ The latent variable approach introduces a latent variable to represent unobserved confounding factors and assumes that these factors affect both the exposure and the negative control (a variable unrelated to the outcome) to the same degree.
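To make the idea of IPTW concrete, the following sketch applies it to a hypothetical table of counts with one binary confounder. All numbers are invented for illustration, and the propensity score is estimated simply as the proportion treated within each confounder level, which is an assumption of this example rather than a method described in the review.

```python
# (confounder level, treated?) -> (number of people, number with the outcome)
# Hypothetical counts, chosen so that treatment is more common when the
# confounder is present and the confounder itself raises the outcome risk.
COUNTS = {
    (0, 1): (200, 20),
    (0, 0): (800, 160),
    (1, 1): (800, 320),
    (1, 0): (200, 100),
}


def propensity(c):
    """Estimated probability of treatment given confounder level c."""
    treated = COUNTS[(c, 1)][0]
    total = treated + COUNTS[(c, 0)][0]
    return treated / total


def crude_risk(a):
    """Outcome risk in treatment group a, ignoring the confounder."""
    n = sum(COUNTS[(c, a)][0] for c in (0, 1))
    events = sum(COUNTS[(c, a)][1] for c in (0, 1))
    return events / n


def iptw_risk(a):
    """Outcome risk in treatment group a after inverse probability weighting."""
    weighted_n = 0.0
    weighted_events = 0.0
    for c in (0, 1):
        p = propensity(c)
        # Treated people get weight 1/p, untreated people get 1/(1 - p).
        weight = 1 / p if a == 1 else 1 / (1 - p)
        n, events = COUNTS[(c, a)]
        weighted_n += weight * n
        weighted_events += weight * events
    return weighted_events / weighted_n


crude_rd = crude_risk(1) - crude_risk(0)
iptw_rd = iptw_risk(1) - iptw_risk(0)
print(f"crude risk difference: {crude_rd:+.2f}")
print(f"IPTW risk difference:  {iptw_rd:+.2f}")
```

With these counts the crude comparison suggests that treatment raises the risk, whereas the weighted comparison, which balances the confounder across the two groups, shows a lower risk among the treated, matching the within-stratum differences.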

Machine learning techniques like Least Absolute Shrinkage and Selection Operator (LASSO), Ridge regression, and random forests play a pivotal role in identifying confounding variables and mitigating bias, especially in large healthcare datasets where unmeasured confounding may exist.¹⁸ The same source also noted that hybrid methods, combining traditional techniques like stepwise regression, directed acyclic graphs, and knowledge-based approaches with machine learning, showed promising results.

LASSO applies a penalty to the model’s coefficients, resulting in shrinkage and potentially reducing some coefficients to zero.¹⁹ This sparsity characteristic simplifies the model and enhances interpretability by retaining only the most relevant features. Ridge regression also employs regularization to prevent overfitting but differs from LASSO in that it does not reduce coefficients to exactly zero, making it a more stable option when managing highly correlated features.²⁰
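The difference between the two penalties can be seen in a simplified case. The sketch below assumes orthonormal predictors, so that each coefficient can be solved separately from its unpenalized (least-squares) value b: LASSO then minimizes ½(b − β)² + λ|β| and Ridge minimizes ½(b − β)² + (λ/2)β². The variable names and starting coefficients are hypothetical and serve only to show the shrinkage pattern.

```python
def lasso_shrink(b, lam):
    """Soft-thresholding: the LASSO solution for one coefficient
    when predictors are orthonormal (an assumption of this sketch)."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0  # small coefficients are set exactly to zero


def ridge_shrink(b, lam):
    """Ridge solution for one coefficient under the same assumption:
    proportional shrinkage that never reaches exactly zero."""
    return b / (1.0 + lam)


# Hypothetical unpenalized coefficients for four candidate covariates.
unpenalized = {"age": 2.0, "income": -0.8, "noise_1": 0.3, "noise_2": -0.1}
lam = 0.5

lasso = {name: lasso_shrink(b, lam) for name, b in unpenalized.items()}
ridge = {name: ridge_shrink(b, lam) for name, b in unpenalized.items()}

for name in unpenalized:
    print(f"{name:8s} OLS={unpenalized[name]:+.3f} "
          f"LASSO={lasso[name]:+.3f} Ridge={ridge[name]:+.3f}")
```

In this example LASSO drops the two weak covariates entirely and keeps the two strong ones with reduced magnitude, while Ridge keeps all four but pulls each toward zero by the same factor.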

Random Forests, on the other hand, evaluate the significance of each feature (or confounder) in predicting the outcome.²¹ They excel at capturing complex, non-linear relationships between variables and are robust against outliers and noise in the data. By employing these diverse strategies, researchers can enhance the reliability and robustness of their findings in health science research.

Machine learning should be viewed not as a replacement for causal theory but as a powerful complement to it. While traditional methods remain sufficient and more interpretable for simpler problems, the growing complexity and high dimensionality of modern data make ML-based approaches, when embedded within a causal framework, more effective for addressing confounding and producing robust causal estimates. The focus is therefore shifting from a dichotomy of ‘ML versus traditional methods’ to the challenge of optimally integrating ML into the causal inference pipeline.

Identifying confounders by causal graphs

Causal graphs, also known as directed acyclic graphs (DAGs), provide a visual representation of the causal relationships between variables in a given system or phenomenon.²² In the context of evaluating potential confounding bias and other biases in epidemiological studies, causal graphs serve as a gold standard tool.²³ By visually mapping out the relationships between variables, including exposure, outcome, and potential confounders, causal graphs allow researchers to assess the likelihood of confounding and other biases affecting their study results.

Researchers can use causal graphs to identify variables that may act as confounders, mediators, or moderators in the relationship between the exposure and outcome of interest. By including these variables in their analyses or adjusting for them appropriately, researchers can control for potential sources of bias and obtain more accurate estimates of the true causal effect. However, DAGs do not usually depict interactions, that is, how the effect of one variable depends on the level of another. This is because they encode the overall structure of relationships, not the strength or functional form (for example, linear or curved) of those relationships.

We constructed a hypothetical DAG to identify the potential confounders to adjust for when investigating the relationship between smoking and ischemic heart disease. The figure was built with DAGitty (http://www.dagitty.net/dags.html#) and is shown in Figure 2. This graphical method depicts hypothesized causal relationships and deduces the statistical associations implied by them. We consider two potential confounders: income level and age. The minimal sufficient adjustment set is the smallest set of nodes that, once adjusted for, blocks every biasing (backdoor) path between the exposure and the outcome. In this conceptual diagram, each circle represents an individual variable (‘node’) of theoretical relevance to the hypothesis; nodes are connected by directional arrows (‘edges’) that represent theoretical associations based on the researchers’ assessment of the a priori literature and biological plausibility. Smoking is the main exposure of interest (green node with black border), and ischemic heart disease (blue node with black border) is the outcome of interest. In this instance, all the other nodes are theoretically causally associated with (i.e., ancestors of) both the exposure and the outcome. These ‘adjusted variables’ can then be introduced into the multivariable model as potential confounders. In this example, the minimal sufficient adjustment set for estimating the total effect of Smoking on Ischemic heart disease contains Income level.

Figure 2. DAG demonstrating causal relationships and potential biasing pathways affecting the association between smoking and ischemic heart disease.
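The search for an adjustment set can also be sketched in code. The example below is a simplified illustration, not DAGitty's algorithm: it enumerates backdoor paths in a small DAG and looks for the smallest sets of covariates that block all of them. The edge list is an assumption made for this example (Age → Income, Income → Smoking, Income → IHD, Smoking → IHD) and is not necessarily the structure drawn in Figure 2; the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical DAG for illustration; each pair is (cause, effect).
EDGES = {
    ("Age", "Income"),
    ("Income", "Smoking"),
    ("Income", "IHD"),
    ("Smoking", "IHD"),
}
NODES = sorted({node for edge in EDGES for node in edge})


def neighbours(node):
    """Nodes linked to `node` by an arrow in either direction."""
    return {b for a, b in EDGES if a == node} | {a for a, b in EDGES if b == node}


def descendants(node):
    """All nodes reachable from `node` by following arrows forward."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for a, b in EDGES:
            if a == current and b not in found:
                found.add(b)
                stack.append(b)
    return found


def simple_paths(start, end, path=None):
    """Every path from start to end that visits no node twice."""
    path = path or [start]
    if path[-1] == end:
        yield path
        return
    for nxt in neighbours(path[-1]):
        if nxt not in path:
            yield from simple_paths(start, end, path + [nxt])


def backdoor_paths(exposure, outcome):
    """Paths whose first arrow points into the exposure."""
    return [p for p in simple_paths(exposure, outcome) if (p[1], exposure) in EDGES]


def is_blocked(path, adjusted):
    """A path is blocked by an adjusted non-collider, or by a collider
    that is neither adjusted nor has an adjusted descendant."""
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, mid) in EDGES and (nxt, mid) in EDGES
        if collider:
            if mid not in adjusted and not (descendants(mid) & adjusted):
                return True
        elif mid in adjusted:
            return True
    return False


def is_sufficient(adjusted, exposure, outcome):
    """Backdoor criterion: no descendants of the exposure, all backdoor paths blocked."""
    if adjusted & descendants(exposure):
        return False
    return all(is_blocked(p, adjusted) for p in backdoor_paths(exposure, outcome))


def minimal_adjustment_sets(exposure, outcome):
    """Smallest sufficient sets, found by trying subsets in order of size."""
    candidates = [n for n in NODES if n not in (exposure, outcome)]
    for size in range(len(candidates) + 1):
        hits = [set(c) for c in combinations(candidates, size)
                if is_sufficient(set(c), exposure, outcome)]
        if hits:
            return hits
    return []


print(minimal_adjustment_sets("Smoking", "IHD"))
```

Under the assumed edges, the only backdoor path runs from Smoking through Income to IHD, so adjusting for Income alone is sufficient and Age does not need to be added.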

Identifying confounders using change of an effect size

Historically, researchers used the change in estimate method to identify confounders by observing how the effect size of an exposure changes when potential confounders are adjusted for in the analysis. If the effect size changed substantially, it was considered evidence of confounding.^24–26 Calculating odds ratios in the context of confounding variables involves examining how the association between an exposure and an outcome changes when the influence of a third variable is taken into account. The odds ratio (OR) is a statistical measure used in epidemiology and other research fields to assess the strength and direction of association between two categorical variables. It is commonly employed in case-control studies and logistic regression analyses. The odds ratio is calculated as the ratio of the odds of an event occurring in one group to the odds of the same event occurring in another group. Here are a few examples illustrating the impact of confounding variables on odds ratios:

Example 1:

Consider a scenario where a research study explores the link between smoking (exposure) and ischemic heart disease (outcome). However, age emerges as a potential confounding variable since older individuals are more likely to both smoke and develop ischemic heart disease. To ascertain whether age indeed acts as a confounding variable, odds ratios are calculated both with and without considering age.

• Without considering age: The odds of developing ischemic heart disease among smokers are compared to the odds among non-smokers. Let’s assume the calculated odds ratio is 5.0, signifying that smoking is associated with fivefold higher odds of ischemic heart disease.
• After considering age: A logistic regression model is applied, incorporating age as a variable. The resulting adjusted odds ratio is determined to be 3.0.

If the odds ratio decreases after adjusting for age, this signals that age plays a confounding role, influencing the association between smoking and ischemic heart disease. In this context, the reduction from 5.0 to 3.0 indicates that age was a confounding variable affecting the observed relationship between smoking and ischemic heart disease.
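The comparison of crude and adjusted odds ratios can be reproduced on a small table of counts. The sketch below uses invented numbers and, in place of the logistic regression described above, the Mantel-Haenszel estimator, which pools stratum-specific 2×2 tables; that substitution is made only to keep the example self-contained. Each stratum is given as (exposed cases, exposed non-cases, unexposed cases, unexposed non-cases).

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table: (a/b) / (c/d) = ad / bc."""
    return (a * d) / (b * c)


def mantel_haenszel_or(tables):
    """Mantel-Haenszel pooled odds ratio across strata."""
    tables = list(tables)
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


# Hypothetical counts by age group. Older participants are both more
# often exposed and more often cases, so age confounds the crude estimate.
STRATA = {
    "younger": (10, 90, 10, 270),
    "older": (60, 40, 10, 20),
}

# Collapse the strata into a single table to get the crude odds ratio.
pooled_table = tuple(sum(column) for column in zip(*STRATA.values()))
crude_or = odds_ratio(*pooled_table)
adjusted_or = mantel_haenszel_or(STRATA.values())

for name, table in STRATA.items():
    print(f"{name}: OR = {odds_ratio(*table):.2f}")
print(f"crude OR = {crude_or:.2f}, age-adjusted OR = {adjusted_or:.2f}")
```

Here the odds ratio is 3.0 within each age group, but ignoring age gives a crude value of about 7.8, so the drop after adjustment reflects the imbalance of age between exposed and unexposed participants.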

Example 2:

Suppose research explores the association between regular exercise (exposure) and obesity (outcome). Socioeconomic status (SES) is a potential confounding variable, as it is linked to both exercise habits and obesity. To investigate whether SES is a confounding variable, we calculate the odds ratio with and without considering SES.

• Without considering SES: The odds of obesity among those who engage in regular exercise are compared to the odds among those who do not exercise regularly. Let’s assume the calculated odds ratio is 0.7, indicating that regular exercise is associated with 30% lower odds of obesity.
• After considering SES: After applying a logistic regression model with SES adjustment, the odds ratio is 1.2.

If the odds ratio undergoes significant change after adjusting for SES, it implies that SES functions as a confounding variable affecting the association between exercise and obesity. In this context, a substantial change in the odds ratio from 0.7 to 1.2 suggests that SES indeed played a confounding role, influencing the observed relationship between exercise and obesity.

However, researchers are advised against relying on the change in estimate method to identify confounders, especially when dealing with non-collapsible measures like the odds ratio. For a non-collapsible measure, the adjusted (conditional) and unadjusted (marginal) estimates can differ even when no confounding is present, so a change in the estimate after adjustment is not by itself evidence of confounding. Hence, alternative approaches are recommended for identifying and adjusting for confounding in statistical analyses.
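Non-collapsibility of the odds ratio can be shown with a short numerical example. The counts below are hypothetical and are constructed so that the covariate is perfectly balanced between exposed and unexposed groups, which means it cannot confound the exposure-outcome association; it only affects the outcome risk. Each table is (exposed cases, exposed non-cases, unexposed cases, unexposed non-cases).

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table: (a/b) / (c/d) = ad / bc."""
    return (a * d) / (b * c)


# Two strata of a covariate. Each stratum has 100 exposed and 100
# unexposed people, so the covariate is unrelated to the exposure.
STRATUM_1 = (80, 20, 50, 50)
STRATUM_2 = (50, 50, 20, 80)

# Collapsing over the covariate simply adds the two tables together.
MARGINAL = tuple(x + y for x, y in zip(STRATUM_1, STRATUM_2))

or_stratum_1 = odds_ratio(*STRATUM_1)
or_stratum_2 = odds_ratio(*STRATUM_2)
or_marginal = odds_ratio(*MARGINAL)

print(f"stratum 1 OR = {or_stratum_1:.2f}")
print(f"stratum 2 OR = {or_stratum_2:.2f}")
print(f"marginal OR  = {or_marginal:.2f}")
```

Both stratum-specific odds ratios equal 4.0, yet the odds ratio from the collapsed table is about 3.45. A change-in-estimate rule would flag the covariate as a confounder even though it is independent of the exposure by construction.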

Confounder control: Elimination vs. Inclusion

Controlling for confounders in research is crucial to ensure meaningful conclusions from the data.²⁶ Two main approaches exist: elimination and inclusion. Both have their strengths and weaknesses, so choosing the right one depends on the specific study and data.

Confounder elimination

This method involves excluding or restricting individuals with confounding characteristics from the analysis.²⁷ For example, in studying the relationship between smoking and lung cancer, older adults could be excluded, since age could affect the development of lung cancer. The approach is simple to implement and reduces potential confounding bias. In another hypothetical study examining the association between smoking and ischemic heart disease in community adults, age could act as a confounder. To control for age-related confounding during study design, a straightforward approach would be to implement restriction, for instance by limiting the study to adults aged 60 years and older. While restriction can partially address confounding by age, it may limit the generalizability of study findings to other groups.

However, elimination can lead to smaller sample sizes, reducing generalizability and power. Moreover, it may not eliminate all relevant confounders, potentially missing important effects.

Confounder inclusion

This method involves adding confounding variables as additional predictors in the statistical model.²⁸ For example, in a study of exercise and weight loss, variables such as income level and access to healthy food may be included alongside exercise in the regression analysis.

The multivariable regression analysis can control for multiple confounders simultaneously, potentially revealing more nuanced relationships.^29,30 Moreover, a latent variable strategy with negative controls helps account for hidden factors affecting both exposure and outcome, leading to more accurate estimates of how prenatal factors influence outcomes.¹⁷

As another example, in a clinical trial testing the effectiveness of a new drug, if researchers suspect that age may confound the results, they could control for it by inclusion, treating age as an independent variable. Participants might also be divided into different age groups, and the impact of the drug assessed within each age group separately. This way, the potential influence of age on the results is explicitly considered.

One strength of this method lies in its ability to maintain larger sample sizes, enhancing generalizability and statistical power. However, its successful application necessitates the careful selection of relevant confounders. Additionally, this method demands more intricate analysis, potentially requiring advanced statistical techniques.

Limitations of the study

This study includes inclusion and exclusion criteria determined by a single author, introducing the potential for subjectivity and variability in study selection, as well as researcher bias. This review discusses an overview of the literature rather than an in-depth, critical analysis of individual studies, which limits the depth of evaluation. Additionally, this scoping review does not involve any detailed analysis, restricting its ability to draw definitive conclusions about specific models. The study also did not explore alternative approaches, such as elastic net and boosting, for managing confounding variables, which could have enriched its scope. Despite these limitations, this scoping review provides a valuable foundation for further research, including systematic reviews focused on machine learning techniques for addressing confounding variables. By acknowledging these limitations and adopting rigorous methodological strategies in future studies, researchers can enhance the impact and utility of scoping reviews in this field.

Conclusion

Understanding confounding is crucial for accurate causal inference. By grasping its definitions, control mechanisms, and connections to exchangeability and collapsibility, researchers can design and analyze studies rigorously. To address confounding, meticulous evaluation of potential confounding variables is essential, ensuring they meet all criteria for adjustment without introducing bias. While correlation with the exposure is important, establishing their true influence on the outcome is equally vital to avoid spurious associations. Biological mechanisms and sensitivity analyses can provide valuable insights into the stability of results under different adjustment strategies. Careful assessment and control of confounding in epidemiological studies are crucial for ensuring the accuracy of estimated exposure-outcome associations. Directed acyclic graphs (DAGs) are powerful visual tools in health research. By mapping relationships between variables, DAGs help identify potential confounding factors, leading to more accurate estimates of causal effects. While traditional methods like randomization, matching, and stratification remain valuable, newer techniques like latent variable modeling with negative controls offer even greater flexibility in controlling for confounding. Future research should focus on developing frameworks that seamlessly integrate machine learning with causal inference, ensuring both methodological rigor and interpretability. Emphasis is needed on creating guidelines for selecting when ML offers advantages over traditional approaches, designing hybrid methods that leverage the strengths of both, and validating these approaches across diverse, high-dimensional health datasets. Building user-friendly tools and fostering interdisciplinary collaboration will be essential to translate these advances into practical applications for health science research.

Data availability statement

No data associated with this article.

Reporting guidelines

OSF: PRISMA-P and PRISMA-ScR checklist for “Utilizing Machine Learning and causal graph approaches to Address Confounding Factors in Health Science Research: A Scoping Review.” https://doi.org/10.17605/OSF.IO/KRCXT.³¹

The project contains the following Reporting guidelines data:

• PRISMA-ScR-Fillable-Checklist_AH.docx

Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

References

1. Ali M, Uddin Z, Hossain A: Economic stressors and mental health symptoms among Bangladeshi rehabilitation professionals: A cross-sectional study amid COVID-19 pandemic. Heliyon. 2021; 7(4): e06715.
2. Jager KJ, Zoccali C, Macleod A, et al.: Confounding: what it is and how to deal with it. Kidney Int. 2008; 73(3): 256–260.
3. Hossain A, Baten RBA, Sultana ZZ, et al.: Predisplacement Abuse and Postdisplacement Factors Associated With Mental Health Symptoms After Forced Migration Among Rohingya Refugees in Bangladesh. JAMA Netw. Open. 2021; 4(3): e211801.
4. VanderWeele TJ: Principles of confounder selection. Eur. J. Epidemiol. 2019; 34(3): 211–219.
5. Ali M, Uddin Z, Hossain A: Combined Effect of Vitamin D Supplementation and Physiotherapy on Reducing Pain Among Adult Patients With Musculoskeletal Disorders: A Quasi-Experimental Clinical Trial. Front. Nutr. 2021; 8: 717473.
6. Islam M, Sultana ZZ, Iqbal A, et al.: Effect of in-house crowding on childhood hospital admissions for acute respiratory infection: A matched case-control study in Bangladesh. Int. J. Infect. Dis. 2021; 105: 639–645.
7. Ali M, Ahsan GU, Hossain A: Prevalence and associated occupational factors of low back pain among the bank employees in Dhaka City. J. Occup. Health. 2020; 62(1): e12131.
8. Chowdhury SR, Kabir H, Mazumder S, et al.: Workplace violence, bullying, burnout, job satisfaction and their correlation with depression among Bangladeshi nurses: A cross-sectional survey during the COVID-19 pandemic. PLoS One. 2022; 17(9): e0274965.
9. Ali M, Ahsan GU, Khan R, et al.: Immediate impact of stay-at-home orders to control COVID-19 transmission on mental well-being in Bangladeshi adults: Patterns, Explanations, and future directions. BMC Res. Notes. 2020; 13(1): 494.
10. Hossain A, Niroula B, Duwal S, et al.: Maternal profiles and social determinants of severe acute malnutrition among children under-five years of age: A case-control study in Nepal. Heliyon. 2020; 6(5): e03849.
11. Skelly AC, Dettori JR, Brodt ED: Assessing bias: the importance of considering confounding. Evid. Based Spine Care J. 2012 Feb; 3(1): 9–12.
12. Schober P, Vetter TR: Confounding in Observational Research. Anesth. Analg. 2020; 130(3): 635.
13. VanderWeele TJ, Robinson WR: On the causal interpretation of race in regressions adjusting for confounding and mediating variables. Epidemiology. 2014; 25(4): 473–484.
14. Grimes DA, Schulz KF: Bias and causal associations in observational research. Lancet. 2002; 359(9302): 248–252.
15. Schuster NA, Rijnhart JJM, Bosman LC, et al.: Misspecification of confounder-exposure and confounder-outcome associations leads to bias in effect estimates. BMC Med. Res. Methodol. 2023; 23(1): 11.
16. Liu L, Hou L, Yu Y, et al.: A novel method for controlling unobserved confounding using double confounders. BMC Med. Res. Methodol. 2020; 20(1): 195.
17. Gustavson K, Davey Smith G, Eilertsen EM: Handling unobserved confounding in the relation between prenatal risk factors and child outcomes: a latent variable strategy. Eur. J. Epidemiol. 2022; 37(5): 477–494.
18. Benasseur I, Talbot D, Durand M, et al.: A comparison of confounder selection and adjustment methods for estimating causal effects using large healthcare databases. Pharmacoepidemiol. Drug Saf. 2022; 31(4): 424–433.
19. Wyss R, van der Laan M, Gruber S, et al.: Targeted learning with an undersmoothed LASSO propensity score model for large-scale covariate adjustment in health-care database studies. Am. J. Epidemiol. 2024; 193(11): 1632–1640.
20. Schneeweiss S, Eddings W, Glynn RJ, et al.: Variable Selection for Confounding Adjustment in High-dimensional Covariate Spaces When Analyzing Healthcare Databases. Epidemiology. 2017; 28(2): 237–248.
21. Suk Y, Kang H: Tuning Random Forests for Causal Inference under Cluster-Level Unmeasured Confounding. Multivar. Behav. Res. 2023; 58(2): 408–440.
22. Lipsky AM, Greenland S: Causal Directed Acyclic Graphs. JAMA. 2022; 327(11): 1083–1084.
23. Tennant PWG, Murray EJ, Arnold KF, et al.: Use of directed acyclic graphs (DAGs) to identify confounders in applied health research: review and recommendations. Int. J. Epidemiol. 2021; 50(2): 620–632.
24. Skelly AC, Dettori JR, Brodt ED: Assessing bias: the importance of considering confounding. Evid. Based Spine Care J. 2012 Feb; 3(1): 9–12.
25. Farmer R, Lawrenson R: Lecture notes in Epidemiology and Public Health Medicine. Blackwell Publishing; 2004; 67–68.
26. Bradbury BD, Gilbertson DT, Brookhart MA, et al.: Confounding and control of confounding in nonexperimental studies of medications in patients with CKD. Adv. Chronic Kidney Dis. 2012; 19(1): 19–26.
27. Lipsitch M, Tchetgen Tchetgen E, Cohen T: Negative controls: a tool for detecting confounding and bias in observational studies. Epidemiology. 2010; 21(3): 383–388.
28. Hossain A, Hossain SA, Fatema AN, et al.: Age and gender-specific antibiotic resistance patterns among Bangladeshi patients with urinary tract infection caused by Escherichia coli. Heliyon. 2020; 6(6): e04161.
29. Schneeweiss S, Eddings W, Glynn RJ, et al.: Variable Selection for Confounding Adjustment in High-dimensional Covariate Spaces When Analyzing Healthcare Databases. Epidemiology. 2017; 28(2): 237–248.
30. Hajian Tilaki K: Methodological issues of confounding in analytical epidemiologic studies. Caspian J. Intern. Med. 2012 Summer; 3(3): 488–495.
31. Hossain A: Utilizing Machine Learning and causal graph approaches to Address Confounding Factors in Health Science Research: A Scoping Review. 2024, December 11.


Competing interests

No competing interests were disclosed.

Grant information

The author(s) declared that no grants were involved in supporting this work.

Article Versions (2)

Version 2 (revised). Published: 10 Sep 2025, 14:129. https://doi.org/10.12688/f1000research.159632.2

Version 1. Published: 27 Jan 2025, 14:129. https://doi.org/10.12688/f1000research.159632.1

© 2025 Hossain A. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite this article

Hossain A. Utilizing Machine Learning and causal graph approaches to Address Confounding Factors in Health Science Research: A Scoping Review [version 2; peer review: 2 approved, 1 approved with reservations]. F1000Research 2025, 14:129 (https://doi.org/10.12688/f1000research.159632.2)

NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Key to reviewer statuses

Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.

Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.

Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.

Version 2 (revised), published 10 Sep 2025

Reviewer Report 23 Sep 2025

Franklin Akwasi Adjei, University of Wyoming, Wyoming, USA

Approved

https://doi.org/10.5256/f1000research.187911.r411686

The manuscript presents a comprehensive scoping review of methods used to identify and control confounding variables in health-related research. The rationale for this review is clearly articulated throughout the manuscript. Confounding, the distortion of the observed relationship between an exposure and an outcome by an external variable, is a significant challenge in observational and experimental studies. The author states the need to address confounding to avoid biased conclusions, emphasizing that failure to control confounders can lead to both false-positive and false-negative results. The review's objectives are clearly stated and supported by practical examples and hypothetical scenarios, such as the relationship between smoking, coffee consumption, and heart disease. These examples are both practical and common, justifying the study's objectives.
The manuscript has given sufficient methodological detail to allow replication of the review. The author describes the search strategy comprehensively, including the databases they used and the publication times. The criteria for including and excluding articles were also listed, and PRISMA guidelines were followed. Additionally, the manuscript details the statistical approaches used in the included studies, providing explanations that enable readers to understand, reproduce, and critically evaluate the analytical methods.
The statistical analysis and interpretation presented in the manuscript are appropriate and aligned with the objectives of the review. The authors correctly explain the concepts of confounding variables and the implications of failing to adjust for these variables in health-related studies, such as epidemiological ones.
The review's conclusions are well supported by the evidence and discussion. The manuscript highlights the importance of accurately identifying confounders, assessing their effects on both exposure and outcome variables, and using suitable control strategies. The authors also recognize limitations, such as potential subjectivity in study selection, the focus on a scoping rather than an in-depth review, and the omission of some machine learning methods that could offer further insights.
As a remark, this scoping review has certain limitations that could have been addressed, notably the variability among the included studies. Differences in population characteristics, health outcomes, study designs, and statistical methods make direct comparisons difficult and prevent drawing broad conclusions. Additionally, machine learning and causal inference techniques, including deep learning, causal forests, and boosting, are evolving rapidly, and this may have led to some recent advances being overlooked, thereby creating gaps in the review’s scope. A single author carried out the selection process, which can lead to potential bias or subjectivity in choosing studies. Lastly, since the review broadly covers health science research, the relevance of specific methods can vary across fields like epidemiology, clinical trials, or health services research. Thus, some recommendations may not be fully applicable across all health research domains, limiting their overall generalizability and usefulness.
Despite these constraints, the conclusions are grounded in the reviewed evidence, which supports the review's scientific credibility.

Are the rationale for, and objectives of, the Systematic Review clearly stated? Yes

Are sufficient details of the methods and analysis provided to allow replication by others? Yes

Is the statistical analysis and its interpretation appropriate? Yes

Are the conclusions drawn adequately supported by the results presented in the review? Yes

If this is a Living Systematic Review, is the ‘living’ method appropriate and is the search schedule clearly defined and justified? (‘Living Systematic Review’ or a variation of this term should be included in the title.) Not applicable

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Public health, environmental health

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Version 1, published 27 Jan 2025

Reviewer Report 08 Sep 2025

Vipin Vageriya, Charotar University of Science and Technology, Changa, Gujarat, India

Approved

https://doi.org/10.5256/f1000research.175391.r411684

The causal graph approach is important to establish or check the relationship between two variables. DAGs are generally used to model dependencies on variables and workflows, such as task scheduling and data processing. The objective of the study is clear. The study conclusion is also ...

Are the rationale for, and objectives of, the Systematic Review clearly stated? Yes

Are sufficient details of the methods and analysis provided to allow replication by others? Yes

Is the statistical analysis and its interpretation appropriate? I cannot comment. A qualified statistician is required.

Are the conclusions drawn adequately supported by the results presented in the review? Yes

If this is a Living Systematic Review, is the ‘living’ method appropriate and is the search schedule clearly defined and justified? (‘Living Systematic Review’ or a variation of this term should be included in the title.) Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Child Epilepsy, Growth and Development, Parenting, Child Psychology

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Author Response 12 Sep 2025

Ahmed Hossain, Public Health, North South University, Dhaka, 1229, Bangladesh

Thanks

Competing Interests: No

Reviewer Response 27 Sep 2025

Ali Husnain, Chicago State University, Chicago, USA

This scoping review addresses an important methodological issue in health science research: how to effectively identify and control for confounding variables using both traditional approaches (randomization, matching, stratification) and newer computational methods (machine learning and causal graphs). The topic is timely and relevant, and the manuscript succeeds in explaining complex concepts in an accessible way, supported by clear examples.
Strengths

Clear rationale and objective for conducting the review.

Balanced integration of traditional epidemiological techniques with modern ML-based approaches.

Accessible explanations of confounding, DAGs, and ML methods for a broad readership.

Timely contribution to the field given the growing use of ML in public health.

Areas for Improvement

Methodological transparency – The search strategy and inclusion/exclusion criteria should be reported in greater detail for reproducibility. Conducting the review with a single reviewer introduces risk of bias; this needs more discussion or mitigation.

Depth of analysis – The paper functions more as a narrative overview than a critical synthesis. A comparative table summarizing included studies, their methods, and outcomes would enhance rigor.

Statistical rigor – The manuscript would benefit from more comparative or empirical evidence on how ML methods perform against traditional confounder-control techniques.

Scope of ML methods – The focus on LASSO, Ridge, and Random Forests is too narrow. Methods like Elastic Net, boosting, Bayesian models, and double machine learning should be at least acknowledged.

Practical recommendations – The discussion should provide clearer guidance for researchers on when to apply DAGs vs. ML approaches, and highlight potential pitfalls (e.g., collider bias, non-collapsibility).

Minor comments: Improve figure clarity and align terminology with standard epidemiological usage. The abstract could more clearly emphasize the novel contribution of this review.
Recommendation: Accept with major revisions. This article has strong potential but requires expanded methodological detail, deeper comparative analysis, and more practical guidance to maximize its impact.
Competing Interests: none

Reviewer Report 11 Mar 2025

Ali Husnain, Chicago State University, Chicago, Illinois, USA

Approved with Reservations

https://doi.org/10.5256/f1000research.175391.r368463

This scoping review examines the challenges of confounding in health science research and discusses how machine learning and causal graph approaches can mitigate confounding bias. It highlights traditional confounder control methods such as randomization, matching, and stratification and introduces modern techniques, including directed acyclic graphs (DAGs), LASSO regression, Ridge regression, and Random Forests. The paper emphasizes that effective control of confounding is crucial for ensuring the validity of causal inferences in epidemiological studies.
The author conducted a scoping review of peer-reviewed articles published between 2010 and 2023, using PubMed and Google Scholar. The study follows PRISMA guidelines, with an explicit focus on identifying confounding variables through machine learning models and causal graphs. However, the selection process was conducted by a single reviewer, which introduces potential bias. The paper concludes that emerging computational approaches offer flexibility and adaptability in confounder control but acknowledges that further validation and comparative research are necessary. This article presents a valuable discussion on confounder control using machine learning and causal graphs in health science research. However, to improve scientific rigor and reproducibility, the following revisions are necessary:

Enhance methodology transparency by detailing the search strategy, inclusion/exclusion criteria, and study selection process.
Improve statistical rigor by adding comparative analyses and empirical evidence on machine learning methods versus traditional confounder control approaches.
Strengthen discussion and conclusions by providing clear recommendations for researchers and identifying future research directions.

Addressing these concerns will significantly improve the paper’s credibility and contribution to the field.
Overall Rating: Revise and Resubmit

Are the rationale for, and objectives of, the Systematic Review clearly stated? Yes

Are sufficient details of the methods and analysis provided to allow replication by others? Partly

Is the statistical analysis and its interpretation appropriate? Partly

Are the conclusions drawn adequately supported by the results presented in the review? Partly

If this is a Living Systematic Review, is the ‘living’ method appropriate and is the search schedule clearly defined and justified? (‘Living Systematic Review’ or a variation of this term should be included in the title.) Not applicable

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Epidemiology & Public Health; Machine Learning in Healthcare; Biostatistics & Causal Inference; Systematic Review Methodologies

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Recommendation: Accept with major revisions. This article has strong potential but requires expanded methodological detail, deeper comparative analysis, and more practical guidance to maximize its impact.

Competing Interests

none

Reviewer Report

11 Mar 2025 | for Version 1

Ali Husnain, Chicago State University, Chicago, Illinois, USA

Approved With Reservations

Enhance methodology transparency by detailing the search strategy, inclusion/exclusion criteria, and study selection process.
Improve statistical rigor by adding comparative analyses and empirical evidence on machine learning methods versus traditional confounder control approaches.
Strengthen discussion and conclusions by providing clear recommendations for researchers and identifying future research directions.

Addressing these concerns will significantly improve the paper’s credibility and contribution to the field.
Overall Rating: Revise and Resubmit

Are the rationale for, and objectives of, the Systematic Review clearly stated?

Yes
Are sufficient details of the methods and analysis provided to allow replication by others?

Partly
Is the statistical analysis and its interpretation appropriate?

Partly
Are the conclusions drawn adequately supported by the results presented in the review?

Partly
If this is a Living Systematic Review, is the ‘living’ method appropriate and is the search schedule clearly defined and justified? (‘Living Systematic Review’ or a variation of this term should be included in the title.)

Not applicable

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Epidemiology & Public Health; Machine Learning in Healthcare; Biostatistics & Causal Inference; Systematic Review Methodologies

Alongside their report, reviewers assign a status to the article:

Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested

Approved with reservations - a number of small changes, or sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.

Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions

[1] Ali M, Uddin Z, Hossain A: Economic stressors and mental health symptoms among Bangladeshi rehabilitation professionals: A cross-sectional study amid COVID-19 pandemic. Heliyon. 2021; 7(4): e06715.

[2] Jager KJ, Zoccali C, Macleod A, et al.: Confounding: what it is and how to deal with it. Kidney Int. 2008; 73(3): 256–260.

[3] Hossain A, Baten RBA, Sultana ZZ, et al.: Predisplacement Abuse and Postdisplacement Factors Associated With Mental Health Symptoms After Forced Migration Among Rohingya Refugees in Bangladesh. JAMA Netw. Open. 2021; 4(3): e211801.

[4] VanderWeele TJ: Principles of confounder selection. Eur. J. Epidemiol. 2019; 34(3): 211–219.

[5] Ali M, Uddin Z, Hossain A: Combined Effect of Vitamin D Supplementation and Physiotherapy on Reducing Pain Among Adult Patients With Musculoskeletal Disorders: A Quasi-Experimental Clinical Trial. Front. Nutr. 2021; 8: 717473.

[6] Islam M, Sultana ZZ, Iqbal A, et al.: Effect of in-house crowding on childhood hospital admissions for acute respiratory infection: A matched case-control study in Bangladesh. Int. J. Infect. Dis. 2021; 105: 639–645.

[7] Ali M, Ahsan GU, Hossain A: Prevalence and associated occupational factors of low back pain among the bank employees in Dhaka City. J. Occup. Health. 2020; 62(1): e12131.

[8] Chowdhury SR, Kabir H, Mazumder S, et al.: Workplace violence, bullying, burnout, job satisfaction and their correlation with depression among Bangladeshi nurses: A cross-sectional survey during the COVID-19 pandemic. PLoS One. 2022; 17(9): e0274965.

[9] Ali M, Ahsan GU, Khan R, et al.: Immediate impact of stay-at-home orders to control COVID-19 transmission on mental well-being in Bangladeshi adults: Patterns, Explanations, and future directions. BMC Res. Notes. 2020; 13(1): 494.

[10] Hossain A, Niroula B, Duwal S, et al.: Maternal profiles and social determinants of severe acute malnutrition among children under-five years of age: A case-control study in Nepal. Heliyon. 2020; 6(5): e03849.

[11] Skelly AC, Dettori JR, Brodt ED: Assessing bias: the importance of considering confounding. Evid. Based Spine Care J. 2012 Feb; 3(1): 9–12.

[12] Schober P, Vetter TR: Confounding in Observational Research. Anesth. Analg. 2020; 130(3): 635.

[13] VanderWeele TJ, Robinson WR: On the causal interpretation of race in regressions adjusting for confounding and mediating variables. Epidemiology. 2014; 25(4): 473–484.

[14] Grimes DA, Schulz KF: Bias and causal associations in observational research. Lancet (London, England). 2002; 359(9302): 248–252.

[15] Schuster NA, Rijnhart JJM, Bosman LC, et al.: Misspecification of confounder-exposure and confounder-outcome associations leads to bias in effect estimates. BMC Med. Res. Methodol. 2023; 23(1): 11.

[16] Liu L, Hou L, Yu Y, et al.: A novel method for controlling unobserved confounding using double confounders. BMC Med. Res. Methodol. 2020; 20(1): 195.

[17] Gustavson K, Davey Smith G, Eilertsen EM: Handling unobserved confounding in the relation between prenatal risk factors and child outcomes: a latent variable strategy. Eur. J. Epidemiol. 2022; 37(5): 477–494.

[18] Benasseur I, Talbot D, Durand M, et al.: A comparison of confounder selection and adjustment methods for estimating causal effects using large healthcare databases. Pharmacoepidemiol. Drug Saf. 2022; 31(4): 424–433.

[19] Wyss R, van der Laan M, Gruber S, et al.: Targeted learning with an undersmoothed LASSO propensity score model for large-scale covariate adjustment in health-care database studies. Am. J. Epidemiol. 2024; 193(11): 1632–1640.

[20] Schneeweiss S, Eddings W, Glynn RJ, et al.: Variable Selection for Confounding Adjustment in High-dimensional Covariate Spaces When Analyzing Healthcare Databases. Epidemiology. 2017; 28(2): 237–248.

[21] Suk Y, Kang H: Tuning Random Forests for Causal Inference under Cluster-Level Unmeasured Confounding. Multivar. Behav. Res. 2023; 58(2): 408–440.

[22] Lipsky AM, Greenland S: Causal Directed Acyclic Graphs. JAMA. 2022; 327(11): 1083–1084.

[23] Tennant PWG, Murray EJ, Arnold KF, et al.: Use of directed acyclic graphs (DAGs) to identify confounders in applied health research: review and recommendations. Int. J. Epidemiol. 2021; 50(2): 620–632.

[24] Skelly AC, Dettori JR, Brodt ED: Assessing bias: the importance of considering confounding. Evid. Based Spine Care J. 2012 Feb; 3(1): 9–12.

[25] Farmer R, Lawrenson R: Lecture notes in Epidemiology and Public Health Medicine. Blackwell Publishing; 2004; 67–68.

[26] Bradbury BD, Gilbertson DT, Brookhart MA, et al.: Confounding and control of confounding in nonexperimental studies of medications in patients with CKD. Adv. Chronic Kidney Dis. 2012; 19(1): 19–26.

[27] Lipsitch M, Tchetgen Tchetgen E, Cohen T: Negative controls: a tool for detecting confounding and bias in observational studies. Epidemiology. 2010; 21(3): 383–388.

[28] Hossain A, Hossain SA, Fatema AN, et al.: Age and gender-specific antibiotic resistance patterns among Bangladeshi patients with urinary tract infection caused by Escherichia coli. Heliyon. 2020; 6(6): e04161.

[29] Schneeweiss S, Eddings W, Glynn RJ, et al.: Variable Selection for Confounding Adjustment in High-dimensional Covariate Spaces When Analyzing Healthcare Databases. Epidemiology. 2017; 28(2): 237–248.

[30] Hajian Tilaki K: Methodological issues of confounding in analytical epidemiologic studies. Caspian J. Intern. Med. 2012 Summer; 3(3): 488–495.

[31] Hossain A Dr.: Utilizing Machine Learning and causal graph approaches to Address Confounding Factors in Health Science Research: A Scoping Review. 2024, December 11.
