Development of a tool to assess the risk of bias in statistical simulation studies: study protocol

Sarah J Arnup; Simon L Turner; Matthew J Page; Elizabeth Korevaar; Steve McDonald; Julian PT Higgins; Joanne E McKenzie

doi:10.12688/f1000research.184100.1

Study Protocol

[version 1; peer review: awaiting peer review]

Sarah J Arnup¹, Simon L Turner¹, Matthew J Page¹, Elizabeth Korevaar², Steve McDonald³, Julian PT Higgins⁴, Joanne E McKenzie¹

PUBLISHED 30 Jun 2026

Author details

¹ Methods in Evidence Synthesis Unit, Monash University School of Public Health and Preventive Medicine, Melbourne, Victoria, 3004, Australia
² Centre for Epidemiology and Biostatistics, Melbourne School of Population and Global Health, The University of Melbourne Faculty of Medicine Dentistry and Health Sciences, Melbourne, Victoria, Australia
³ Cochrane Australia, Monash University School of Public Health and Preventive Medicine, Melbourne, Victoria, 3004, Australia
⁴ Population Health Sciences, University of Bristol Medical School, Bristol, England, UK

Sarah J Arnup
Roles: Investigation, Methodology, Project Administration, Writing – Original Draft Preparation

Simon L Turner
Roles: Investigation, Methodology, Writing – Review & Editing

Matthew J Page
Roles: Investigation, Methodology, Writing – Review & Editing

Elizabeth Korevaar
Roles: Investigation, Methodology, Writing – Review & Editing

Steve McDonald
Roles: Methodology, Writing – Review & Editing

Julian PT Higgins
Roles: Methodology, Writing – Review & Editing

Joanne E McKenzie
Roles: Conceptualization, Investigation, Methodology, Supervision, Writing – Review & Editing

Abstract

Background

Statistical simulation studies are the principal methodology for examining the performance of statistical methods. Findings from statistical simulation studies guide researchers in their statistical decision making, and inform evidence syntheses of statistical simulation studies, so should provide a fair representation of how the methods are expected to perform.

Objectives

To develop a tool to assess whether statistical simulation studies provide a fair representation of how statistical methods are expected to perform.

Methods

We will undertake a multi-step process to develop a domain-based tool with signalling questions. The project team will consist of a core working group and an international advisory group. We will hold virtual meetings with the advisory group to agree on the scope and content of the tool. We will conduct systematic reviews and cited reference searches (forward and backward citations) to develop an evidence base to inform domains and signalling questions. An initial set of domains will be agreed on with the advisory group. We will then undertake a survey of methodologists and statisticians with expertise in statistical simulation design and intended users of the tool, to seek their views on which domains are most important. We will propose signalling questions for each domain and revise the domains with feedback from the advisory group until domains are agreed. We will pilot the tool with intended users such as consulting statisticians and systematic reviewers of findings from simulation studies. The developed tool and guidance documentation will be published in an open-access journal and disseminated via conferences and workshops. This protocol was registered on the Open Science Framework (OSF) on June 2, 2026 (Registration DOI: https://doi.org/10.17605/OSF.IO/DW4SM).¹

Keywords

statistical simulation, risk of bias, bias assessment tool, protocol, systematic review

Corresponding author: Joanne E McKenzie

Competing interests: No competing interests were disclosed.

Grant information: JEM was supported by an NHMRC Investigator Grant (GNT2009612). SJA, SLT, and EK were funded by the Research Support Package of Joanne E McKenzie’s NHMRC Investigator Grant. MJP is supported by an NHMRC Investigator Grant (GNT2033917). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2026 Arnup SJ et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Arnup SJ, Turner SL, Page MJ et al. Development of a tool to assess the risk of bias in statistical simulation studies: study protocol [version 1; peer review: awaiting peer review]. F1000Research 2026, 15:1038 (https://doi.org/10.12688/f1000research.184100.1) First published: 30 Jun 2026, 15:1038 (https://doi.org/10.12688/f1000research.184100.1) Latest published: 30 Jun 2026, 15:1038 (https://doi.org/10.12688/f1000research.184100.1)

Background

Statistical simulation studies are the principal methodology for examining how well a set of statistical methods performs against a known truth. This is achieved by generating hypothetical data sets based on known characteristics, applying the statistical method(s) to the data sets and comparing the results with the known characteristics.²,³ Statistical simulation can be used to evaluate whether a statistical method performs as intended, to examine the robustness of a statistical method or to compare multiple statistical methods (either as a comparison of previously published methods or to compare the performance of a newly developed method with existing methods) (see Table 1). Statistical simulation is also used for other purposes, such as to construct empirical sampling distributions (e.g., bootstrapped confidence intervals) or to determine the power or sample size when designing a study, although such uses are beyond the scope of the current project.
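The generate, analyse and compare cycle described above can be illustrated with a minimal Python sketch. Everything in it is an assumption chosen for illustration and is not taken from any of the cited studies: the normal data-generating mechanism, the sample mean as the method under evaluation, the parameter values, the normal-approximation confidence interval and the function name `run_simulation`.

```python
import math
import random
import statistics


def run_simulation(true_mean=1.0, sd=2.0, n_obs=50, n_rep=2000, seed=2026):
    """Toy simulation study: how well does the sample mean estimate a known mean?"""
    rng = random.Random(seed)  # one seeded generator for the whole study
    estimates = []
    n_covered = 0
    for _ in range(n_rep):
        # 1. Generate a data set from known characteristics.
        data = [rng.gauss(true_mean, sd) for _ in range(n_obs)]
        # 2. Apply the statistical method under evaluation.
        estimate = statistics.fmean(data)
        std_error = statistics.stdev(data) / math.sqrt(n_obs)
        estimates.append(estimate)
        # 3. Compare the result with the known truth (normal-approximation 95% CI).
        if estimate - 1.96 * std_error <= true_mean <= estimate + 1.96 * std_error:
            n_covered += 1
    return {
        "bias": statistics.fmean(estimates) - true_mean,
        "empirical_se": statistics.stdev(estimates),
        "coverage": n_covered / n_rep,
    }


results = run_simulation()
print(results)
```

The returned dictionary holds three performance metrics of the kind defined later in Table 2 (bias, empirical standard error and confidence interval coverage) for a single scenario; a real study would repeat this over many scenarios and methods.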

Table 1. Example uses of statistical simulation in studies.

Reference	Purpose of simulation study	Description of simulation study
Li, 2024⁴⁴	Evaluate whether a statistical method performs as intended.	Develop a method for small samples that extends standard longitudinal models to accommodate informative observations in clinical studies; simulation was used to show that the proposed estimators performed as expected from theory.
Abbas-Aghababazadeh, 2023⁴⁵	Examine the robustness of a statistical method to violations in the assumptions underpinning the method.	Compare meta-analysis methods for gene-drug associations or biomarker discovery using preclinical pharmacogenomics data; simulation was used to evaluate the performance of the standard meta-analysis methods, which assume independence between included studies, when this assumption was violated.
Jiang, 2023⁴⁶	Compare a newly developed method with multiple previously published statistical methods.	Develop a mediation modelling approach that addresses zero-inflated mediators containing both true zeros and false zeros, and compare this approach to existing standard causal mediation analysis approaches; simulation was used to compare the performance of the approaches across a range of scenarios.
Cho, 2024⁴⁷	Compare multiple previously published statistical methods (sometimes referred to as a neutral comparison study).	Compare existing reliability estimators for single-administration test scores; simulation was used to evaluate the accuracy of each estimator under a range of scenarios.

One of the primary purposes of statistical simulation studies is to guide researchers in statistical decision making, for example, in selecting statistical methodology for a particular scenario. For instance, simulation studies might be used to select which small sample correction to make in a cluster randomised trial with few clusters.⁴ Researchers may refer to individual simulation studies, or evidence syntheses that combine the results from multiple simulation studies (e.g., a systematic review of simulation studies that evaluated the properties of small sample corrections in cluster randomised trials⁵). However, both individual statistical simulation studies and evidence syntheses of them may unfairly represent the true performance of the statistical methods, if the simulation results are biased or unrepresentative of the situation to which they are applied.

Bias is defined as a “systematic error, or deviation from the truth, in the results”.⁶ In the context of statistical simulation studies, the ‘results’ are performance metrics used to quantify how well the statistical methods under evaluation are behaving, and these may be evaluated under multiple scenarios (see Table 2 for definitions and examples of terms in bold). Note that ‘bias’ is a key performance metric commonly used in simulation studies to quantify systematic error in a statistical method’s estimator relative to the true parameter value (see footnote [1]). To distinguish this ‘bias’ from the bias we are referring to, we use the terminology ‘bias in the simulation study results’.

Table 2. Glossary of terms used in this article to describe elements of a statistical simulation study, with an example from Turner et al. 2021†.⁴⁸

The structure of this glossary follows the ADEMP system from Morris et al.² The terms in brackets refer to alternate usage for the term.

Term	Definition	Example from Turner et al. 2021
Statistical simulation study (study)	A statistical simulation study is a computer experiment designed to evaluate a specific aim, using data created from pseudo-random sampling of known probability distributions.	“In this study, we therefore aimed to examine the performance of a range of statistical methods for analysing interrupted time series studies with a continuous outcome using segmented linear models.”
Data-generating mechanism	The data-generating mechanism is the process of using random numbers to generate (simulate) one or more data sets.²	“We simulated continuous data from ITS studies by randomly sampling from a parametric model [a segmented linear regression model], with a single interruption at the midpoint, and first order autoregressive errors. We multiplied the first error term by $\sqrt{\frac{1}{1 - \rho^{2}}}$ [where $\rho$ is the lag-1 autocorrelation of the errors] so that the variance of the error term was constant at all time points.”
Statistical model (model)	A statistical model describes the assumed mathematical relationship between the data points.	“We use a segmented linear regression model with a single interruption, which can be written using the parameterisation [defined below] proposed by Huitema and McKean as: Y_t = β₀ + β₁t + β₂D_t + β₃[t-T_I] D_t + ε_t”
Variable	The variables in the statistical model are the quantities that can vary across data points.	“Y_t represents the continuous outcome variable at time point t of N time points [t is a variable]. D_t is an indicator variable that represents the post-interruption interval (i.e. D_t = 1 (t ≥ T_I)) where T_I represents the time of the interruption [T_I is a parameter, defined below].”
Parameter	The parameters of the statistical model are the fixed quantities that define the data-generating process.	“The model parameters, β₀, β₁, β₂ and β₃ represent the intercept (e.g., baseline rate), slope in the pre-interruption interval, the change in level and the change in slope, respectively. The error term, ε_t, represents deviations from the fitted model.”
Parameter value (Factor)	Parameter values (Factors) are the values given to the parameters underlying the data and other experimental design choices that the researcher specifies in the data-generating mechanism (e.g., the true value of a data characteristic such as the mean).	“We created a range of simulation scenarios including different values of the model parameters and different numbers of data points per series. … All combinations of these parameter values (factors) were simulated, leading to 800 different simulation scenarios.”
Simulation scenario (scenario)	The factors used to specify a data-generating mechanism define a single simulation scenario. There are typically multiple scenarios considered in each statistical simulation study.	“We created a range of simulation scenarios including different values of the model parameters and different numbers of data points per series. … All combinations of these factors were simulated, leading to 800 different simulation scenarios.”
Data set	A data set contains a set of data points. Each simulation scenario uses a unique data-generating mechanism to generate multiple data sets.	“Design parameter values (factors) were combined using a fully factorial design with 10,000 data sets generated per combination.”
Data point	A data point is a single observation, case or record within the data set.	An example data point is t = 20 months, D_t = 1, Y₂₀ = 0.50 C. difficile infections per 1,000 patient-days.
Estimand	The estimand is a population quantity, or true characteristic of the data, that is estimated by the statistical methods in the statistical simulation study.	“The primary estimands of the simulation study are the parameters of the model, β₂ (level change) and β₃ (slope change).”
Statistical method (method)	Statistical method typically refers to a model used for data analysis but can also refer to the procedure used to choose an analysis.² A statistical simulation study may evaluate the performance of a single method or compare the performance of multiple methods.	“We focus on statistical methods that have been more commonly used (Ordinary Least Square (OLS), Generalised Least Squares (GLS), Newey-West (NW), Autoregressive Integrated Moving Average (ARIMA)). In addition, we have included Restricted Maximum Likelihood (REML) (with and without the Satterthwaite adjustment), which although is not a method in common use, is included because of its potential for reduced bias in the estimation of the autocorrelation parameter, as has been discussed for general (non-interrupted) time series.”
Performance metrics	The performance metrics generate the numerical quantities, i.e., the results, used to assess the performance of the method(s) under evaluation (e.g., bias, confidence interval coverage, mean square error).	“The performance of the methods was evaluated by examining bias*, empirical standard error, model-based standard error, 95% confidence interval coverage and power.”

* Bias in this context refers to the difference between the expected value of the estimator for the estimand and the true value of the estimand.

† Some of the text of the original paper has been modified to align with the terminology of the present paper.
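Several of the glossary terms (data-generating mechanism, parameter values, simulation scenario) can be made concrete with a short Python sketch of the segmented linear model with first order autoregressive errors quoted in Table 2. This is an illustrative reconstruction, not the code used by Turner et al.: the function name `simulate_its_series`, the factor values in `factors`, the placement of the interruption at `n_points // 2 + 1` and the use of normally distributed errors are all assumptions.

```python
import itertools
import math
import random
import statistics


def simulate_its_series(n_points, betas, rho, sigma, rng):
    """Simulate one series from Y_t = b0 + b1*t + b2*D_t + b3*(t - T_I)*D_t + e_t."""
    b0, b1, b2, b3 = betas
    t_interrupt = n_points // 2 + 1  # assumption: first post-interruption time point
    outcomes = []
    # Scale the first error so that Var(e_1) = sigma^2 / (1 - rho^2),
    # the stationary variance of the AR(1) process.
    error = rng.gauss(0.0, sigma) * math.sqrt(1.0 / (1.0 - rho**2))
    for t in range(1, n_points + 1):
        if t > 1:
            error = rho * error + rng.gauss(0.0, sigma)  # AR(1) recursion
        d_t = 1 if t >= t_interrupt else 0  # post-interruption indicator
        outcomes.append(b0 + b1 * t + b2 * d_t + b3 * (t - t_interrupt) * d_t + error)
    return outcomes


# Fully factorial design: every combination of factor values is one scenario.
# These factor values are hypothetical, not those of the original study.
factors = {
    "n_points": [12, 24, 48],
    "rho": [0.0, 0.4, 0.8],
    "level_change": [0.0, 1.0],
}
scenarios = [dict(zip(factors, combo)) for combo in itertools.product(*factors.values())]

# Check that the error variance is constant over time (all betas set to zero,
# so each outcome equals its error term).
rng = random.Random(2021)
rho, sigma = 0.6, 1.0
series = [simulate_its_series(24, (0, 0, 0, 0), rho, sigma, rng) for _ in range(4000)]
var_first = statistics.variance([s[0] for s in series])
var_last = statistics.variance([s[-1] for s in series])
var_theory = sigma**2 / (1 - rho**2)
print(len(scenarios), round(var_first, 2), round(var_last, 2), round(var_theory, 2))
```

Without the scaling of the first error term, the variance at early time points would be smaller than at later ones, which is the issue the quoted data-generating mechanism is designed to avoid.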

Bias may occur in the simulation results when, for example, researchers alter the study design after seeing the initial results to favour a preferred method.⁷–¹⁰ Additionally, the composition of the research team can introduce bias if the researchers have varying expertise, experience or preferences regarding the methods being compared.⁷,¹¹–¹⁴ This could result, for example, in study design decisions that favour particular methods,⁷–¹¹,¹³,¹⁴ or identification of implementation errors more readily for some methods over others.⁷,¹¹,¹²

Many choices are made when designing a simulation study, and while some of these choices may not introduce bias into the simulation results, they can still lead to unfair representation of the performance of the methods under evaluation. For example, researchers must choose the statistical methods to be compared, the data-generating mechanism and performance metrics, and the approach used to evaluate the performance of the methods when the simulation results are missing (as occurs, for example, when methods fail to converge for some of the data sets in a particular scenario).²,³,⁸,¹¹,¹⁵,¹⁶ Different design choices can lead to different findings for the performance of a statistical method, which in turn can lead users of statistical simulation studies to make a different decision when selecting statistical methodology. Such researcher choices can also lead to optimism bias, which refers to the tendency of a newly introduced method to perform better in the original publication than in subsequent comparison studies.⁷,¹⁴,¹⁷–¹⁹
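The issue of missing simulation results can be seen in a small Python sketch. It is a toy example under stated assumptions: the unadjusted log odds ratio from a 2×2 table stands in for a method that sometimes returns no estimate (it is undefined when any cell is zero), and the function name `simulate_log_odds_ratios`, the sample size and the event probabilities are hypothetical.

```python
import math
import random


def simulate_log_odds_ratios(n_per_arm=20, p_control=0.1, p_treat=0.1, n_rep=2000, seed=7):
    """Return one estimate per repetition; None marks a repetition with no estimate."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_rep):
        a = sum(rng.random() < p_treat for _ in range(n_per_arm))    # events, treatment arm
        c = sum(rng.random() < p_control for _ in range(n_per_arm))  # events, control arm
        b, d = n_per_arm - a, n_per_arm - c                          # non-events
        if 0 in (a, b, c, d):
            estimates.append(None)  # log odds ratio is undefined with a zero cell
        else:
            estimates.append(math.log((a * d) / (b * c)))
    return estimates


estimates = simulate_log_odds_ratios()
usable = [e for e in estimates if e is not None]
n_missing = len(estimates) - len(usable)
summary = {
    "n_rep": len(estimates),
    "n_missing": n_missing,
    "proportion_missing": n_missing / len(estimates),
    "mean_estimate_among_usable": sum(usable) / len(usable),
}
print(summary)
```

Any performance metric computed from `usable` describes only the data sets for which the method produced an estimate, so a reader can judge it properly only if the number of missing results is reported alongside it.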

Non-reporting bias is another type of bias that is of concern for simulation studies. Simulation studies may remain entirely unpublished due to their findings (known as publication bias),¹¹,²⁰–²² or there may be selective non-reporting of results within individual studies (known as selective non-reporting bias).⁷,¹¹,²² Unlike other forms of research (e.g., randomised trials), there is no culture or requirement to register simulation studies or publish protocols for them, and because simulation studies (mostly) do not include patient data, they are exempt from ethical review. These factors hinder the ability to identify a sample of simulation studies before their results are known and hence to assess the risk of bias due to non-reporting.

There exist several risk of bias tools for a range of study designs such as RoB 2 for randomised trials,²³ ROBINS-I for non-randomised studies of interventions,²⁴ ROBIS for systematic reviews,²⁵ and PROBAST for prediction modelling studies.²⁶ However, we are unaware of such a tool for assessing the risk of bias in the results and unfair representation of the performance of the methods in statistical simulation studies. While reporting guidelines and recommendations for the developers of statistical simulation studies exist,²,³,¹³,¹⁵,²⁷–³⁰ these publications are not aimed at the users of simulation studies. Assessing potential for bias in findings from simulation studies is important for researchers using the findings to guide their statistical decision making, and for those undertaking evidence syntheses of statistical simulation studies. The aim of this research is therefore to develop a tool to evaluate whether statistical simulation studies provide a fair representation of how the statistical methods under investigation are expected to perform.

Methods

We aim to develop a tool to evaluate whether statistical simulation studies provide a fair representation of how statistical methods are expected to perform. Development will be based on guidance by Whiting et al.³¹ and Moher et al.,³² and informed by methods used in the development of other related tools.²⁶,³³–³⁶ We describe the activities planned at each proposed stage. The development process for the tool is shown in Figure 1.

Figure 1. Development process for the risk of bias tool.

1. Assemble the team

The project team will consist of a core working group and an international advisory group. The core group includes: Sarah Arnup (co-lead), Joanne McKenzie (co-lead), Simon Turner, and Matthew Page, based at the School of Public Health and Preventive Medicine at Monash University, and Julian Higgins (University of Bristol). The core group consists of researchers with expertise in the design, conduct and analysis of statistical simulation studies; experience in using statistical simulation studies to inform their statistical practice; knowledge of bias in different study designs; and experience developing risk of bias tools (e.g., RoB 2,²³ ROB-MEN,²² ROB-ME³⁷). The core group will be responsible for leading the tool development and undertaking the research (e.g., conducting the systematic review to inform the content of the tool, conducting and analysing the survey used to generate potential items for the tool, and organising consensus meetings).

An international advisory group will be established to provide advice to the core working group throughout the development process. The advisory group will consist of international interest holders including methodologists and statisticians with expertise in numerical simulations, and likely users of the tool (e.g., consulting statisticians, systematic reviewers of findings from simulation studies).

2. Define the scope and outline the conceptual decisions for the development of the tool

During the first virtual meeting with members of the core and advisory group, we will seek agreement on the conceptual decisions outlined in Table 3.

Table 3. Key definitions and conceptual decisions to make for defining the scope of a risk of bias tool (adapted from Whiting, 2017).³¹

Conceptual decisions to make	Considerations and examples
What is the definition of a statistical simulation study?	We will seek agreement on the definition of a statistical simulation study. The terminology used to refer to a statistical simulation study varies with discipline, and can include, for example, simulation study, Monte Carlo simulation, stochastic simulation, computer simulation, numerical simulation study, computational research and benchmark/ing study. Examples of published definitions for a statistical simulation study include: • “Simulation studies are computer experiments that involve creating data by pseudo-random sampling from known probability distributions.”² • “Simulation studies use computer intensive procedures to test particular hypotheses and assess the appropriateness and accuracy of a variety of statistical methods in relation to the known truth.”³ • Monte Carlo simulation: “Simulation that uses repeated random sampling to obtain results; the random sampling may be pseudorandom, implemented via a computer.”⁴⁹
Which applications of statistical simulation studies will be targeted by the risk of bias tool?	We propose to restrict the tool to studies evaluating statistical methods but will seek agreement on the exclusion of other uses of statistical simulation. Statistical simulation studies are used for different purposes. Common aims of a statistical simulation study, including studies that use different terminology but meet the definition of a statistical simulation study, can include both studies that *evaluate* statistical methods and studies that *apply* statistical methods (see also Table 1). Evaluation: Example aims of studies that evaluate whether a statistical method performs as intended: • To check the algebra (and code) when a new method has been derived algebraically.² • To assess the relevance of large sample theory in finite samples.²,⁵⁰ • To assess whether a method performs as expected in data where the underlying parameters are consistent with the parameters the method was designed for.²,⁵⁰ Example aims of studies that examine the robustness of a statistical method: • To understand where a well-used method breaks down, and how robust the method is to violation of the underlying parameter assumptions.²,⁵⁰ Example aims of studies that compare multiple statistical methods: • To compare the relative performance of multiple methods in the application of interest, either to demonstrate the performance improvements or other advantages of new methods (superiority), or to systematically compare existing methods (neutral comparison).³⁰ • To compare the relative performance of multiple methods in broad, practically relevant settings, either to demonstrate the performance improvements or other advantages of new methods (superiority), or to systematically compare existing methods (neutral comparison).²,¹³,¹⁹,⁵⁰ Application: Example aims of studies that apply simulation methods: • To calculate the sample size or power provided by a study design.² • To construct empirical estimations of sampling distributions, e.g., bootstrapped confidence intervals. • To provide instructional tools to help with understanding statistical concepts.³
How is risk of bias defined?	We will seek agreement on the definition of the risk of bias. There are at least two options of what could be assessed for risk of bias in a simulation study. Firstly, risk of bias of the individual results of performance metrics; that is, deviation in the numerical quantities used to estimate the performance of the methods under evaluation, from the results that would have been reached in a study with no flaws in the design, conduct or analysis. Secondly, risk of bias of the overall conclusions that the simulation study authors draw about the performance of the methods.
Will the tool consider only risk of bias of the results (internal validity) or will it also be concerned with assessing applicability (external validity) and possibly reporting quality and missing results?	We will seek agreement on whether the tool should consider only risk of bias in the simulation study results, or in addition, assess broader issues of unfair representation of the performance of the methods. Consideration of unfair representation of the methods will include an assessment of the potential for design choices to misrepresent methods, in addition to bias in the simulation results (i.e., internal validity). Examples of practices that may introduce bias into the simulation results or misrepresent the methods (presented in italics) are provided below: 1. Neutrality of the authors – When the expertise, experience and preference of the authors varies across each method under consideration, then bias may arise because one (or more) methods may be configured, coded and evaluated differently compared with the other methods under consideration, and not perform as it would in a fair comparison. 2. Blinding of the authors – When the same researchers configure, code, debug and evaluate the methods under consideration, the authors may be more motivated to investigate unexpected (or undesirable) performance in a preferred method, leading to bias in the estimates of the performance. 3. Selection of the statistical methods – When the methods under consideration in a study are not representative of the methods that are used in practice, for a given application and research question, the results may not be generalisable (or have external validity). The conclusions reached about the superiority of a new method relative to existing methods may not be replicated in subsequent studies. 4. Data-generating mechanism – When simulation scenarios are not representative of the intended application, or chosen to favour a preferred method, the results may not be generalisable (or have external validity). 5. Seed setting and random number generation – When the random number generation and seed setting procedure do not ensure appropriate independence between data sets, bias may arise in the metrics used to quantify performance of the methods under consideration. 6. Parameter tuning and software version – When the statistical methods under consideration require input parameters from the researcher, and these parameters are selected to fit the simulation scenarios in only the preferred method, or when different software versions are used for each method, bias may arise in the metrics used to quantify performance of the methods under consideration. 7. Handling of missing values – The treatment of data sets that fail to produce an outcome when analysed by the methods under consideration, e.g., due to non-convergence of the methods, may lead to bias in the metrics used to quantify performance of the methods. 8. Performance metrics – When the performance metrics are chosen to favour a preferred method, or a limited number of performance metrics are used, the results may not be generalisable (or have external validity). 9. Selective reporting – When there is selective reporting of, e.g., simulation scenarios, comparison statistical methods or performance metrics that favour a preferred method, then bias may arise in the metrics used to quantify the performance of the methods under consideration. 10. Alteration of the study design after seeing the results – When aspects of the study design, e.g., data-generating mechanism, comparison statistical methods, performance metrics, are changed after seeing the results of the performance metrics, bias may arise in the metrics used to quantify performance of the methods under consideration.
Who is the target audience?	We have decided to develop a tool for researchers who use the findings of statistical simulation studies to guide their statistical decision making (e.g., consulting statisticians), and researchers undertaking evidence syntheses of statistical simulation studies.
What type of tool structure will be adopted, e.g., simple checklist design or a domain-based approach?	We have decided to develop a domain-based tool with signalling questions, as per, for example, RoB 2²³ and ROBIS.²⁵
How will quality items be rated within the tool?	We will seek agreement on the response options for the signalling questions and the domains. As a starting point, we will consider using the response options used for other risk of bias tools; that is high/low/some concerns for the domain, and yes/no/unclear or yes/probably yes/probably no/no/no information for the signalling questions.
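The seed setting practice listed in Table 3 (item 5) can be illustrated with a short Python sketch. It contrasts one generator seeded once for the whole study with a generator that is re-seeded with the same value before every data set. The function name `simulate_sample_means`, the normal data and the sample sizes are assumptions made for this illustration only.

```python
import random
import statistics


def simulate_sample_means(n_rep, n_obs, reseed_every_repetition, seed=12345):
    """Return the sample mean of each simulated data set."""
    rng = random.Random(seed)  # seeded once for the whole study
    means = []
    for _ in range(n_rep):
        if reseed_every_repetition:
            # Flawed practice: restarting the stream makes every data set identical.
            rng = random.Random(seed)
        data = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        means.append(statistics.fmean(data))
    return means


independent = simulate_sample_means(1000, 25, reseed_every_repetition=False)
duplicated = simulate_sample_means(1000, 25, reseed_every_repetition=True)

# The empirical standard error should be close to 1 / sqrt(25) = 0.2.
se_independent = statistics.stdev(independent)
se_duplicated = statistics.stdev(duplicated)
print(round(se_independent, 3), se_duplicated)
```

With re-seeding, all 1,000 data sets are copies of one another, so the empirical standard error collapses and any performance metric built on it is misleading, even though the code runs without error.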

3. Develop the evidence base to inform domains and signalling questions

We will identify items to inform the content (e.g., domains and signalling questions) for the proposed tool by undertaking a systematic review. We will obtain evidence from four article types: statistical simulation studies (type 1); protocols for statistical simulation studies (type 2); articles that provide guidance, tutorial, commentary or evidence for unfair representation of the methods in statistical simulations (type 3); and systematic reviews that include statistical simulation studies (type 4). Details of the eligibility criteria for each article type are available in Appendix Table A1.¹

3.1. Search strategy

The Ovid MEDLINE search strategies have been iteratively developed with the assistance of an experienced information specialist (SM). We have designed a base search strategy that is highly sensitive for statistical simulation studies (see Appendix Table A2¹). This base search strategy was developed and tested using a set of 32 articles; 8 articles were obtained from a recent study examining the replicability of highly cited statistical simulation studies,³⁸ 6 articles from statistical simulation studies published in Biometrical Journal as part of the collection “Neutral comparison studies in methodological research”,³⁹ and 20 articles from a convenience sample of articles identified by early iterations of the search strategy. Because the search strategy is expected to return an infeasible number of articles to screen, we will combine the base strategy with more focussed search terms for article types 1, 2 and 4. For article type 3, we will identify articles by conducting a cited reference search (forward and backward citations) of a set of articles known to the authors. Descriptions and rationale for the search strategies, and the search syntax (where applicable), are available in Appendix Table A2.¹

3.2. Selection of studies

Citations identified from the search will be imported to Microsoft Excel (Microsoft Office LTSC Professional Plus 2021). One author (SA) will screen all abstracts against the eligibility criteria and classify them as eligible, ineligible or potentially eligible. Full-text articles will be retrieved for all abstracts classified as eligible or potentially eligible. Where eligibility is unclear, the article will be reviewed by the core group. For type 1 (statistical simulation) articles, and for reasons of feasibility, we will randomly select and screen abstracts and full-text articles until we identify 50 eligible studies. Eligible articles will be imported to EndNote X8 (Clarivate Analytics, Philadelphia) to remove duplicates.
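The protocol does not state how the random selection of type 1 articles will be implemented. One possible approach, sketched below in Python purely as an assumption, is to fix a random screening order with a recorded seed and screen in that order until the quota of 50 eligible studies is reached. The function `screen_until_quota`, the seed and the toy eligibility rule are hypothetical; in practice eligibility is a human judgement against the criteria in Appendix Table A1.

```python
import random


def screen_until_quota(citation_ids, is_eligible, quota=50, seed=2026):
    """Screen citations in a reproducible random order until `quota` are eligible."""
    order = list(citation_ids)
    random.Random(seed).shuffle(order)  # recorded seed makes the order reproducible
    eligible = []
    n_screened = 0
    for citation_id in order:
        n_screened += 1
        if is_eligible(citation_id):
            eligible.append(citation_id)
            if len(eligible) == quota:
                break
    return eligible, n_screened


# Toy stand-in for human screening: pretend every 7th record is eligible.
eligible, n_screened = screen_until_quota(range(1, 5001), lambda cid: cid % 7 == 0)
print(len(eligible), n_screened)
```

Recording the seed and the number of records screened would allow the selection to be reproduced and the proportion of eligible records to be reported.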

3.3. Data extraction and management

Data will be extracted from eligible articles using a data collection form developed in Research Electronic Data Capture (REDCap) online designer.⁴⁰,⁴¹ The core working group will pilot the data extraction form for article types 3 and 4. For article type 3, the core working group will independently extract data from the same set of articles because we anticipate subjectivity in identifying and categorising data for extraction. For article type 4, two authors, SA and another (EK, SLT, MJP or JEM), will independently extract data from a set of articles. SA will identify discrepancies from the piloting and present these at meetings for discussion. The data extraction form and guidance will be refined through this process. No piloting of article types 1 and 2 will be undertaken because the guidance developed for article types 3 and 4 will also apply to article types 1 and 2. The remainder of the articles will be extracted by one author (SA), with any uncertainties discussed with the core working group.

We will extract data on study characteristics (Appendix Table A3¹). For all article types we will identify whether the authors discuss a potential source of bias in the simulation study results; state or discuss potential flaws in the design of the study; state or discuss a design, conduct or reporting practice which signals, mitigates or allows the assessment of flaws in the design of the study; or provide empirical evidence for bias in statistical simulation studies. For each such instance we will extract the quote, write a sentence summarising the concept (which we call an item), categorise the item according to the reason for inclusion, and select the practices that could lead to the type of bias identified in the item. For example, we extract the quote: “In case of simulated data, organize a fair comparison in terms of the relation between the methods under study and the data-generating mechanisms of the simulations, with fair meaning that one should not exclusively rely on mechanisms that unilaterally favour methods which explicitly or implicitly assume that these mechanisms are in place”; summarise the concept: “The data-generating mechanism should produce data that allows a fair assessment of the performance of the methods under investigation”; categorise the quote: “Discuss a potential source of bias in the results of the simulation study”; and select the practice leading to this bias as: “Data-generating mechanism”. This latter selection is important for informing potential signalling questions.
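The four pieces of information recorded for each item can be pictured as a simple record. The sketch below is illustrative only: the class name `ExtractedItem` and its field names are assumptions and do not reproduce the REDCap form, and the quote is abbreviated.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExtractedItem:
    """One extracted item; field names are hypothetical, not the REDCap form."""

    quote: str     # verbatim text taken from the article
    summary: str   # one-sentence summary of the concept (the "item")
    category: str  # reason for inclusion
    # Practices that could lead to the type of bias identified in the item
    practices: List[str] = field(default_factory=list)


# The worked example from the text, with the quote abbreviated
item = ExtractedItem(
    quote="In case of simulated data, organize a fair comparison ...",
    summary=(
        "The data-generating mechanism should produce data that allows a fair "
        "assessment of the performance of the methods under investigation"
    ),
    category="Discuss a potential source of bias in the results of the simulation study",
    practices=["Data-generating mechanism"],
)
```

Storing the practices as a list reflects that a single item may be linked to more than one practice.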

3.4. Code and categorise items into domains and signalling questions for potential inclusion in the tool

One author (SA) will: (1) group the items by the types of practices that could lead to bias in the simulation study results (items may be grouped under multiple practices); (2) synthesise and reword similar items under each practice to create a unique set of items; (3) categorise the practices into broad domains of bias (e.g., selective outcome reporting), generalisability (e.g., choice of simulation scenarios), and an unclear category for further discussion. We will provide a definition for each domain, provide a rationale for including the domain, and give example signalling questions. This initial draft set of domains will then be distributed to the advisory group for review. Analyses of the extracted data will be undertaken in Stata version 18.0⁴² and Microsoft Excel (Microsoft Office LTSC Professional Plus 2021).
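Step (1), in which an item may sit under several practices, can be sketched as a grouping operation. This is an illustration under assumptions: the function name `group_items_by_practice` is hypothetical, and dropping exact repeats is only a crude stand-in for step (2), which requires human judgement to merge and reword similar items.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_items_by_practice(
    items: Iterable[Tuple[str, List[str]]],
) -> Dict[str, List[str]]:
    """Map each practice to the item summaries filed under it.

    Each input element is (item summary, list of practices). An item linked
    to several practices appears under each of them.
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    for summary, practices in items:
        for practice in practices:
            # Skip exact repeats so each practice holds distinct summaries
            if summary not in grouped[practice]:
                grouped[practice].append(summary)
    return dict(grouped)
```

Calling the function with `[("Item A", ["Practice 1", "Practice 2"]), ("Item B", ["Practice 1"])]` (hypothetical labels) places "Item A" under both practices and "Item B" under the first only.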

4. Hold meeting(s) to identify domains for inclusion in the tool

Following distribution of the initial set of domains, we will hold virtual consensus meeting(s) with the advisory group to discuss which domains should be retained, which require modification, and whether there are any missing domains. We will also seek feedback on the definitions of the domains. The domains and their definitions will be revised in response to this feedback and used as the basis for the subsequent survey of interest holders.

5. Conduct a survey to elicit views on proposed domains

We will undertake a survey of a broader group of interest holders to seek wider input on the domains to include in the tool. The interest holders will be selected to ensure representation from researchers with statistical simulation experience, methodologists and statisticians with expertise in statistical simulation design, and users of the research. Potential participants will be identified by members of the core and advisory groups. In addition, we will advertise the survey via relevant mailing lists (e.g., Statistical Society of Australia).

The core group will draft the survey and pilot test it with members of the advisory group. The survey will be created and distributed using Qualtrics online survey software (Qualtrics, Provo, Utah, USA. https://www.qualtrics.com). Participants will be presented with domains and their definitions, and asked to rate the importance of each domain. They will be given the opportunity to provide comments on the domains, definitions and whether there are any missing domains. We will also collect brief demographic information and ask whether participants would be willing to pilot test the tool.

We will calculate summary statistics to quantify participants’ views of the importance of the domains. Responses to the open-ended question will be summarised, keeping all unique ideas (regardless of the frequency with which they were made).
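The kind of summary statistics involved can be sketched as below. The protocol does not specify the rating scale or the statistics to be reported, so everything here is an assumption: an integer scale from 1 to 5, a rating of 4 or above counted as "important", and the function name `summarise_ratings`.

```python
from statistics import median
from typing import Dict, List


def summarise_ratings(
    ratings: Dict[str, List[int]], important_from: int = 4
) -> Dict[str, Dict[str, float]]:
    """Summarise importance ratings per domain.

    Assumes a 1-5 integer scale where scores >= `important_from` count as
    "important"; both are illustrative choices, not taken from the protocol.
    """
    summary: Dict[str, Dict[str, float]] = {}
    for domain, scores in ratings.items():
        if not scores:
            continue  # no responses for this domain, nothing to summarise
        n_important = sum(score >= important_from for score in scores)
        summary[domain] = {
            "n": len(scores),
            "median": median(scores),
            "pct_important": 100 * n_important / len(scores),
        }
    return summary
```

A low `pct_important` would flag a domain for closer discussion at a later consensus meeting.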

Ethics approval for the survey will be sought from the Monash University Human Research Ethics Committee.

6. Conduct consensus meetings with advisory group to agree domains

We will hold virtual consensus meeting(s) to discuss the survey results with the advisory group. Each domain will be introduced by a core group member, together with the summary statistics of the importance rating and any major comments. We will focus our discussion on domains that the survey participants have rated as not important, and on any additional domains suggested by the participants. The core group will revise the domains in response to these meetings.

7. Draft the tool and guidance

The core group will prepare a list of potential signalling questions, elaborations and response options for each domain. The wording for the signalling questions will be informed by the unique items identified within each practice from the systematic review (step 3). The core group will seek feedback from the advisory group, revise the domains, signalling questions, and other content, and continue this process until major concerns are addressed.

8. Pilot and refine the tool

Two reviewers will pilot the draft tool on a random sample of the statistical simulation studies (article type 1) identified in the systematic review. The reviewers will be identified from among the survey respondents who indicated a willingness to pilot the tool. We will ask the reviewers to record any issues with interpreting or applying each signalling question, or with the accompanying elaborations. Issues identified with domains and signalling questions will be discussed by the core group, and the tool refined accordingly. In addition, we will invite members of the advisory group to provide feedback. The core group will discuss and further refine any problematic signalling questions before finalising the tool.

9. Disseminate the developed tool and guidance documentation

A paper describing the tool will be published in an open-access format. The tool and guidance document will be made available on a website housing risk of bias tools (https://www.riskofbias.info/). The tool will be disseminated via presentations and workshops at relevant conferences, via social media, and in a series of international webinars.

Study status

As of 3 June 2026, we have run the searches and screened, identified and extracted data from 44 statistical simulation studies; 47 protocols for statistical simulation studies; 56 articles that provide guidance, tutorial, commentary or evidence for unfair representation of the methods in statistical simulations; and 42 systematic reviews that include statistical simulation studies. We have commenced grouping the extracted items by practices that could lead to bias and synthesising similar items under each practice.

Discussion

We plan to develop a tool to assess the potential for bias in the results of statistical simulation studies and unfair representation of the performance of statistical methods in statistical simulation studies. While there are reporting and conduct guidelines available for statistical simulation studies,²,³,¹³,¹⁵,²⁷–³⁰ to our knowledge, this will be the first tool for systematically evaluating whether a statistical simulation study presents a fair representation of how the statistical methods under investigation are expected to perform. The developed tool is intended to assist researchers in determining which statistical methods to use in their statistical decision making, and inform evidence syntheses of statistical simulation studies.

To ensure the tool is relevant and useful to the end-users of statistical simulation studies, the tool will be developed through co-design with end-users, who are part of the core working group and the international advisory group, and whose views will be sought through the survey and piloting processes. To ensure a comprehensive list of possible domains is considered when developing the tool, we will first undertake a systematic review of the statistical simulation literature to identify domains. Recognising the challenge of identifying studies from this literature, we have involved an information specialist to develop the search strategies and utilise cited reference searches (forward and backward citations) of key methodological articles. In addition, the identified list of domains will be supplemented by input from the international advisory group and survey respondents.

Data Availability

Underlying data

No data are associated with this article.

Extended data

Open Science Framework (OSF): Development of a tool to assess the risk of bias in statistical simulation studies: study protocol. https://doi.org/10.17605/OSF.IO/FWVBC ⁴³

This project contains the following extended data:

• Appendix Table A1: Eligibility criteria for the different article types
• Appendix Table A2: Description and rationale for the search strategies
• Appendix Table A3: Data extraction items

Reporting guidelines

Open Science Framework (OSF): PRISMA-P Checklist for ‘Development of a tool to assess the risk of bias in statistical simulation studies: study protocol’ https://doi.org/10.17605/OSF.IO/FWVBC ⁴³

Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

References

1. Arnup SJ, McKenzie JE, Korevaar E, et al.: Development of a Tool to Assess the Risk of Bias in Statistical Simulation Studies. OSF. 2026 [June 3].
2. Morris TP, White IR, Crowther MJ: Using simulation studies to evaluate statistical methods. Stat Med. 2019 May 20; 38(11): 2074–102. Epub 20190116.
3. Burton A, Altman DG, Royston P, et al.: The design of simulation studies in medical statistics. Stat Med. 2006 Dec 30; 25(24): 4279–4292.
4. Thompson JA, Leyrat C, Fielding KL, et al.: Cluster randomised trials with a binary outcome and a small number of clusters: comparison of individual and cluster level analysis method. BMC Med Res Methodol. 2022 Aug 22. Epub 20220812.
5. Hemming K, Thompson J, Kristunas C, et al.: The performance of small sample correction methods for controlling type I error when analyzing parallel cluster randomized trials: a systematic review of simulation studies. J Clin Epidemiol. 2025 May 30; 185: 111838. Epub 20250530.
6. Boutron I, Page MJ, Higgins JPT, et al.: Chapter 7: Considering bias and conflicts of interest among the included studies. 2023. Cochrane Handbook for Systematic Reviews of Interventions version 6.4 (updated August 2023). Cochrane; 2023.
7. Jelizarow M, Guillemot V, Tenenhaus A, et al.: Over-optimism in bioinformatics: an illustration. Bioinformatics. 2010 Aug 15; 26(16): 1990–8. Epub 20100626.
8. Pawel S, Kook L, Reeve K: Pitfalls and potentials in simulation studies: Questionable research practices in comparative simulation studies allow for spurious claims of superiority of any method. Biom J. 2023 Mar 8; e2200091. Epub 20230308.
9. Ullmann T, Peschel S, Finger P, et al.: Over-optimism in unsupervised microbiome analysis: Insights from network learning and clustering. PLoS Comput Biol. 2023 Jan; 19:
10. Nießl C, Herrmann M, Wiedemann C, et al.: Over-optimism in benchmark studies and the multiplicity of design and analysis options when interpreting their results. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 2022; 12(2): e1441.
11. Lohmann A, Astivia OLO, Morris TP, et al.: It’s time! Ten reasons to start replicating simulation studies. Front Epidemiol. 2022; 2(973470).
12. Boulesteix AL, Wilson R, Hapfelmeier A: Towards evidence-based computational statistics: lessons from clinical research on the role and design of real-data benchmark studies. BMC Med Res Methodol. 2017 Sep 17. Epub 20170909.
13. Weber LM, Saelens W, Cannoodt R, et al.: Essential guidelines for computational method benchmarking. Genome Biol. 2019 Jun 20; 20(1): 125. Epub 20190620.
14. Niessl C, Hoffmann S, Ullmann T, et al.: Explaining the optimistic performance evaluation of newly proposed methods: A cross-design validation experiment. Biom J. 2023 Mar 31; 66: e2200238. Epub 20230331.
15. White IR, Pham TM, Quartagno M, et al.: How to check a simulation study. Int J Epidemiol. 2024 Feb 1; 53(1).
16. Boulesteix AL: Ten simple rules for reducing overoptimistic reporting in methodological computational research. PLoS Comput Biol. 2015 Apr; 11(4): e1004191. Epub 20150423.
17. Boulesteix AL: Over-optimism in bioinformatics research. Bioinformatics. 2010 Feb 1; 26(3): 437–439. Epub 20091126.
18. Buchka S, Hapfelmeier A, Gardner PP, et al.: On the optimistic performance evaluation of newly introduced bioinformatic methods. Genome Biol. 2021 May 22. Epub 20210511.
19. Boulesteix AL, Lauer S, Eugster MJ: A plea for neutral comparison studies in computational sciences. PLoS One. 2013; 8. Epub 20130424.
20. Boulesteix AL, Stierle V, Hapfelmeier A: Publication Bias in Methodological Computational Research. Cancer Inform. 2015; 14s5(Suppl 5): CIN.S30747–9. Epub 20151015.
21. Boulesteix AL, Binder H, Abrahamowicz M, et al.: Simulation Panel of the Stratos Initiative. On the necessity and design of studies comparing statistical methods. Biom J. 2018 Jan; 60(1): 216–218. Epub 20171129.
22. Chiocchia V, Nikolakopoulou A, Higgins JPT, et al.: ROB-MEN: a tool to assess risk of bias due to missing evidence in network meta-analysis. BMC Med. 2021 Nov 23; 19(1): 304. Epub 20211123.
23. Sterne JAC, Savovic J, Page MJ, et al.: RoB 2: a revised tool for assessing risk of bias in randomised trials. BMJ. 2019 Aug 28; 366: l4898. Epub 20190828.
24. Sterne JA, Hernan MA, Reeves BC, et al.: ROBINS-I: a tool for assessing risk of bias in non-randomised studies of interventions. BMJ. 2016 Oct 12; 355: i4919. Epub 20161012.
25. Whiting P, Savovic J, Higgins JP, et al.: ROBIS: A new tool to assess risk of bias in systematic reviews was developed. J Clin Epidemiol. 2016 Jan; 69: 225–234. Epub 20150616.
26. Wolff RF, Moons KGM, Riley RD, et al.: PROBAST: A Tool to Assess the Risk of Bias and Applicability of Prediction Model Studies. Ann Intern Med. 2019 Jan 1; 170(1): 51–58.
27. Kelter R: The Bayesian simulation study (BASIS) framework for simulation studies in statistical and methodological research. Biom J. 2024 Jan; 66: Epub 20230115.
28. Siepe BS, Bartos F, Morris TP, et al.: Simulation studies for methodological research in psychology: A standardized template for planning, preregistration, and reporting. Psychol Methods. 2024 Nov 14. Epub 20241114.
29. Williams C, Yang Y, Lagisz M, et al.: Transparent reporting items for simulation studies evaluating statistical methods: Foundations for reproducibility and reliability. Methods in Ecology and Evolution. 2024; 15(11): 1926–1939.
30. Boulesteix AL, Groenwold RH, Abrahamowicz M, et al.: Introduction to statistical simulations in health research. BMJ Open. 2020 Dec 10; e039921. Epub 20201213.
31. Whiting P, Wolff R, Mallett S, et al.: A proposed framework for developing quality assessment tools. Syst Rev. 2017 Oct 6. Epub 20171017.
32. Moher D, Schulz KF, Simera I, et al.: Guidance for developers of health research reporting guidelines. PLoS Med. 2010 Feb 16; 7(2): e1000217. Epub 20100216.
33. Whiting PF, Rutjes AW, Westwood ME, et al.: QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011 Oct 18; 155(8): 529–536.
34. Whiting P, Davies P, Savović J, et al.: Evidence to inform the development of ROBIS, a new tool to assess the risk of bias in systematic reviews. 2013 September.
35. Page MJ, McKenzie JE, Bossuyt PM, et al.: Updating guidance for reporting systematic reviews: development of the PRISMA 2020 statement. J Clin Epidemiol. 2021 Jun; 134: 103–112. Epub 20210209.
36. Hemming K, Thompson JY, Hooper RL, et al.: Guidelines for the content of statistical analysis plans in clinical trials: protocol for an extension to cluster randomized trials. Trials. 2025 Feb 27; 26(1): 72. Epub 20250227.
37. Page MJ, Sterne JAC, Boutron I, et al.: ROB-ME: a tool for assessing risk of bias due to missing evidence in systematic reviews with meta-analysis. BMJ. 2023 Nov 20; 383: e076754. Epub 20231120.
38. Luijken K, Lohmann A, Alter U, et al.: Replicability of Simulation Studies for the Investigation of Statistical Methods: The RepliSims Project. R Soc Open Sci. 2024 Jan 17; 11(1): 231003.
39. Boulesteix AL, Baillie M, Edelmann D, et al.: Editorial for the special collection "Towards neutral comparison studies in methodological research". Biom J. 2024 Mar; 66(2): e2400031.
40. Harris PA, Taylor R, Minor BL, et al.: The REDCap consortium: Building an international community of software platform partners. Journal of Biomedical Informatics. 2019; 95: 103208.
41. Harris PA, Taylor R, Thielke R, et al.: Research electronic data capture (REDCap)—A metadata-driven methodology and workflow process for providing translational research informatics support. Journal of Biomedical Informatics. 2009; 42: 377–381.
42. StataCorp: Stata: Release 18. College Station, TX: StataCorp LLC; 2023.
43. Arnup SJ, Turner SL, Korevaar E, et al.: Development of a tool to assess the risk of bias in statistical simulation studies. OSF. 2026 [June 12].
44. Li Y, Tu W: An additive-multiplicative model for longitudinal data with informative observation times. Stat Methods Med Res. 2024 May; 33(5): 807–824. Epub 20240408.
45. Abbas-Aghababazadeh F, Xu W, Haibe-Kains B: The impact of violating the independence assumption in meta-analysis on biomarker discovery. Front Genet. 2022; 13: 1027345. Epub 20230104.
46. Jiang M, Lee S, O'Malley AJ, et al.: A novel causal mediation analysis approach for zero-inflated mediators. Stat Med. 2023 Jun 15; 42(13): 2061–2081. Epub 20230418.
47. Cho E: The accuracy of reliability coefficients: A reanalysis of existing simulations. Psychol Methods. 2024 Apr; 29(2): 331–349. Epub 20220127.
48. Turner SL, Forbes AB, Karahalios A, et al.: Evaluation of statistical methods used in the analysis of interrupted time series studies: a simulation study. BMC Med Res Methodol. 2021 Aug 21. Epub 20210828.
49. O'Kelly M, Anisimov V, Campbell C, et al.: Proposed best practice for projects that involve modelling and simulation. Pharm Stat. 2017 Mar; 16(2): 107–113. Epub 20161103.
50. Heinze G, Boulesteix AL, Kammer M, et al.: Simulation Panel of the Stratos Initiative. Phases of methodological research in biostatistics-Building the evidence base for new methods. Biom J. 2023 Feb 3; 66: e2200222. Epub 20230203.

Footnotes

[1] Bias in this context refers to the difference between the expected value of the estimator for the estimand and the true value of the estimand.
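In symbols, writing $\hat{\theta}$ for the estimator and $\theta$ for the true value of the estimand (notation assumed here for illustration, not taken from the protocol), this definition reads:

```latex
\operatorname{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta
```

In a simulation study with $n_{\mathrm{sim}}$ repetitions, this quantity is commonly estimated by $\frac{1}{n_{\mathrm{sim}}}\sum_{i=1}^{n_{\mathrm{sim}}} \hat{\theta}_i - \theta$, where $\hat{\theta}_i$ is the estimate from the $i$-th simulated data set.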

