Artificial intelligence tools for automating assessments of reporting guideline adherence: a protocol for a systematic review

Minyan Zeng; Shiwei Liu; David PQ Clark; Steve McDonald; Evan Mayo-Wilson; Xiangji Ying; Joe Menke; Mengfei Lan; Lan Jiang; Kiran Ninan; Jean-Pierre Oberste; Joanne E McKenzie; Halil Kilicoglu; Matthew J Page

doi:10.12688/f1000research.179775.1

Study Protocol

[version 1; peer review: 2 approved with reservations]

Minyan Zeng¹, Shiwei Liu², David PQ Clark¹, Steve McDonald¹, Evan Mayo-Wilson³, Xiangji Ying³, Joe Menke², Mengfei Lan², Lan Jiang², Kiran Ninan³, Jean-Pierre Oberste³, Joanne E McKenzie¹, Halil Kilicoglu², Matthew J Page¹

PUBLISHED 28 Apr 2026

Author details

¹ Methods in Evidence Synthesis Unit, Monash University School of Public Health and Preventive Medicine, Melbourne, Victoria, Australia
² School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, USA
³ Department of Epidemiology, University of North Carolina Gillings School of Global Public Health, Chapel Hill, USA

Minyan Zeng
Roles: Methodology, Writing – Original Draft Preparation

Shiwei Liu
Roles: Methodology, Writing – Review & Editing

David PQ Clark
Roles: Methodology, Writing – Review & Editing

Steve McDonald
Roles: Methodology, Writing – Review & Editing

Evan Mayo-Wilson
Roles: Methodology, Writing – Review & Editing

Xiangji Ying
Roles: Methodology, Writing – Review & Editing

Joe Menke
Roles: Methodology, Writing – Review & Editing

Mengfei Lan
Roles: Methodology, Writing – Review & Editing

Lan Jiang
Roles: Methodology, Writing – Review & Editing

Kiran Ninan
Roles: Methodology, Writing – Review & Editing

Jean-Pierre Oberste
Roles: Methodology, Writing – Review & Editing

Joanne E McKenzie
Roles: Methodology, Writing – Review & Editing

Halil Kilicoglu
Roles: Methodology, Writing – Review & Editing

Matthew J Page
Roles: Conceptualization, Methodology, Writing – Review & Editing

This article is included in the Research on Research, Policy & Culture gateway.

Abstract

Background

Complete reporting of health-related research is necessary for users to understand, appraise, and apply research results appropriately. Reporting guidelines have been developed to support complete reporting. However, assessments of reporting guideline adherence remain inconsistent, time-consuming, and difficult to scale. Artificial intelligence (AI) tools, such as traditional natural language processing models and large language models, might provide a solution. While numerous AI tools have been developed, no comprehensive synthesis has been undertaken to investigate what they assess, how they are implemented and perform, and their potential utility.

Objective

This systematic review aims to synthesise the characteristics and findings of studies evaluating AI tools developed to assist or automate assessments of reporting guideline adherence.

Methods

We will search MEDLINE, Embase, Scopus, Europe PMC, ACM Digital Library, IEEE Xplore, arXiv and Cochrane Colloquium Abstracts, with no restrictions on date, language, or publication type. We will include studies that evaluate AI tools to assess adherence of health-related papers to any reporting guidelines. Two authors will independently screen records, extract data and assess risk of bias. We will extract study characteristics, AI tool details, how reporting guidelines are operationalised for AI assessment, AI implementation details, comparison details, and evaluation outcomes including agreement metrics, classification performance metrics, and utility indicators. We will present and summarise results through structured tables and plots, stratified by reporting guideline and AI tool type.

Discussion

This systematic review will provide a comprehensive synthesis of AI tools developed to automate assessments of reporting guideline adherence. It will provide interest holders with insights into what AI tools have been used, their implementation approaches, which AI tool types perform well, and any improvements that can be made to AI tools automating assessments of reporting guideline adherence in the future.

Keywords

Reporting guidelines, Artificial intelligence, Adherence

Corresponding author: Minyan Zeng

Competing interests: No competing interests were disclosed.

Grant information: This research was supported by a Monash University Early Career Research Excellence Program (ECREP) grant. MJP is supported by a National Health and Medical Research Council Investigator Grant (GNT2033917). JEM is supported by a National Health and Medical Research Council Investigator Grant (GNT2009612). The funders had no role in the study design, decision to publish, or preparation of the manuscript.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2026 Zeng M et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Zeng M, Liu S, Clark DP et al. Artificial intelligence tools for automating assessments of reporting guideline adherence: a protocol for a systematic review [version 1; peer review: 2 approved with reservations]. F1000Research 2026, 15:626 (https://doi.org/10.12688/f1000research.179775.1) First published: 28 Apr 2026, 15:626 (https://doi.org/10.12688/f1000research.179775.1) Latest published: 28 Apr 2026, 15:626 (https://doi.org/10.12688/f1000research.179775.1)

Introduction

Complete reporting of health-related research is necessary for users to understand, appraise, and apply research results appropriately. Reporting guidelines provide recommendations on what should be reported and why, and include exemplars of complete reporting to guide authors and other interest holders (e.g., peer reviewers, editors).¹ Reporting guidelines have been developed for different types of research, such as PRISMA (preferred reporting items for systematic reviews and meta-analyses) for systematic reviews,² CONSORT (consolidated standards of reporting trials) for randomised trials,³ TRIPOD (transparent reporting of a multivariable prediction model for individual prognosis or diagnosis) for prediction models,⁴ STROBE (strengthening the reporting of observational studies in epidemiology) for observational studies,⁵ and STARD (standards for reporting of diagnostic accuracy studies) for diagnostic studies.⁶ Many of these “core” reporting guidelines have multiple extensions that provide additional reporting recommendations for specific aspects not covered in the core statement (e.g., types of outcomes, specific designs, analytic methods).

Assessments of reporting guideline adherence have routinely been performed manually by authors, editors, and reviewers, who judge whether reporting recommendations have been met. Because reporting guidelines do not specify criteria for evaluating adherence, researchers have had to develop their own assessment criteria and methods.⁷,⁸ Researchers must also decide whether to assess all checklist items/recommendations or only a subset, and meta-research studies suggest that most have chosen to focus on selected items.⁷,⁹,¹⁰ These decisions have led to considerable variability in what is assessed and how it is assessed.⁷,⁹,¹⁰ Manual evaluation is also time-consuming and resource-intensive.¹¹ Consequently, research questions such as which characteristics (e.g., time, discipline, journal) predict better or worse reporting are difficult to address across a large body of literature. More efficient, consistent, and scalable methods are therefore needed.

Artificial intelligence (AI), defined as computational systems capable of performing tasks that typically require human intelligence, such as learning, reasoning, and decision-making, might provide a solution. Early attempts to automate assessments of reporting guideline adherence relied on traditional natural language processing (NLP) models. Examples include CONSORT-NLP,¹² which combines rule-based and machine learning-based approaches to automatically complete the CONSORT checklist from randomised clinical trial reports, and SPIRIT-CONSORT-TM,¹³ an annotated corpus designed to train NLP models to automatically assess adherence to reporting recommendations in clinical trial protocols and results publications. However, these traditional NLP systems generally require substantial guideline-specific annotated datasets for development and are applicable only to the particular guideline for which they were designed. Moreover, most systems focus on detecting local text segments, which could limit their utility for end-to-end evaluation of long research publications with multimodal components (e.g., text, tables, and figures).

The advent of large language models (LLMs) and vision language models (VLMs), such as GPT and Gemini, provides another opportunity to scale up assessments of reporting guideline adherence. Trained on extensive data from articles, books and other online sources,¹⁴ these models are capable of processing complex data components, extracting information, summarising evidence, and generating outputs that are relevant to reporting guideline items. Several studies have used these models to assess reporting guideline adherence.¹⁵–¹⁷

However, the outputs of LLMs and VLMs are sensitive to how they are implemented. Data preprocessing, prompts, and model inference settings might all influence model performance on specific tasks. For example, empirical work has shown that different prompt templates and formatting can substantially influence LLM outputs, though more advanced models (e.g., GPT-4 compared with GPT-3.5-turbo) may be more robust to such variations.¹⁸ More importantly, because assessment criteria and methods for evaluating adherence vary, researchers might use different prompts that ask subtly different questions about reporting guideline items (e.g., whether a guideline item is reported, or whether it is reported adequately or fully). Additionally, even with identical prompts, fixed model parameters, and a fixed random seed, models may occasionally generate different outputs across runs due to hardware-level randomness, which makes strict reproducibility difficult to achieve. The “black-box” nature of these models also limits transparency in the process of decision-making, and model hallucinations, although an area of active improvement, may also challenge reliability in high-stakes fields such as health-related research.

While numerous AI systems and prototypes have been developed to automate assessment of reporting guideline adherence,¹¹–¹³,¹⁵–¹⁷ no comprehensive synthesis has been undertaken to investigate what they assess, how they are implemented and perform, and their potential utility in research and publication workflows.

Objective

This systematic review aims to summarise and synthesise the characteristics and findings of studies evaluating AI tools developed to assist or automate assessments of reporting guideline adherence.

Methods

We have reported this protocol in accordance with the Preferred Reporting Items for Systematic Review and Meta-Analysis Protocols (PRISMA-P) statement¹⁹ and with consideration of the methods items in the more recent PRISMA 2020 statement.² We have not registered the review.

Eligibility criteria

• Study designs
We will include studies of any design that evaluate the performance of AI tools developed to assess adherence of health-related research papers to reporting guidelines. Eligible study designs include diagnostic accuracy studies, validation studies, and trials comparing AI tool and human performance, as well as methodological studies comparing different AI approaches. Studies will be included regardless of language, publication date, or publication type (e.g., journal article, conference proceeding).
• Reporting guidelines
We will include studies regardless of the reporting guideline evaluated, such as PRISMA, CONSORT, TRIPOD, STROBE, and STARD, and any of their extensions. By “reporting guideline”, we mean any document presenting reporting items that should appear in a research paper (regardless of whether presented as a checklist or structured text) and in which the authors explain how the items were developed.²⁰
• AI tools and comparator
We will include any AI application, tool, or algorithm that (i) makes judgements about reporting guideline adherence, or (ii) identifies relevant text about reporting guideline adherence in a paper without making a judgement about adherence. Eligible systems could include any models that learn patterns from text, with or without imaging data, in the research papers, such as traditional natural language processing models (e.g., rule-based and BERT-like models) as well as LLMs and VLMs (e.g., GPT-5.2 and Gemini 3). We will include studies that compare AI tools with human assessment and studies that compare multiple AI tools with each other. Studies without an explicit comparator will also be eligible.
• Outcomes
We will include studies regardless of the outcomes assessed or reported. Outcomes of interest to this review include: (i) agreement (overall and for each item/recommendation) between the AI tool and human assessors, measured using raw and chance-corrected agreement metrics (e.g., Cohen’s kappa); (ii) classification performance (overall and/or for each item/recommendation), as determined using metrics such as accuracy, F1 score, sensitivity, specificity, positive and negative predictive values, and c-statistic; and (iii) utility indicators (e.g., task completion time, computational/API cost, and token usage across papers).
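
As an illustration of the agreement and classification metrics named above, the following is a minimal Python sketch. It assumes binary item-level judgements, with the human assessment treated as the reference standard. The labels (`"reported"`, `"not reported"`), the function names, and the example data are hypothetical and are not taken from the protocol or from any included study.

```python
from collections import Counter
from math import nan


def cohens_kappa(ai, human):
    """Chance-corrected agreement between two raters judging the same items."""
    if len(ai) != len(human) or not ai:
        raise ValueError("both raters must judge the same non-empty set of items")
    n = len(ai)
    # Observed (raw) agreement: share of items with identical judgements.
    p_observed = sum(a == h for a, h in zip(ai, human)) / n
    # Agreement expected by chance, given each rater's category frequencies.
    ai_counts, human_counts = Counter(ai), Counter(human)
    p_expected = sum(ai_counts[c] * human_counts[c] for c in ai_counts) / n**2
    if p_expected == 1:
        return nan  # kappa is undefined when only one category is ever used
    return (p_observed - p_expected) / (1 - p_expected)


def classification_metrics(ai, human, positive="reported"):
    """Binary classification metrics with the human judgement as reference."""
    pairs = list(zip(ai, human))
    tp = sum(a == positive and h == positive for a, h in pairs)
    fp = sum(a == positive and h != positive for a, h in pairs)
    fn = sum(a != positive and h == positive for a, h in pairs)
    tn = sum(a != positive and h != positive for a, h in pairs)

    def ratio(numerator, denominator):
        # Return NaN rather than raising when a metric is undefined.
        return numerator / denominator if denominator else nan

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical judgements for ten checklist items in one paper.
human = ["reported"] * 6 + ["not reported"] * 4
ai = ["reported"] * 5 + ["not reported"] * 4 + ["reported"]

print("raw agreement:", sum(a == h for a, h in zip(ai, human)) / len(ai))
print("Cohen's kappa:", round(cohens_kappa(ai, human), 3))
print(classification_metrics(ai, human))
```

In this example the raw agreement (0.8) is noticeably higher than the chance-corrected value, which is one reason both kinds of metric are of interest when comparing studies.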

Search methods

We will search bibliographic databases and supplementary sources for eligible studies. Databases include MEDLINE (via Ovid), Embase (via Ovid), Scopus, Europe PMC, ACM Digital Library, and IEEE Xplore. We will not limit searches by date, language, publication status or publication format (except for Europe PMC, which will be restricted to preprints). Europe PMC will be used to search across several preprint servers (e.g., medRxiv, bioRxiv, preprints.org, SSRN), and we will also search the arXiv preprint server, as it is not comprehensively covered by Europe PMC. Additional sources include the abstracts of the Cochrane Colloquium. The final part of the search will involve manual backward citation tracking and forward citation tracking using LENS.org for all studies included in the review.

An experienced information specialist (SM) designed the search strategies with input from the review team. The search includes terms related to the concepts of AI, adherence, and reporting. Several seed articles (based on articles known to the review team)¹¹,¹³,¹⁵–¹⁷,²¹–²⁴ were used to develop the MEDLINE search. The MEDLINE search was then translated and adapted for use in the other sources. The search strategy was iteratively tested to achieve an optimal balance between recall and precision. Full search strategies are available as Extended data (see Data availability section).²⁵

Study selection

All records will first be deduplicated using the built-in functions of the reference management tools we will use (i.e., EndNote and Covidence). Two reviewers (out of MZ, SL, DPQC, JM, ML, LJ, KN, JO) will then independently screen all titles and abstracts. Records considered eligible or uncertain by either reviewer will undergo full-text screening, in which the same two reviewers will independently assess the full text. Any disagreements will be resolved by discussion or by consulting a third reviewer. Title and abstract screening of bibliographic database records will be conducted using Covidence. For arXiv and Cochrane Colloquium Abstracts, a screening form will be created in Microsoft Excel with the link for each record and the search date.
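
The title and abstract screening rule described above (a record advances if either reviewer marks it eligible or uncertain) can be sketched in a few lines of Python. This is only an illustration: the decision labels and the function name are hypothetical, and in practice the rule would be applied within Covidence or the Excel screening form rather than in code.

```python
# Hypothetical decision labels; the protocol does not prescribe exact wording.
ADVANCING = {"eligible", "uncertain"}
VALID = ADVANCING | {"ineligible"}


def advances_to_full_text(decision_a: str, decision_b: str) -> bool:
    """Return True if a record should undergo full-text screening."""
    for decision in (decision_a, decision_b):
        if decision not in VALID:
            raise ValueError(f"unknown screening decision: {decision!r}")
    # A single 'eligible' or 'uncertain' vote is enough to advance the record.
    return decision_a in ADVANCING or decision_b in ADVANCING


# A record is excluded at this stage only when both reviewers judge it ineligible.
print(advances_to_full_text("ineligible", "uncertain"))
print(advances_to_full_text("ineligible", "ineligible"))
```

The rule is deliberately inclusive: at this stage, disagreement or uncertainty sends a record forward rather than excluding it.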

Data extraction

Two reviewers (out of MZ, SL, DPQC) will independently conduct the data extraction using a data extraction form (available as Extended data; see Data availability section).²⁵ The data extraction form will be piloted by reviewers on a sample of included studies prior to the full data extraction process. Any discrepancies in the data collected between the two reviewers will be resolved via discussion or by consulting with a third reviewer (MJP or JEM). Data extraction will be conducted using a data extraction tool (REDCap version 15.5.30).²⁶ Where necessary and available, additional sources will be consulted to supplement information extracted from the included studies, such as published study protocols, registry entries, or primary dataset documentation. If information remains missing or unclear, we will contact the study authors for further information. The information that will be extracted from each included study is provided in Table 1 (available as Extended data; see Data availability section).²⁵

Quality assessment of included studies

To evaluate the quality of the included studies, two reviewers (out of MZ, SL, DPQC) will independently apply a defined set of quality indicators. These indicators are informed by two established tools that offer relevant concepts for assessing AI tools: PROBAST+AI²⁷ and the tool used in a living systematic review of AI tools for risk of bias assessment.²⁸ The quality indicators will cover the following domains:

• AI tool development
Whether the AI tool was developed rigorously (e.g., adequate model training and prompt engineering).
• Reference standard
Whether the reference standard assessment was conducted rigorously (e.g., performed by trained assessors, assessed by at least two assessors independently with consensus procedures in place).
• Independence of assessments and risk of data leakage
Whether the AI tool was applied to the studies without knowledge of the reference standard assessment and vice versa; whether the AI tool’s final performance was evaluated on an independent test set that was not used for model training or prompt development/refinement; and whether there was a low risk that the annotations of the test corpus were part of the AI model’s training data.
• Study planning
Whether the study was based on a publicly available protocol or registration record.

Each indicator will be judged as low quality, high quality, or unclear quality. The quality assessment form is available as Extended data (see Data availability section).²⁵ A study will be deemed high quality overall if all quality indicators are deemed high quality, low quality overall if at least one indicator is deemed low quality, and unclear quality overall if at least one indicator is deemed unclear quality but none is deemed low quality. Disagreements between reviewers will be resolved through discussion or adjudication by a third reviewer.
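
The rule for deriving an overall quality judgement from the indicator-level judgements can be written as a short Python sketch. The function name and the rating labels (`"high"`, `"low"`, `"unclear"`) are hypothetical shorthand for the three judgements described above.

```python
VALID_RATINGS = {"high", "low", "unclear"}


def overall_quality(indicator_ratings):
    """Combine indicator-level ratings into one overall study rating."""
    ratings = list(indicator_ratings)
    if not ratings:
        raise ValueError("at least one indicator rating is required")
    unknown = set(ratings) - VALID_RATINGS
    if unknown:
        raise ValueError(f"unknown rating(s): {sorted(unknown)}")
    # Any low-quality indicator makes the study low quality overall.
    if "low" in ratings:
        return "low"
    # No low ratings, but at least one unclear rating.
    if "unclear" in ratings:
        return "unclear"
    # Every indicator was rated high quality.
    return "high"


print(overall_quality(["high", "unclear", "low", "high"]))
print(overall_quality(["high", "unclear", "high", "high"]))
print(overall_quality(["high", "high", "high", "high"]))
```

Checking for a low rating first reflects the precedence stated above: a single low-quality indicator determines the overall judgement even when other indicators are unclear.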

Data syntheses and analyses

Given the anticipated diversity in AI tools, reporting guidelines, study designs, and outcome measures, formal meta-analysis is unlikely to be feasible across all outcomes. We will therefore present and summarise results of each of the included studies through structured tables and plots.

We will use structured tables to present study characteristics, reporting guidelines assessed and scope, dataset characteristics, dataset sources and formats, reference annotation for datasets, AI tool details, application of the AI tool, AI implementation details, and comparison details. Tables will be organised by reporting guideline evaluated, and then by the type of AI tool (traditional NLP models versus LLM-based/VLM-based models).

We will then present AI tool performance and utility findings in tables organised by the reporting guideline evaluated, stratified by the type of AI tool and each outcome category (i.e., classification performance metrics, agreement metrics, and utility indicators). Where multiple metrics are reported within the same outcome category, we will extract pre-specified metrics as detailed in the data extraction form (available as Extended data; see Data availability section).²⁵ Where preferred metrics are unavailable, we will consider and note the alternative metrics reported by the study authors. We will summarise outcomes at the overall level using descriptive statistics (e.g., mean, median, range across items) and also present the overall results in forest plots, stratified by reporting guideline and AI model. When item-level/recommendation-level outcomes are also available (e.g., classification performance metrics of adherence for each PRISMA item), we will summarise specific item-level results using pre-specified rules to facilitate interpretation of performance, including the items with high and low performance (e.g., top and bottom five items for agreement metrics, accuracy, and F1 score). When multiple results are available for the same outcome across training, validation, and test datasets, we will extract and summarise the results identified by study authors as primary and/or the results from the most representative evaluation setting. In this circumstance, we will note that multiple results are available and our reason for selecting the reported result.
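
A minimal Python sketch of the descriptive item-level summary described above follows. It assumes one metric value per guideline item; the function name, item labels, and scores are hypothetical, and the protocol's own pre-specified rules are those in the data extraction form.

```python
from statistics import mean, median


def summarise_item_scores(item_scores, k=5):
    """Summarise one metric (e.g., F1 score) reported per guideline item."""
    if not item_scores:
        raise ValueError("at least one item-level score is required")
    values = list(item_scores.values())
    # Sort items from lowest to highest score.
    ascending = sorted(item_scores.items(), key=lambda pair: pair[1])
    return {
        "mean": mean(values),
        "median": median(values),
        "range": (min(values), max(values)),
        "bottom": ascending[:k],       # the k lowest-scoring items
        "top": ascending[::-1][:k],    # the k highest-scoring items
    }


# Hypothetical per-item F1 scores for one AI tool.
scores = {"Item 1": 0.90, "Item 2": 0.50, "Item 3": 0.70, "Item 4": 0.60}
summary = summarise_item_scores(scores, k=1)
print(summary["range"], summary["top"], summary["bottom"])
```

When a study reports fewer than `2 * k` items, the top and bottom lists will overlap, which would need to be noted when presenting the summary.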

We will finally present and summarise the overall quality of studies by the reporting guideline evaluated, stratified by the type of AI tool.

Dissemination plan

We plan to disseminate the findings of this systematic review through publication in a peer-reviewed scientific journal. The final manuscript will include all methods, results, and interpretations arising from the review to support transparency and reproducibility. In addition to journal publication, we will present the key findings at relevant academic conferences and seminars to reach researchers and developers working in AI and reporting guidelines. We will also make our data extraction forms, summary tables, and analytical code publicly accessible to facilitate future research in this area.

Study status

This study is currently at the study selection stage.

Discussion

Complete reporting of health-related research is important for the usability and trustworthiness of research evidence. Reporting guidelines have been widely used to support complete reporting. However, assessments of reporting guideline adherence remain inconsistent, time-consuming, and difficult to scale. AI tools have the potential to address these limitations. As the AI field continues to evolve rapidly, a rigorous evidence synthesis is timely. This systematic review will be the first to comprehensively summarise and synthesise what AI tools have been developed to automate assessments of reporting guideline adherence. It will provide interest holders with insights into what AI tools have been used, their implementation approaches, which AI tool types perform well, and any improvements that can be made to AI tools automating assessments of reporting guideline adherence in the future.

Data availability

Underlying data

No data are associated with this article.

Extended data

Open Science Framework: Artificial intelligence tools for automating assessments of reporting guideline adherence: a protocol for a systematic review. DOI: https://doi.org/10.17605/OSF.IO/AYSTK.²⁵

This project contains the following extended data:

• Table 1.docx
• APPENDIX Section 1 Search strategy.docx
• APPENDIX Section 2 Data extraction form.docx
• APPENDIX Section 3 Quality assessment form.docx

Reporting guidelines

Open Science Framework: PRISMA-P checklist for Artificial intelligence tools for automating assessments of reporting guideline adherence: a protocol for a systematic review. DOI: https://doi.org/10.17605/OSF.IO/AYSTK.²⁵

Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

References

1. EQUATOR Network - What is a reporting guideline: (accessed 11 Feb 2026).
2. Page MJ, McKenzie JE, Bossuyt PM, et al.: The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021; 372: n71.
3. Hopewell S, Chan A-W, Collins GS, et al.: CONSORT 2025 statement: updated guideline for reporting randomised trials. BMJ. 2025; 389: e081123.
4. Collins GS, Reitsma JB, Altman DG, et al.: Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. J Br Surg. 2015; 102(3): 148–158.
5. Von Elm E, Altman DG, Egger M, et al.: The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. The Lancet. 2007; 370(9596): 1453–1457.
6. Bossuyt PM, Reitsma JB, Bruns DE, et al.: STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. Radiology. 2015; 277(3): 826–832.
7. Hamilton DG, McKenzie JE, Nejstgaard CH, et al.: Evaluation of tools used to assess adherence to PRISMA 2020 reveals inconsistent methods and poor tool implementability: part I of a systematic review. J Clin Epidemiol. 2026; 112133.
8. Dal Santo T, Rice DB, Amiri LSN, et al.: Methods and results of studies on reporting guideline adherence are poorly reported: a meta-research study. J Clin Epidemiol. 2023; 159: 225–234.
9. Ivaldi D, Burgos M, Oltra G, et al.: Adherence to PRISMA 2020 statement assessed through the expanded checklist in systematic reviews of interventions: A meta-epidemiological study. Cochrane Evidence Synthesis and Methods. 2024; 2(5): e12074.
10. Turner L, Shamseer L, Altman DG, et al.: Consolidated standards of reporting trials (CONSORT) and the completeness of reporting of randomised controlled trials (RCTs) published in medical journals. Cochrane Database Syst Rev. 2012; 11(11): MR000030.
11. Woelfle T, Hirt J, Janiaud P, et al.: Benchmarking Human–AI collaboration for common evidence appraisal tools. J Clin Epidemiol. 2024; 175: 111533.
12. Wang F, Schilsky RL, Page D, et al.: Development and Validation of a Natural Language Processing Tool to Generate the CONSORT Reporting Checklist for Randomized Clinical Trials. JAMA Netw Open. 2020; 3(10): e2014661.
13. Jiang L, Vorland CJ, Ying X, et al.: SPIRIT-CONSORT-TM: a corpus for assessing transparency of clinical trial protocol and results publications. Scientific Data. 2025; 12(1): 355.
14. Thirunavukarasu AJ, Ting DSJ, Elangovan K, et al.: Large language models in medicine. Nature Medicine. 2023; 29(8): 1930–1940.
15. Wrightson JG, Blazey P, Moher D, et al.: GPT for RCTs? Using AI to determine adherence to clinical trial reporting guidelines. BMJ Open. 2025; 15(3): e088735.
16. Chen D, Li P, Khoshkish E, et al.: AutoReporter: Development of an artificial intelligence tool for automated assessment of research reporting guideline adherence. medRxiv. 2025; 2025.04.18.25326076.
17. Forero DA, Abreu SE, Tovar BE, et al.: Large Language Models and the Analyses of Adherence to Reporting Guidelines in Systematic Reviews and Overviews of Reviews (PRISMA 2020 and PRIOR). Journal of Medical Systems. 2025; 49(1): 80.
18. He J, Rungta M, Koleczek D, et al.: Does prompt formatting have any impact on LLM performance? arXiv preprint arXiv:2411.10541. 2024.
19. Moher D, Shamseer L, Clarke M, et al.: Preferred reporting items for systematic review and meta-analysis protocols (PRISMA-P) 2015 statement. Syst Rev. 2015; 4(1): 1.
20. EQUATOR Network - How to develop a reporting guideline: (accessed 11 Feb 2026).
21. Srinivasan A, Berkowitz J, Friedrich NA, et al.: Large Language Model Analysis of Reporting Quality of Randomized Clinical Trial Articles: A Systematic Review. JAMA Network Open. 2025; 8(8): e2529418.
22. Alharbi F, Asiri S: Automated Assessment of Reporting Completeness in Orthodontic Research Using LLMs: An Observational Study. Applied Sciences. 2024; 14(22): 10323.
23. Kataoka Y, So R, Banno M, et al.: Large language models for automated PRISMA 2020 adherence checking. arXiv preprint arXiv:2511.16707. 2025.
24. He Z, Bian M, Zhu J, et al.: Evaluating the Ability of Large Language Models to Identify Adherence to CONSORT Reporting Guidelines in Randomized Controlled Trials: A Methodological Evaluation Study. arXiv preprint arXiv:2511.13107. 2025.
25. Zeng M, Liu S, Clark DP, et al.: Artificial intelligence tools for automating assessments of reporting guideline adherence: a protocol for a systematic review. OSF.
26. Harris PA, Taylor R, Thielke R, et al.: Research electronic data capture (REDCap) - a metadata-driven methodology and workflow process for providing translational research informatics support. Journal of Biomedical Informatics. 2009; 42(2): 377–381.
27. Moons KGM, Damen JAA, Kaul T, et al.: PROBAST+AI: an updated quality, risk of bias, and applicability assessment tool for prediction models using regression or artificial intelligence methods. BMJ. 2025; 388: e082505.
28. Albarqouni L, Sondrup N, Ostengaard L, et al.: Artificial Intelligence tools for Risk of Bias assessment in systematic reviews (AI4RoB): a protocol for a living systematic review. OSF. 2025.

Comments on this article Comments (0)

Version 1

VERSION 1 PUBLISHED 28 Apr 2026

Author details Author details

¹ Methods in Evidence Synthesis Unit, Monash University School of Public Health and Preventive Medicine, Melbourne, Victoria, Australia
² School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, USA
³ Department of Epidemiology, University of North Carolina Gillings School of Global Public Health, Chapel Hill, USA

Minyan Zeng
Roles: Methodology, Writing – Original Draft Preparation

Shiwei Liu
Roles: Methodology, Writing – Review & Editing

David PQ Clark
Roles: Methodology, Writing – Review & Editing

Steve McDonald
Roles: Methodology, Writing – Review & Editing

Evan Mayo-Wilson
Roles: Methodology, Writing – Review & Editing

Xiangji Ying
Roles: Methodology, Writing – Review & Editing

Joe Menke
Roles: Methodology, Writing – Review & Editing

Mengfei Lan
Roles: Methodology, Writing – Review & Editing

Lan Jiang
Roles: Methodology, Writing – Review & Editing

Kiran Ninan
Roles: Methodology, Writing – Review & Editing

Jean-Pierre Oberste
Roles: Methodology, Writing – Review & Editing

Joanne E McKenzie
Roles: Methodology, Writing – Review & Editing

Halil Kilicoglu
Roles: Methodology, Writing – Review & Editing

Matthew J Page
Roles: Conceptualization, Methodology, Writing – Review & Editing

Competing interests

No competing interests were disclosed.

Grant information

This research was supported by a Monash University Early Career Research Excellence Program (ECREP) grant. MJP is supported by a National Health and Medical Research Council Investigator Grant (GNT2033917). JEM is supported by a National Health and Medical Research Council Investigator Grant (GNT2009612).
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Article Versions (1)

version 1

Published: 28 Apr 2026, 15:626

https://doi.org/10.12688/f1000research.179775.1

Copyright

© 2026 Zeng M et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


how to cite this article

Zeng M, Liu S, Clark DP et al. Artificial intelligence tools for automating assessments of reporting guideline adherence: a protocol for a systematic review [version 1; peer review: 2 approved with reservations]. F1000Research 2026, 15:626 (https://doi.org/10.12688/f1000research.179775.1)

NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.


Open Peer Review


Reviewer Report 09 Jun 2026

Manuel Marques-Cruz, University of Porto, Porto, Porto District, Portugal

Approved with Reservations

https://doi.org/10.5256/f1000research.198323.r480816

Dear authors,
First of all, I would like to congratulate you on a (unsurprisingly) well-designed protocol for a systematic review.
I do not see any major flaws in the study design you presented. However, I would like to share some thoughts on a few methodological choices for you to consider.

1. One difficulty in defining AI is where to draw the line between deterministic approaches and “real” “computational systems capable of performing tasks that typically require human intelligence”. Some rule-based models could therefore be classified as either AI or not-AI. I suggest that the authors define more narrowly which methods they consider to be AI.

2. Building on the first point, the search strategy may need to be revised: (i) not all NLP methods will have been described as “NLP” or equivalent anywhere in the records to be retrieved; (ii) if you assume that some authors name the company, chatbot, or commercial model used (GPT, Claude, Gemini) rather than describing LLMs as the methodology, then you must also account for other models that are equally LLMs (such as Mistral, DeepSeek, Llama, Qwen, …); (iii) not all rule-based (non-NLP) methods will be captured by the “machine learning or deep learning or supervised learning or unsupervised learning” terms either.

3. Regarding the concept of reporting guidelines, there is an analogous situation. While you define a reporting guideline as “any document presenting reporting items that should appear in a research paper”, you may not have exhausted all possible descriptors for this in the search strategy.

4. Concept 2 of the search strategy may also lead to the loss of important records. Other terms may be used to describe the use of automated methods to assess reporting guidelines, such as “application” and “implementation”.

5. The main concern I tried to express is that you may end up not fulfilling your objective of doing a systematic review (which should be the synthesis of all available evidence) on the use of AI (which was ill-defined) with reporting guidelines (also ill-defined). I would prefer that you acknowledge these shortcomings in defining AI and reporting guidelines by explicitly focusing on specific AI methods and specific reporting guidelines (as the search strategy already does).

6. A second major concern is that, while aiming to synthesise all evidence, you may end up conducting not a systematic review but a more superficial metadata analysis of these records (more in line with a scoping review than a systematic review). Based on the proposed data extraction form, I am not convinced that this will not happen.

Is the rationale for, and objectives of, the study clearly described?
Yes

Is the study design appropriate for the research question?
Yes

Are sufficient details of the methods provided to allow replication by others?
Yes

Are the datasets clearly presented in a useable and accessible format?
Not applicable

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Health Data Science, Machine Learning, Artificial Intelligence (Large Language Models), Evidence Synthesis.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.


Reviewer Report 25 May 2026

Mbonigaba Celestin, Brainae University, Delaware, USA

Approved with Reservations

https://doi.org/10.5256/f1000research.198323.r482643

1. Abstract
1.1: Absence of protocol registration creates transparency concern
The abstract omits the fact that the review was not prospectively registered. This omission weakens methodological transparency.
The manuscript later states:

“We have not registered the review.”

For a high-impact systematic review protocol, failure to register with PROSPERO, OSF, or INPLASY before commencement reduces confidence in protocol immutability and in the control of selective reporting.

Authors should explicitly justify non-registration in the abstract and methods.

2. Introduction
2.1: Conceptual distinction between “reporting adherence” and “reporting quality” is insufficiently clarified
The manuscript repeatedly treats adherence and quality as closely related constructs without explicitly distinguishing them.
This is problematic because:

reporting completeness ≠ methodological quality
AI tools may identify textual presence without assessing epistemic adequacy

This conceptual distinction is essential in meta-research methodology.
The introduction should explicitly define:

reporting adherence
reporting completeness
reporting quality
reporting transparency

and explain their boundaries.
2.3: Overstatement of scalability benefits without acknowledging computational limitations
The manuscript strongly promotes scalability benefits of AI but does not sufficiently discuss:

API cost barriers
GPU dependency
token limitations
multimodal processing failures
hallucination-induced false positives

A balanced discussion requires both opportunities and structural limitations.
3. Methods
3.1 PRISMA and Protocol Registration
3.1.1: Non-registration is a major methodological weakness
The statement:
“We have not registered the review.”
is a serious concern for a high-impact systematic review protocol.
This creates risk regarding:

protocol deviation
outcome switching
selective inclusion
post hoc methodological adaptation

At minimum, the authors should:

register retrospectively on OSF
provide timestamped protocol freeze
justify why registration was omitted

3.2 Eligibility Criteria
3.2.1: Inclusion criteria for AI systems are excessively broad
The manuscript includes:

rule-based systems
BERT-like models
LLMs
VLMs
tools without explicit comparator
tools without judgment generation

This breadth creates severe heterogeneity risk.
The authors must define:

minimum AI capability threshold
operational definition of “AI tool”
distinction between extraction systems and evaluative systems

Otherwise, synthesis validity may become compromised.
3.2.2: No restriction on publication type introduces high risk of low-quality evidence inclusion
Including:

preprints
conference abstracts
non-peer-reviewed studies
arXiv manuscripts

without a weighting strategy threatens evidentiary consistency.
The protocol lacks:

publication quality stratification
sensitivity analysis excluding preprints
risk weighting by peer-review status

This should be added.
3.2.3: Comparator definition is methodologically vague
The manuscript states:
“Studies without an explicit comparator will also be eligible.”
This creates a major evaluation problem because:

tool performance becomes uninterpretable
no benchmark validity exists
internal claims cannot be verified

The authors should justify how performance validity will be interpreted in non-comparator studies.
3.3 Search Strategy
3.3.1: AI terminology search coverage may be insufficient
The protocol does not clearly indicate whether search terms include:

generative AI
foundation models
transformer models
GPT
Gemini
Claude
retrieval augmented generation
prompt engineering

Given the rapid evolution of terminology, missing these terms risks retrieval bias.
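
To make the coverage concern concrete, the sketch below assembles a Boolean search string from term lists so that model names and generic AI terms sit in one OR block. It is an illustration only, not the protocol's search strategy: the helper names `or_block` and `build_query` are hypothetical, the term lists are examples drawn from the reviewer comments, and real database syntax (field tags, truncation, proximity operators) is not modelled.

```python
def or_block(terms):
    """Join terms with OR, quoting multi-word phrases (hypothetical helper)."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(*concepts):
    """Combine one OR block per concept with AND (hypothetical helper)."""
    return " AND ".join(or_block(c) for c in concepts)


# Example term lists only; an actual strategy would be far longer.
ai_terms = [
    "large language model", "generative AI", "foundation model",
    "GPT", "Claude", "Gemini", "Mistral", "DeepSeek", "Llama", "Qwen",
]
guideline_terms = ["reporting guideline", "CONSORT", "PRISMA"]

query = build_query(ai_terms, guideline_terms)
print(query)
```

Keeping the terms in lists makes it easy to add newly named models without restructuring the query, although each database would still need its own syntax.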
3.4 Study Selection
3.4.1: No mention of calibration exercises for reviewers
The protocol does not specify:

pilot screening agreement
calibration threshold
kappa agreement target
reviewer training procedures

This weakens reproducibility and consistency assurance.
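
As an illustration of what a kappa agreement target would be measured against, the following minimal sketch computes Cohen's kappa for two reviewers screening the same records. The screening decisions are hypothetical pilot data, and the protocol itself does not specify this calculation or any threshold.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label proportions.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical pilot: 10 records screened independently by two reviewers.
reviewer_1 = ["include", "exclude", "exclude", "include", "exclude",
              "exclude", "include", "exclude", "exclude", "exclude"]
reviewer_2 = ["include", "exclude", "include", "include", "exclude",
              "exclude", "exclude", "exclude", "exclude", "exclude"]

kappa = cohen_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.3f}")
```

In this hypothetical pilot the reviewers agree on 8 of 10 records, but kappa is lower than 0.8 because part of that agreement is expected by chance.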
3.5 Data Extraction
3.5.1: Extraction framework is underdeveloped for AI reproducibility assessment
The planned extraction omits critical AI reproducibility variables such as:

model version
inference temperature
random seed
API version
hardware dependency
context window size
prompt chaining
retrieval augmentation use

These variables are essential for AI methodological interpretation.
3.5.2: No extraction of dataset governance characteristics
The protocol ignores:

dataset licensing
annotation provenance
copyright status
benchmark contamination risks

These are highly important in AI evaluation research.
3.7 Data Synthesis
3.7.1: Statistical synthesis plan is underdeveloped
The protocol states that meta-analysis is unlikely to be feasible but does not specify:

criteria for determining feasibility
heterogeneity thresholds
subgroup analysis plan
meta-regression possibilities
publication bias assessment

This creates analytical incompleteness.
3.7.2: Forest plots are proposed without clear effect size harmonization strategy
The manuscript proposes forest plots despite highly heterogeneous metrics:

accuracy
F1
sensitivity
agreement
kappa

Without standardization, pooled visual interpretation may become misleading.
The authors should define:

standardized performance metrics
transformation methods
normalization approach
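
One possible way to put heterogeneous metrics on a common footing, when a study reports the underlying 2x2 counts of AI judgments against a human reference, is to recompute every metric from those counts. The sketch below shows this under that assumption; the function name `binary_metrics` and the counts are hypothetical, and this is not a method stated in the protocol.

```python
def binary_metrics(tp, fp, fn, tn):
    """Recompute common metrics from 2x2 counts (AI tool vs human reference)."""

    def ratio(num, den):
        # Return None rather than failing when a denominator is zero.
        return num / den if den else None

    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "sensitivity": ratio(tp, tp + fn),   # also called recall
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical counts for one reporting item across 100 articles.
metrics = binary_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Agreement statistics such as kappa cannot be derived from a single reported accuracy or F1 value, so studies that do not report counts would still need to be presented separately.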

4. Discussion
4.1: Ethical and governance implications are insufficiently explored
Major missing themes:

AI replacing peer reviewers
editorial accountability
bias amplification
automated gatekeeping risks
transparency obligations in AI-assisted review

These issues are central to responsible AI deployment.
5. References and Citation Audit
5.1: Several references appear problematic or potentially unverifiable
The manuscript contains references that require verification because DOI, indexing, or stable retrieval evidence is unclear.
Potentially problematic references include:
Reference 16
“Chen D, Li P, Khoshkish E, et al.: AutoReporter...”
medRxiv. 2025; 2025.04.18.25326076.
Needs verification:

DOI not shown
unstable preprint identification format
unclear peer-review status

Reference 18
“He J, Rungta M, Koleczek D, et al.; Does prompt formatting have any impact on llm performance?. arXiv preprint arXiv:241110541.”
Problems:

arXiv identifier formatting appears incorrect
should likely contain decimal structure
title formatting inconsistent
capitalization inconsistent

This reference requires correction and verification.
Reference 23
“Kataoka Y, So R, Banno M, et al.: Large language models for automated PRISMA 2020 adherence checking. arXiv preprint arXiv:251116707. 2025.”
Potential issue:

arXiv identifier format likely invalid

Reference 24
“He Z, Bian M, Zhu J, et al.: Evaluating the Ability of Large Language Models...”
arXiv preprint arXiv:251113107. 2025.
Potential issue:

malformed arXiv identifier
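
The identifier problems flagged for References 18, 23, and 24 can be checked mechanically. Current arXiv identifiers take the form YYMM.NNNNN (four digits after the dot for 0704 to 1412, five from 1501 onward), so a nine-digit string is consistent with a dropped dot. The sketch below proposes a candidate correction under that assumption; the function name `check_arxiv_id` is hypothetical, and any candidate would still need to be verified against arXiv itself.

```python
import re

# New-style identifier: two-digit year, two-digit month, dot, 4 or 5 digits.
NEW_STYLE = re.compile(r"^(\d{2})(0[1-9]|1[0-2])\.(\d{4,5})$")
# Same pattern with the dot missing and a five-digit sequence number.
DOT_DROPPED = re.compile(r"^(\d{2})(0[1-9]|1[0-2])(\d{5})$")


def check_arxiv_id(raw):
    """Return (status, identifier); 'candidate' means a dot was restored."""
    ident = raw.strip()
    if ident.lower().startswith("arxiv:"):
        ident = ident[len("arxiv:"):].strip()
    if NEW_STYLE.match(ident):
        return ("valid", ident)
    match = DOT_DROPPED.match(ident)
    if match:
        yy, mm, seq = match.groups()
        return ("candidate", f"{yy}{mm}.{seq}")
    # Old-style identifiers (e.g. subject/YYMMNNN) are not handled here.
    return ("unrecognised", ident)


# Identifiers exactly as printed in the reference list.
for printed in ["arXiv:241110541", "arXiv:251116707", "arXiv:251113107"]:
    print(printed, "->", check_arxiv_id(printed))
```

A "candidate" result only shows that the string fits the expected pattern once the dot is restored; it does not confirm that the restored identifier belongs to the cited paper.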

5.2: Inconsistent citation formatting
Several references show inconsistent formatting regarding:

title capitalization
journal style
DOI presentation
punctuation
URL presentation

The reference list requires full standardization.

Is the rationale for, and objectives of, the study clearly described?
Yes

Is the study design appropriate for the research question?
Partly

Are sufficient details of the methods provided to allow replication by others?
Partly

Are the datasets clearly presented in a useable and accessible format?
Not applicable

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: AI

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.




