F1000Research

2046-1402

F1000 Research Limited

London, UK

10.12688/f1000research.179775.1

Study Protocol

Articles

Artificial intelligence tools for automating assessments of reporting guideline adherence: a protocol for a systematic review

[version 1; peer review: 2 approved with reservations]

Minyan Zeng (a, 1): Methodology; Writing – Original Draft Preparation. ORCID: https://orcid.org/0000-0001-7294-2599

Shiwei Liu (2): Methodology; Writing – Review & Editing. ORCID: https://orcid.org/0009-0006-9382-1538

David PQ Clark (1): Methodology; Writing – Review & Editing

Steve McDonald (1): Methodology; Writing – Review & Editing. ORCID: https://orcid.org/0000-0003-2832-5205

Evan Mayo-Wilson (3): Methodology; Writing – Review & Editing

Xiangji Ying (3): Methodology; Writing – Review & Editing

Joe Menke (2): Methodology; Writing – Review & Editing

Mengfei Lan (2): Methodology; Writing – Review & Editing

Lan Jiang (2): Methodology; Writing – Review & Editing

Kiran Ninan (3): Methodology; Writing – Review & Editing

Jean-Pierre Oberste (3): Methodology; Writing – Review & Editing. ORCID: https://orcid.org/0009-0003-2075-5267

Joanne E McKenzie (1): Methodology; Writing – Review & Editing. ORCID: https://orcid.org/0000-0003-3534-1641

Halil Kilicoglu (2): Methodology; Writing – Review & Editing

Matthew J Page (1): Conceptualization; Methodology; Writing – Review & Editing. ORCID: https://orcid.org/0000-0002-4242-7526

1 Methods in Evidence Synthesis Unit, Monash University School of Public Health and Preventive Medicine, Melbourne, Victoria, Australia

2 School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, USA

3 Department of Epidemiology, University of North Carolina Gillings School of Global Public Health, Chapel Hill, USA

a minyan.zeng@monash.edu

No competing interests were disclosed.

28 4 2026

2026

626

10 4 2026

2026

This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Complete reporting of health-related research is necessary for users to understand, appraise, and apply research results appropriately. Reporting guidelines have been developed to support complete reporting. However, assessments of reporting guideline adherence remain inconsistent, time-consuming, and difficult to scale. Artificial intelligence (AI) tools, such as traditional natural language processing models and large language models, might provide a potential solution. While numerous AI tools have been developed, no comprehensive synthesis has been undertaken to investigate what they assess, how they are implemented and perform, and their potential utility.

Objective

This systematic review aims to synthesise the characteristics and findings of studies evaluating AI tools developed to assist or automate assessments of reporting guideline adherence.

Methods

We will search MEDLINE, Embase, Scopus, Europe PMC, ACM Digital Library, IEEE Xplore, arXiv and Cochrane Colloquium Abstracts, with no restrictions on date, language, or publication type. We will include studies that evaluate AI tools to assess adherence of health-related papers to any reporting guidelines. Two authors will independently screen records, extract data and assess risk of bias. We will extract study characteristics, AI tool details, how reporting guidelines are operationalised for AI assessment, AI implementation details, comparison details, and evaluation outcomes including agreement metrics, classification performance metrics, and utility indicators. We will present and summarise results through structured tables and plots, stratified by reporting guideline and AI tool type.

Discussion

This systematic review will provide a comprehensive synthesis of AI tools developed to automate assessments of reporting guideline adherence. It will provide interest holders with insights into what AI tools have been used, their implementation approaches, which AI tool types perform well, and any improvements that can be made to AI tools automating assessments of reporting guideline adherence in the future.

Keywords: Reporting guidelines, Artificial intelligence, Adherence

Monash University Early Career Research Excellence Program (ECREP) grant

National Health and Medical Research Council Investigator Grant

GNT2009612

National Health and Medical Research Council Investigator Grant

GNT2033917

This research was supported by a Monash University Early Career Research Excellence Program (ECREP) grant. MJP is supported by a National Health and Medical Research Council Investigator Grant (GNT2033917). JEM is supported by a National Health and Medical Research Council Investigator Grant (GNT2009612).

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Introduction

Complete reporting of health-related research is necessary for users to understand, appraise, and apply research results appropriately. Reporting guidelines recommend what should be reported, explain why it should be reported, and include exemplars of complete reporting to guide authors and other interest holders (e.g., peer reviewers, editors). ¹ Reporting guidelines have been developed for different types of research, such as PRISMA (preferred reporting items for systematic reviews and meta-analyses) for systematic reviews, ² CONSORT (consolidated standards of reporting trials) for randomised trials, ³ TRIPOD (transparent reporting of a multivariable prediction model for individual prognosis or diagnosis) for prediction models, ⁴ STROBE (strengthening the reporting of observational studies in epidemiology) for observational studies ⁵ and STARD (standards for reporting of diagnostic accuracy studies) for diagnostic studies. ⁶ Many of these “core” reporting guidelines have multiple extensions that provide additional reporting recommendations for specific aspects not covered in the core statement (e.g., types of outcomes, specific designs, analytic methods).

Routine assessments of reporting guideline adherence have been performed manually by authors, editors, and reviewers to judge whether reporting recommendations have been met. Because reporting guidelines do not specify criteria for evaluating adherence, researchers have had to develop their own assessment criteria and methods. ^{7,8} Researchers must also decide whether to assess all checklist items/recommendations or only a subset, and meta-research studies suggest that most have chosen to focus on selected items. ^{7,9,10} These decisions have led to considerable variability in what is assessed and how it is assessed. ^{7,9,10} Also, manual evaluation is time-consuming and resource-intensive. ¹¹ Additionally, research questions such as what characteristics (e.g., time, discipline, journal) predict better or worse reporting are difficult to address at scale with a large body of literature using a manual evaluation approach. Therefore, more efficient, consistent, and scalable methods are needed.

Artificial intelligence (AI), defined as computational systems capable of performing tasks that typically require human intelligence, such as learning, reasoning, and decision-making, might provide a potential solution. Early attempts to automate assessments of reporting guideline adherence relied on traditional natural language processing (NLP) models. Examples include CONSORT-NLP, ¹² which combines rule-based and machine learning-based approaches to automatically complete the CONSORT checklist from randomized clinical trial reports, and the SPIRIT-CONSORT-TM, ¹³ an annotated corpus designed to train NLP models to automatically assess adherence to reporting recommendations in clinical trial protocols and result publications. However, these traditional NLP systems generally require substantial guideline-specific annotated datasets for development, and are applicable only to the particular guideline for which they were designed. Moreover, most systems focus on detecting local text segments, which could limit their utility for end-to-end evaluation in long research publications with multimodal data components (e.g., text, tables, and figures).

The advent of large language models (LLMs) and vision language models (VLMs), such as GPT and Gemini, provides another opportunity to scale up assessments of reporting guideline adherence. Trained on extensive data from articles, books and other online sources, ¹⁴ these models are capable of processing complex data components, extracting information, summarising evidence, and generating outputs that are relevant to reporting guideline items. Several studies have used these models to assess reporting guideline adherence. ^{15–17}

However, the outputs of LLMs and VLMs are sensitive to how they are implemented. Data preprocessing, prompts, and model inference settings might all influence model performance on specific tasks. For example, empirical work has shown that different prompt templates and formatting can substantially influence LLM outputs, though advanced models (e.g., GPT-4 compared to GPT-3.5-turbo) may demonstrate more robustness to such variations. ¹⁸ More importantly, because of the variability in assessment criteria and methods for evaluating adherence, researchers might use different prompts to ask subtly different questions for reporting guideline items (e.g., whether a guideline item is reported or whether it is reported adequately or fully). Additionally, even with identical prompts, fixed model parameters and a fixed random seed, models may occasionally generate different outputs across runs due to hardware-level randomness, which makes strict reproducibility difficult to achieve. The “black-box” nature of these models also limits transparency in the process of decision-making, and model hallucinations, although an area of active improvement, may also challenge reliability in high-stakes fields such as health-related research.

While numerous AI systems and prototypes have been developed to automate assessment of reporting guideline adherence, ^{11–13,15–17} no comprehensive synthesis has been undertaken to investigate what they assess, how they are implemented and perform, and their potential utility in research and publication workflows.

Objective

This systematic review aims to summarise and synthesise the characteristics and findings of studies evaluating AI tools developed to assist or automate assessments of reporting guideline adherence.

Methods

We have reported this protocol in accordance with the Preferred Reporting Items for Systematic Review and Meta-Analysis Protocols (PRISMA-P) statement ¹⁹ and with consideration of the methods items in the more recent PRISMA 2020 statement. ² We have not registered the review.

Eligibility criteria

• Study designs

We will include studies of any design that evaluate the performance of AI tools developed to assess adherence of health-related research papers to reporting guidelines. Eligible study designs include diagnostic accuracy studies, validation studies, and trials comparing AI tool and human performance, as well as methodological studies comparing different AI approaches. Studies will be included regardless of language, publication date, or publication type (e.g., journal article, conference proceeding).

• Reporting guidelines

We will include studies regardless of the reporting guideline evaluated, such as PRISMA, CONSORT, TRIPOD, STROBE, and STARD, and any of their extensions. By “reporting guideline”, we mean any document presenting reporting items that should appear in a research paper (regardless of whether presented as a checklist or structured text) and in which the authors explain how the items were developed. ²⁰

• AI tools and comparator

We will include any AI application, tool, or algorithm that (i) makes judgements about reporting guideline adherence, or (ii) identifies relevant text about reporting guideline adherence in a paper without making a judgement about adherence. Eligible systems could include any models that learn patterns from text with/without imaging data in the research papers, such as traditional natural language processing models (e.g., rule-based and BERT-like models) as well as LLMs and VLMs (e.g., GPT-5.2 and Gemini 3). We will include studies that compare AI tools with human assessment and studies that compare multiple AI tools with each other. Studies without an explicit comparator will also be eligible.

• Outcomes

We will include studies regardless of the outcomes assessed or reported. Outcomes of interest to this review include: (i) agreement (overall and for each item/recommendation) between the AI tool and human assessors using raw and chance corrected agreement metrics (e.g., Cohen’s kappa); (ii) classification performance (overall and/or for each item/recommendation) as determined using metrics such as accuracy, F1 score, sensitivity, specificity, positive and negative predictive values, and c-statistic; and (iii) utility indicators (e.g., task completion time, computational/API cost, and token usage across papers).
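As an illustration of the agreement and classification metrics named above, the following minimal Python sketch derives them from paired binary judgements (item reported versus not reported), treating the human assessment as the reference standard. The names `Confusion`, `confusion_from_labels`, and `metrics` are hypothetical and are not part of this protocol or its data extraction form; included studies may define or compute these metrics differently (e.g., with more than two response categories). With binary judgements, raw agreement and accuracy are the same quantity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """2x2 counts with the human assessment as the reference standard."""
    tp: int  # AI: reported,     human: reported
    fp: int  # AI: reported,     human: not reported
    fn: int  # AI: not reported, human: reported
    tn: int  # AI: not reported, human: not reported


def confusion_from_labels(ai: list[bool], human: list[bool]) -> Confusion:
    """Tabulate paired judgements (True means the item was judged as reported)."""
    if len(ai) != len(human):
        raise ValueError("label lists must have equal length")
    pairs = list(zip(ai, human))
    return Confusion(
        tp=sum(a and h for a, h in pairs),
        fp=sum(a and not h for a, h in pairs),
        fn=sum(not a and h for a, h in pairs),
        tn=sum(not a and not h for a, h in pairs),
    )


def _ratio(numerator: float, denominator: float) -> float:
    """Return NaN instead of failing when a metric is undefined."""
    return numerator / denominator if denominator else float("nan")


def metrics(c: Confusion) -> dict[str, float]:
    n = c.tp + c.fp + c.fn + c.tn
    if n == 0:
        raise ValueError("no assessments to evaluate")
    observed = (c.tp + c.tn) / n
    # Agreement expected by chance, from the marginal totals of each rater.
    expected = (
        (c.tp + c.fp) * (c.tp + c.fn) + (c.fn + c.tn) * (c.fp + c.tn)
    ) / n**2
    return {
        "raw_agreement": observed,
        "cohen_kappa": _ratio(observed - expected, 1 - expected),
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "ppv": _ratio(c.tp, c.tp + c.fp),
        "npv": _ratio(c.tn, c.tn + c.fn),
        "f1": _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn),
    }
```

The same functions can be applied per item/recommendation (one confusion table per checklist item) or overall (all items pooled), which corresponds to the two levels of reporting described above.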

Search methods

We will search bibliographic databases and supplementary sources for eligible studies. Databases include MEDLINE (via Ovid), Embase (via Ovid), Scopus, Europe PMC, ACM Digital Library, and IEEE Xplore. We will not limit searches by date, language, publication status or publication format (except for Europe PMC, which will be restricted to preprints). Europe PMC will be used to search across several preprint servers (e.g., medRxiv, bioRxiv, preprints.org, SSRN) and we will also search the arXiv preprint server, as it is not comprehensively covered by Europe PMC. Additional sources include the abstracts of the Cochrane Colloquium. The final part of the search will involve manual backward citation tracking and forward citation tracking using LENS.org for all studies included in the review.

An experienced information specialist (SM) designed the search strategies with input from the review team. The search includes terms related to the concepts of AI, adherence, and reporting. Several seed articles (based on articles known to the review team) ^{11,13,15–17,21–24} were used to develop the MEDLINE search. The MEDLINE search was then translated and adapted for use in the other sources. The search strategy was iteratively tested to achieve an optimal balance between recall and precision. Full search strategies are available as Extended data (see Data availability section). ²⁵

Study selection

All records will first be deduplicated using the built-in functions of the reference management tools we will use (i.e., EndNote and Covidence). Two reviewers (out of MZ, SL, DPQC, JM, ML, LJ, KN, JO) will then independently screen all titles and abstracts. Records considered eligible or uncertain by either reviewer will undergo full-text screening, in which the same two reviewers will independently assess the full text. Any disagreements will be resolved by discussion or by consulting a third reviewer. Title and abstract screening of bibliographic database records will be conducted using Covidence. For arXiv and Cochrane Colloquium Abstracts, a screening form will be created in Microsoft Excel with the link for each record and the search date.
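The two-stage decision rule described above can be sketched as follows. This is only an illustration of the logic, assuming three possible votes at title and abstract stage and a binary vote at full-text stage; the names `Vote`, `advances_to_full_text`, and `full_text_decision` are hypothetical, and the actual screening will be managed in Covidence and Microsoft Excel rather than in code.

```python
from enum import Enum


class Vote(Enum):
    """Hypothetical title/abstract screening votes."""
    INCLUDE = "include"
    UNCERTAIN = "uncertain"
    EXCLUDE = "exclude"


def advances_to_full_text(vote_a: Vote, vote_b: Vote) -> bool:
    """A record advances if either reviewer votes include or uncertain."""
    return any(v is not Vote.EXCLUDE for v in (vote_a, vote_b))


def full_text_decision(include_a: bool, include_b: bool) -> str:
    """Agreement settles the decision; disagreement is flagged for
    discussion or referral to a third reviewer."""
    if include_a == include_b:
        return "include" if include_a else "exclude"
    return "resolve"
```

Under this rule a record is dropped at title and abstract stage only when both reviewers vote to exclude it, which favours recall over precision at the first stage.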

Data extraction

Two reviewers (out of MZ, SL, DPQC) will independently conduct the data extraction using a data extraction form (available as Extended data; see Data availability section). ²⁵ The data extraction form will be piloted by reviewers on a sample of included studies prior to the full data extraction process. Any discrepancies in the data collected between the two reviewers will be resolved via discussion or by consulting with a third reviewer (MJP or JEM). Data extraction will be conducted using a data extraction tool (REDCap version 15.5.30). ²⁶ Where necessary and available, additional sources will be consulted to supplement information extracted from the included studies, such as published study protocols, registry entries, or primary dataset documentation. If information remains missing or unclear, we will contact the study authors for further information. The information that will be extracted from each included study is provided in Table 1 (available as Extended data; see Data availability section). ²⁵

Quality assessment of included studies

To evaluate the quality of the included studies, two reviewers (out of MZ, SL, DPQC) will independently apply a defined set of quality indicators. These indicators are informed by two established tools, PROBAST+AI ²⁷ and the tool used in a living systematic review of AI tools for risk of bias assessment, ²⁸ which offer relevant concepts for assessing AI tools. The quality indicators will cover the following domains:

• AI tool development

Whether the AI tool was developed rigorously (e.g., adequate model training and prompt engineering).

• Reference standard

Whether the reference standard assessment was conducted rigorously (e.g., performed by trained assessors, assessed by at least two assessors independently with consensus procedures in place).

• Independence of assessments and risk of data leakage

Whether the AI tool was applied to the studies without knowledge of the reference standard assessment and vice versa; whether the AI tool’s final performance was evaluated on an independent test set that was not used for model training or prompt development/refinement; and whether there was a low risk that the annotations of the test corpus were part of the AI model’s training data.

• Study planning

Whether the study was based on a publicly available protocol or registration record.

Each indicator will be judged as high quality, low quality, or unclear quality. The quality assessment form is available as Extended data (see Data availability section). ²⁵ A study will be deemed high quality overall if all quality indicators are deemed high quality, low quality overall if at least one indicator is deemed low quality, and unclear quality overall if at least one indicator is deemed unclear quality but none are deemed low quality. Disagreements between reviewers will be resolved through discussion or adjudication by a third reviewer.
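The rule for deriving the overall rating from the indicator-level judgements can be written as a short function. The sketch below assumes each indicator is recorded as one of the strings "high", "low", or "unclear"; the function name `overall_quality` and this encoding are illustrative and do not come from the quality assessment form.

```python
ALLOWED_RATINGS = {"high", "low", "unclear"}


def overall_quality(indicator_ratings: list[str]) -> str:
    """Combine indicator-level ratings into an overall study rating.

    Any "low" gives low overall; otherwise any "unclear" gives unclear
    overall; only a study rated "high" on every indicator is high overall.
    """
    if not indicator_ratings:
        raise ValueError("at least one indicator rating is required")
    unknown = set(indicator_ratings) - ALLOWED_RATINGS
    if unknown:
        raise ValueError(f"unexpected ratings: {sorted(unknown)}")
    if "low" in indicator_ratings:
        return "low"
    if "unclear" in indicator_ratings:
        return "unclear"
    return "high"
```

Because "low" is checked first, a study with both low and unclear indicators is rated low overall, matching the precedence stated in the text.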

Data syntheses and analyses

Given the anticipated diversity in AI tools, reporting guidelines, study designs, and outcome measures, formal meta-analysis is unlikely to be feasible across all outcomes. We will therefore present and summarise results of each of the included studies through structured tables and plots.

We will use structured tables to present study characteristics, reporting guidelines assessed and scope, dataset characteristics, dataset sources and formats, reference annotation for datasets, AI tool details, application of the AI tool, AI implementation details, and comparison details. Tables will be organised by reporting guideline evaluated, and then by the type of AI tool (traditional NLP models versus LLM-based/VLM-based models).

We will then present AI tool performance and utility findings in tables organised by the reporting guideline evaluated, stratified by the type of AI tool and each outcome category (i.e., classification performance metrics, agreement metrics and utility indicators). Where multiple metrics are reported within the same outcome category, we will extract pre-specified metrics as detailed in the Data extraction form (available as Extended data; see Data availability section). ²⁵ Where preferred metrics are unavailable, we will consider and note the alternative metrics reported by the study authors. We will summarise outcomes at the overall level using descriptive statistics (e.g., mean, median, range across items) and also present the overall results in forest plots, stratified by reporting guideline and AI model. When item-level/recommendation-level outcomes are also available (e.g., classification performance metrics of adherence for each PRISMA item), we will summarise specific item-level results to facilitate performance interpretation using pre-specified rules, including the items with high and low performance (e.g., top and bottom five items for agreement metrics, accuracy and F1 score). When there are multiple results available for the same outcome across training, validation and test datasets, we will extract and summarise results identified by study authors as primary and/or the results from the most representative evaluation setting. In this circumstance, we will note that multiple results are available, and our reason for selecting the reported result.
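A minimal sketch of the item-level summary described above, assuming one value per checklist item for a single metric (e.g., F1 score) from a single study. The function name `summarise_items` and the item labels in the usage note are hypothetical; the pre-specified rules in the data extraction form take precedence over this illustration.

```python
from statistics import mean, median


def summarise_items(scores: dict[str, float], k: int = 5) -> dict:
    """Descriptive statistics across items plus the k best and k worst items."""
    if not scores:
        raise ValueError("no item-level scores provided")
    if k < 1:
        raise ValueError("k must be at least 1")
    # Highest score first; ties keep their original order.
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    values = list(scores.values())
    return {
        "mean": mean(values),
        "median": median(values),
        "range": (min(values), max(values)),
        "top": ranked[:k],
        "bottom": ranked[-k:][::-1],  # lowest score first
    }
```

For example, `summarise_items({"item 1": 0.9, "item 2": 0.5, "item 3": 0.7}, k=1)` lists "item 1" as the top item and "item 2" as the bottom item. When a guideline has fewer than 2k items, the top and bottom lists overlap, so a smaller k may be more informative.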

We will finally present and summarise the overall quality of studies by the reporting guideline evaluated, stratified by the type of AI tool.

Dissemination plan

We plan to disseminate the findings of this systematic review through publication in a peer-reviewed scientific journal. The final manuscript will include all methods, results, and interpretations arising from the review to support transparency and reproducibility. In addition to journal publication, we will present the key findings at relevant academic conferences and seminars to reach researchers and developers working in AI and reporting guidelines. We will also make our data extraction forms, summary tables, and analytical code publicly accessible to facilitate future research in this area.

Study status

This study is currently at the study selection stage.

Discussion

Complete reporting of health-related research is important for the usability and trustworthiness of research evidence. Reporting guidelines have been widely used to support complete reporting. However, assessments of reporting guideline adherence remain inconsistent, time-consuming, and difficult to scale. AI tools have the potential to address these limitations. As the AI field continues to evolve rapidly, a rigorous evidence synthesis is timely. This systematic review will be the first to comprehensively summarise and synthesise what AI tools have been developed to automate assessments of reporting guideline adherence. It will provide interest holders with insights into what AI tools have been used, their implementation approaches, which AI tool types perform well, and any improvements that can be made to AI tools automating assessments of reporting guideline adherence in the future.

Data availability

Underlying data

No data are associated with this article.

Extended data

Open Science Framework: Artificial intelligence tools for automating assessments of reporting guideline adherence: a protocol for a systematic review. DOI: https://doi.org/10.17605/OSF.IO/AYSTK. ²⁵

This project contains the following extended data:

• Table 1.docx

• APPENDIX Section 1 Search strategy.docx

• APPENDIX Section 2 Data extraction form.docx

• APPENDIX Section 3 Quality assessment form.docx

Reporting guidelines

Open Science Framework: PRISMA-P checklist for Artificial intelligence tools for automating assessments of reporting guideline adherence: a protocol for a systematic review. DOI: https://doi.org/10.17605/OSF.IO/AYSTK. ²⁵

Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

References

1. EQUATOR Network - What is a reporting guideline (accessed on 11 Feb 2026).

2. Page, McKenzie, Bossuyt: The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71.

3. Hopewell, Chan A-W, Collins: CONSORT 2025 statement: updated guideline for reporting randomised trials. BMJ. 2025;389:e081123. PMID: 40228833. DOI: 10.1136/bmj-2024-081123. PMCID: PMC11995449.

4. Collins, Reitsma, Altman: Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. J Br Surg. 2015;102(3):148–158.

5. Von Elm, Altman, Egger: The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. The Lancet. 2007;370(9596):1453–1457. DOI: 10.1016/S0140-6736(07)61602-X.

6. Bossuyt, Reitsma, Bruns: STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. Radiology. 2015;277(3):826–832. DOI: 10.1148/radiol.2015151516.

7. Hamilton, McKenzie, Nejstgaard: Evaluation of tools used to assess adherence to PRISMA 2020 reveals inconsistent methods and poor tool implementability: part I of a systematic review. J Clin Epidemiol. 2026;112133.

8. Dal Santo, Rice, Amiri LSN: Methods and results of studies on reporting guideline adherence are poorly reported: a meta-research study. J Clin Epidemiol. 2023;159:225–234. PMID: 37271424. DOI: 10.1016/j.jclinepi.2023.05.017.

9. Ivaldi, Burgos, Oltra: Adherence to PRISMA 2020 statement assessed through the expanded checklist in systematic reviews of interventions: A meta-epidemiological study. Cochrane Evidence Synthesis and Methods. 2024;2(5):e12074. PMID: 40476264. DOI: 10.1002/cesm.12074. PMCID: PMC11795886.

10. Turner, Shamseer, Altman: Consolidated standards of reporting trials (CONSORT) and the completeness of reporting of randomised controlled trials (RCTs) published in medical journals. Cochrane Database Syst Rev. 2012;11(11):MR000030.

11. Woelfle, Hirt, Janiaud: Benchmarking Human–AI collaboration for common evidence appraisal tools. J Clin Epidemiol. 2024;175:111533. PMID: 39277058. DOI: 10.1016/j.jclinepi.2024.111533.

12. Wang, Schilsky, Page: Development and Validation of a Natural Language Processing Tool to Generate the CONSORT Reporting Checklist for Randomized Clinical Trials. JAMA Netw Open. 2020;3(10):e2014661. PMID: 33030549. DOI: 10.1001/jamanetworkopen.2020.14661. PMCID: PMC7545295.

13. Jiang, Vorland, Ying: SPIRIT-CONSORT-TM: a corpus for assessing transparency of clinical trial protocol and results publications. Scientific Data. 2025;12(1):355. PMID: 40021657. DOI: 10.1038/s41597-025-04629-1. PMCID: PMC11871027.

14. Thirunavukarasu, Ting DSJ, Elangovan: Large language models in medicine. Nature medicine. 2023;29(8):1930–1940. DOI: 10.1038/s41591-023-02448-8.

15. Wrightson, Blazey, Moher: GPT for RCTs? Using AI to determine adherence to clinical trial reporting guidelines. BMJ Open. 2025;15(3):e088735. PMID: 40107689. DOI: 10.1136/bmjopen-2024-088735. PMCID: PMC11927406.

16. Chen, Khoshkish: AutoReporter: Development of an artificial intelligence tool for automated assessment of research reporting guideline adherence. medRxiv. 2025;2025.04.18.25326076.

17. Forero, Abreu, Tovar: Large Language Models and the Analyses of Adherence to Reporting Guidelines in Systematic Reviews and Overviews of Reviews (PRISMA 2020 and PRIOR). Journal of Medical Systems. 2025;49(1):80. PMID: 40504403. DOI: 10.1007/s10916-025-02212-0. PMCID: PMC12162794.

18. Rungta, Koleczek: Does prompt formatting have any impact on LLM performance? arXiv preprint arXiv:2411.10541. 2024.

19. Moher, Shamseer, Clarke: Preferred reporting items for systematic review and meta-analysis protocols (PRISMA-P) 2015 statement. Syst Rev. 2015;4(1):1. DOI: 10.1186/2046-4053-4-1.

20. EQUATOR Network - How to develop a reporting guideline (accessed on 11 Feb 2026).

21. Srinivasan, Berkowitz, Friedrich: Large Language Model Analysis of Reporting Quality of Randomized Clinical Trial Articles: A Systematic Review. JAMA Network Open. 2025;8(8):e2529418. PMID: 40875232. DOI: 10.1001/jamanetworkopen.2025.29418. PMCID: PMC12395317.

22. Alharbi, Asiri: Automated Assessment of Reporting Completeness in Orthodontic Research Using LLMs: An Observational Study. Applied Sciences. 2024;14(22):10323. DOI: 10.3390/app142210323.

23. Kataoka, Banno: Large language models for automated PRISMA 2020 adherence checking. arXiv preprint arXiv:2511.16707. 2025.

24. Bian, Zhu: Evaluating the Ability of Large Language Models to Identify Adherence to CONSORT Reporting Guidelines in Randomized Controlled Trials: A Methodological Evaluation Study. arXiv preprint arXiv:2511.13107. 2025.

25. Zeng, Liu, Clark: Artificial intelligence tools for automating assessments of reporting guideline adherence: a protocol for a systematic review. OSF. 2026. DOI: 10.17605/OSF.IO/AYSTK.

26. Harris, Taylor, Thielke: Research electronic data capture (REDCap) - a metadata-driven methodology and workflow process for providing translational research informatics support. Journal of Biomedical Informatics. 2009;42(2):377–381. PMID: 18929686. DOI: 10.1016/j.jbi.2008.08.010. PMCID: PMC2700030.

27. Moons KGM, Damen JAA, Kaul: PROBAST+AI: an updated quality, risk of bias, and applicability assessment tool for prediction models using regression or artificial intelligence methods. BMJ. 2025;388:e082505. PMID: 40127903. DOI: 10.1136/bmj-2024-082505. PMCID: PMC11931409.

28. Albarqouni, Sondrup, Ostengaard: Artificial Intelligence tools for Risk of Bias assessment in systematic reviews (AI4RoB): a protocol for a living systematic review. OSF. 2025. DOI: 10.17605/OSF.IO/RDEZ3.

10.5256/f1000research.198323.r480816

Reviewer response for version 1

Manuel Marques-Cruz (1): Referee. ORCID: https://orcid.org/0000-0002-9827-2551

1 University of Porto, Porto, Porto District, Portugal

Competing interests: No competing interests were disclosed.

9 6 2026

2026

This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Recommendation: approve with reservations

Dear authors,

First of all, I would like to congratulate you on a (unsurprisingly) well-designed protocol for a systematic review.

I do not see any major flaws with the study design you presented. However, I would like to share some thoughts on certain methodological choices for you to consider.

1. One difficulty in defining AI is where to draw the line between deterministic approaches and “real” “computational systems capable of performing tasks that typically require human intelligence”. Some rule-based models could therefore be classified as either AI or not-AI. I suggest the authors define more narrowly the methods that they consider to be AI.

2. Building on the first point, the search strategy may need to be revised: (i) not all NLP methods will have been described as “NLP” or equivalent anywhere in the records to be retrieved; (ii) if you assume that some authors may name the company/chatbot/commercial model used (GPT, Claude, Gemini) instead of describing the use of LLMs as methodology, then you must acknowledge the existence of other models that are equally LLMs (such as Mistral, DeepSeek, Llama, Qwen, …); (iii) not all rule-based (not-NLP) methods will fit under “machine learning or deep learning or supervised learning or unsupervised learning” either.

3. Regarding the concept of reporting guidelines there is an analogous situation. While defining a reporting guideline as “any document presenting reporting items that should appear in a research paper”, you may not have exhausted all possible descriptors for this in the search strategy.

4. Concept 2 of the search strategy may also lead to loss of important records. There are other terms that may be used to describe the use of automated methods to assess reporting guidelines, such as “application”, “implementation”, …

5. My main concern is that you may not fulfil your objective of conducting a systematic review (which should be the synthesis of all available evidence) of the use of AI (which was ill-defined) for reporting guidelines (also ill-defined). I would prefer that you acknowledge these shortcomings in defining AI and reporting guidelines and adopt an explicit focus on specific AI methods and specific reporting guidelines (which the search strategy already reflects).

6. A second major concern is that, while aiming to synthesise all evidence, you may end up conducting not a systematic review but a more superficial metadata analysis of these records (in line with a scoping review rather than a systematic review). Based on the proposed data extraction form, I am not convinced that this will be avoided.

Is the study design appropriate for the research question?

Yes

Is the rationale for, and objectives of, the study clearly described?

Yes

Are sufficient details of the methods provided to allow replication by others?

Yes

Are the datasets clearly presented in a useable and accessible format?

Not applicable

Reviewer Expertise:

Health Data Science, Machine Learning, Artificial Intelligence (Large Language Models), Evidence Synthesis.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

10.5256/f1000research.198323.r482643

Reviewer response for version 1

Celestin

Mbonigaba

1 Referee https://orcid.org/0000-0002-7381-8888 1Brainae University, Delaware, USA

Competing interests: No competing interests were disclosed.

25 5 2026

2026

recommendation

approve-with-reservations

1. Abstract

1.1: Absence of protocol registration creates transparency concern

The abstract omits the fact that the review was not prospectively registered. This omission weakens methodological transparency.

The manuscript later states:

“We have not registered the review.”

For a high-impact systematic review protocol, failure to register with PROSPERO, OSF, or INPLASY before commencement reduces confidence in protocol immutability and in the control of selective reporting.

Authors should explicitly justify non-registration in the abstract and methods.

2. Introduction

2.1: Conceptual distinction between “reporting adherence” and “reporting quality” is insufficiently clarified

The manuscript repeatedly treats adherence and quality as closely related constructs without explicitly distinguishing them.

This is problematic because:

reporting completeness ≠ methodological quality

AI tools may identify textual presence without assessing epistemic adequacy

This conceptual distinction is essential in meta-research methodology.

The introduction should explicitly define:

reporting adherence

reporting completeness

reporting quality

reporting transparency

and explain their boundaries.

2.3: Overstatement of scalability benefits without acknowledging computational limitations

The manuscript strongly promotes scalability benefits of AI but does not sufficiently discuss:

API cost barriers

GPU dependency

token limitations

multimodal processing failures

hallucination-induced false positives

A balanced discussion requires both opportunities and structural limitations.

3. Methods

3.1 PRISMA and Protocol Registration

3.1.1: Non-registration is a major methodological weakness

The statement:

“We have not registered the review.”

is a serious concern for a high-impact systematic review protocol.

This creates risk regarding:

protocol deviation

outcome switching

selective inclusion

post hoc methodological adaptation

At minimum, the authors should:

provide a timestamped protocol freeze

justify why registration was omitted

3.2 Eligibility Criteria

3.2.1: Inclusion criteria for AI systems are excessively broad

The manuscript includes:

rule-based systems

BERT-like models

LLMs

VLMs

tools without explicit comparator

tools without judgment generation

This breadth creates severe heterogeneity risk.

The authors must define:

minimum AI capability threshold

operational definition of “AI tool”

distinction between extraction systems and evaluative systems

Otherwise, synthesis validity may become compromised.

3.2.2: No restriction on publication type introduces high risk of low-quality evidence inclusion

Including:

preprints

conference abstracts

non-peer-reviewed studies

arXiv manuscripts

without a weighting strategy threatens evidentiary consistency.

The protocol lacks:

publication quality stratification

sensitivity analysis excluding preprints

risk weighting by peer-review status

This should be added.

3.2.3: Comparator definition is methodologically vague

The manuscript states:

“Studies without an explicit comparator will also be eligible.”

This creates a major evaluation problem because:

tool performance becomes uninterpretable

no benchmark validity exists

internal claims cannot be verified

The authors should justify how performance validity will be interpreted in non-comparator studies.

3.3 Search Strategy

3.3.1: AI terminology search coverage may be insufficient

The protocol does not clearly indicate whether search terms include:

generative AI

foundation models

transformer models

GPT

Gemini

Claude

retrieval augmented generation

prompt engineering

Given the rapid evolution of terminology, missing these terms risks retrieval bias.
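As an illustration only, the coverage check implied above could be sketched as follows. The search block shown is hypothetical and is not the protocol's actual strategy; the term list is the one given in this comment. Plain substring matching is an assumption and ignores database-specific truncation and field tags.

```python
# Hypothetical AI-concept search block (NOT the protocol's actual strategy).
search_block = (
    '"artificial intelligence" OR "machine learning" OR "deep learning" '
    'OR "natural language processing" OR "large language model*" '
    "OR GPT OR Claude OR Gemini"
)

# Terms listed in comment 3.3.1.
reviewer_terms = [
    "generative AI",
    "foundation models",
    "transformer models",
    "GPT",
    "Gemini",
    "Claude",
    "retrieval augmented generation",
    "prompt engineering",
]


def missing_terms(block: str, terms: list[str]) -> list[str]:
    """Return the terms that do not occur (case-insensitively) in the search block."""
    lowered = block.lower()
    return [term for term in terms if term.lower() not in lowered]


missing = missing_terms(search_block, reviewer_terms)
print(missing)
```

Running the same check against the real strategy for each database would show which of the listed terms, if any, are absent.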

3.4 Study Selection

3.4.1: No mention of calibration exercises for reviewers

The protocol does not specify:

pilot screening agreement

calibration threshold

kappa agreement target

reviewer training procedures

This weakens reproducibility and consistency assurance.
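To make the request for a kappa agreement target concrete, a minimal sketch of Cohen's kappa for two reviewers on a pilot screening set is given below. The screening decisions are invented for illustration, and any threshold applied to the result would be for the authors to pre-specify.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label proportions.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical pilot screening decisions for ten records.
reviewer_1 = ["include", "exclude", "exclude", "include", "exclude",
              "exclude", "include", "exclude", "exclude", "exclude"]
reviewer_2 = ["include", "exclude", "exclude", "exclude", "exclude",
              "exclude", "include", "exclude", "include", "exclude"]

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(round(kappa, 3))
```

In this invented example the reviewers agree on 8 of 10 records, but kappa is noticeably lower than 0.8 because much of that agreement is expected by chance when most records are excluded.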

3.5 Data Extraction

3.5.1: Extraction framework is underdeveloped for AI reproducibility assessment

The planned extraction omits critical AI reproducibility variables such as:

model version

inference temperature

random seed

API version

hardware dependency

context window size

prompt chaining

retrieval augmentation use

These variables are essential for AI methodological interpretation.
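One possible way to operationalise these variables in the extraction form is sketched below. The field names simply mirror the list above, the class and the example study are hypothetical, and the authors' actual form may be structured differently.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class AIReproducibilityRecord:
    """Hypothetical extraction record; None means 'not reported by the study'."""
    study_id: str
    model_version: Optional[str] = None
    inference_temperature: Optional[float] = None
    random_seed: Optional[int] = None
    api_version: Optional[str] = None
    hardware_dependency: Optional[str] = None
    context_window_size: Optional[int] = None
    prompt_chaining: Optional[bool] = None
    retrieval_augmentation: Optional[bool] = None

    def unreported(self) -> list[str]:
        """Names of the variables the study did not report."""
        return [name for name, value in asdict(self).items() if value is None]


# Invented example: a study reporting only model version and temperature.
record = AIReproducibilityRecord(
    study_id="hypothetical-study-01",
    model_version="example-model-v1",
    inference_temperature=0.0,
)
print(record.unreported())
```

Recording "not reported" explicitly, rather than leaving fields blank, would allow the review to summarise how often each variable is missing across studies.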

3.5.2: No extraction of dataset governance characteristics

The protocol ignores:

dataset licensing

annotation provenance

benchmark contamination risks

These are highly important in AI evaluation research.

3.7 Data Synthesis

3.7.1: Statistical synthesis plan is underdeveloped

The protocol states that meta-analysis is unlikely to be feasible but does not specify:

criteria for determining feasibility

heterogeneity thresholds

subgroup analysis plan

meta-regression possibilities

publication bias assessment

This creates analytical incompleteness.

3.7.2: Forest plots are proposed without clear effect size harmonization strategy

The manuscript proposes forest plots despite highly heterogeneous metrics:

accuracy

sensitivity

agreement

kappa

Without standardization, pooled visual interpretation may become misleading.

The authors should define:

standardized performance metrics

transformation methods

normalization approach
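As one example of a transformation method, proportion-type metrics such as accuracy and sensitivity could be placed on a common logit scale before plotting. The sketch below assumes the raw counts are available, uses a Wald interval on the logit scale with a 0.5 continuity correction, and does not apply to kappa, which is not a proportion and would need separate handling. It is an illustration, not a recommended analysis plan.

```python
import math


def logit_ci(events: int, total: int, z: float = 1.96):
    """Back-transformed logit estimate and Wald interval for a proportion.

    Returns (estimate, lower, upper) on the 0-1 scale.
    """
    if total <= 0 or not 0 <= events <= total:
        raise ValueError("Require 0 <= events <= total and total > 0.")
    if events == 0 or events == total:
        # Continuity correction so the logit is defined at 0% or 100%.
        successes, failures = events + 0.5, (total - events) + 0.5
    else:
        successes, failures = float(events), float(total - events)
    logit = math.log(successes / failures)
    se = math.sqrt(1.0 / successes + 1.0 / failures)

    def inverse(x: float) -> float:
        return 1.0 / (1.0 + math.exp(-x))

    return inverse(logit), inverse(logit - z * se), inverse(logit + z * se)


# Invented example: a tool judged 80 of 100 reporting items correctly.
estimate, lower, upper = logit_ci(80, 100)
print(estimate, lower, upper)
```

Working on the logit scale keeps interval bounds inside 0 to 1, which matters when many tools report performance close to 100%.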

4. Discussion

4.1: Ethical and governance implications are insufficiently explored

Major missing themes:

AI replacing peer reviewers

editorial accountability

bias amplification

automated gatekeeping risks

transparency obligations in AI-assisted review

These issues are central to responsible AI deployment.

5. References and Citation Audit

5.1: Several references appear problematic or potentially unverifiable

The manuscript contains references that require verification because DOI, indexing, or stable retrieval evidence is unclear.

Potentially problematic references include:

Reference 16

“Chen D, Li P, Khoshkish E, et al.: AutoReporter...”

medRxiv. 2025; 2025.04.18.25326076.

Needs verification:

DOI not shown

unstable preprint identification format

unclear peer-review status

Reference 18

“He J, Rungta M, Koleczek D, et al.; Does prompt formatting have any impact on llm performance?. arXiv preprint arXiv:241110541.”

Problems:

arXiv identifier formatting appears incorrect

should likely contain decimal structure

title formatting inconsistent

capitalization inconsistent

This reference requires correction and verification.

Reference 23

“Kataoka Y, So R, Banno M, et al.: Large language models for automated PRISMA 2020 adherence checking. arXiv preprint arXiv:251116707. 2025.”

Potential issue:

arXiv identifier format likely invalid

Reference 24

“He Z, Bian M, Zhu J, et al.: Evaluating the Ability of Large Language Models...”

arXiv preprint arXiv:251113107. 2025.

Potential issue:

malformed arXiv identifier
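The format problem in References 18, 23 and 24 can be checked mechanically. The sketch below tests the three identifiers as printed against the new-style arXiv pattern (four digits, a decimal point, then four or five digits) and proposes a decimal form. The proposed form is only a formatting guess; it does not confirm that the record exists or that it matches the cited paper.

```python
import re
from typing import Optional

# New-style arXiv identifier: YYMM.NNNN or YYMM.NNNNN, optional version suffix.
NEW_STYLE = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def is_new_style_arxiv_id(identifier: str) -> bool:
    """True if the string matches the new-style arXiv identifier pattern."""
    return bool(NEW_STYLE.match(identifier))


def suggest_decimal_form(identifier: str) -> Optional[str]:
    """For 8 or 9 bare digits, propose inserting a decimal point after digit 4."""
    if re.fullmatch(r"\d{8,9}", identifier):
        return identifier[:4] + "." + identifier[4:]
    return None


# Identifiers as printed in References 18, 23 and 24.
for raw in ["241110541", "251116707", "251113107"]:
    print(raw, is_new_style_arxiv_id(raw), suggest_decimal_form(raw))
```

Each suggested identifier should still be looked up on arXiv and compared with the cited title and authors before the reference list is corrected.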

5.2: Inconsistent citation formatting

Several references show inconsistent formatting regarding:

title capitalization

journal style

DOI presentation

punctuation

URL presentation

The reference list requires full standardization.

Is the study design appropriate for the research question?

Partly

Is the rationale for, and objectives of, the study clearly described?

Yes

Are sufficient details of the methods provided to allow replication by others?

Partly

Are the datasets clearly presented in a useable and accessible format?

Not applicable

Reviewer Expertise: