Study Protocol

Artificial intelligence tools for automating assessments of reporting guideline adherence: a protocol for a systematic review

[version 1; peer review: 2 approved with reservations]
PUBLISHED 28 Apr 2026

This article is included in the Research on Research, Policy & Culture gateway.

Abstract

Background

Complete reporting of health-related research is necessary for users to understand, appraise, and apply research results appropriately. Reporting guidelines have been developed to support complete reporting. However, assessments of reporting guideline adherence remain inconsistent, time-consuming, and difficult to scale. Artificial intelligence (AI) tools, such as traditional natural language processing models and large language models, might provide a potential solution. While numerous AI tools have been developed, no comprehensive synthesis has been undertaken to investigate what they assess, how they are implemented and perform, and their potential utility.

Objective

This systematic review aims to synthesise the characteristics and findings of studies evaluating AI tools developed to assist or automate assessments of reporting guideline adherence.

Methods

We will search MEDLINE, Embase, Scopus, Europe PMC, ACM Digital Library, IEEE Xplore, arXiv and Cochrane Colloquium Abstracts, with no restrictions on date, language, or publication type. We will include studies that evaluate AI tools to assess adherence of health-related papers to any reporting guidelines. Two authors will independently screen records, extract data and assess risk of bias. We will extract study characteristics, AI tool details, how reporting guidelines are operationalised for AI assessment, AI implementation details, comparison details, and evaluation outcomes including agreement metrics, classification performance metrics, and utility indicators. We will present and summarise results through structured tables and plots, stratified by reporting guideline and AI tool type.

Discussion

This systematic review will provide a comprehensive synthesis of AI tools developed to automate assessments of reporting guideline adherence. It will provide interest holders with insights into what AI tools have been used, their implementation approaches, which AI tool types perform well, and any improvements that can be made to AI tools automating assessments of reporting guideline adherence in the future.

Keywords

Reporting guidelines, Artificial intelligence, Adherence

Introduction

Complete reporting of health-related research is necessary for users to understand, appraise, and apply research results appropriately. Reporting guidelines provide recommendations on what should be reported and why, and include exemplars of complete reporting to guide authors and other interest holders (e.g., peer reviewers, editors).1 Reporting guidelines have been developed for different types of research, such as PRISMA (preferred reporting items for systematic reviews and meta-analyses) for systematic reviews,2 CONSORT (consolidated standards of reporting trials) for randomised trials,3 TRIPOD (transparent reporting of a multivariable prediction model for individual prognosis or diagnosis) for prediction models,4 STROBE (strengthening the reporting of observational studies in epidemiology) for observational studies5 and STARD (standards for reporting of diagnostic accuracy studies) for diagnostic studies.6 Many of these “core” reporting guidelines have multiple extensions that provide additional reporting recommendations for specific aspects not covered in the core statement (e.g., types of outcomes, specific designs, analytic methods).

Routine assessments of reporting guideline adherence have been performed manually by authors, editors, and reviewers to judge whether reporting recommendations have been met. Because reporting guidelines do not specify criteria for evaluating adherence, researchers have had to develop their own assessment criteria and methods.7,8 Researchers must also decide whether to assess all checklist items/recommendations or only a subset, and meta-research studies suggest that most have chosen to focus on selected items.7,9,10 These decisions have led to considerable variability in what is assessed and how it is assessed.7,9,10 Also, manual evaluation is time-consuming and resource-intensive.11 Additionally, research questions such as what characteristics (e.g., time, discipline, journal) predict better or worse reporting are difficult to address at scale with a large body of literature using a manual evaluation approach. Therefore, more efficient, consistent, and scalable methods are needed.

Artificial intelligence (AI), defined as computational systems capable of performing tasks that typically require human intelligence, such as learning, reasoning, and decision-making, might provide a solution. Early attempts to automate assessments of reporting guideline adherence relied on traditional natural language processing (NLP) models. Examples include CONSORT-NLP,12 which combines rule-based and machine learning-based approaches to automatically complete the CONSORT checklist from randomised clinical trial reports, and SPIRIT-CONSORT-TM,13 an annotated corpus designed to train NLP models to automatically assess adherence to reporting recommendations in clinical trial protocols and result publications. However, these traditional NLP systems generally require substantial guideline-specific annotated datasets for development, and are applicable only to the particular guideline for which they were designed. Moreover, most systems focus on detecting local text segments, which could limit their utility for end-to-end evaluation in long research publications with multimodal data components (e.g., text, tables, and figures).

The advent of large language models (LLMs) and vision language models (VLMs), such as GPT and Gemini, provides another opportunity to scale up assessments of reporting guideline adherence. Trained on extensive data from articles, books and other online sources,14 these models are capable of processing complex data components, extracting information, summarising evidence, and generating outputs that are relevant to reporting guideline items. Several studies have used these models to assess reporting guideline adherence.15-17

However, the outputs of LLMs and VLMs are sensitive to how they are implemented. Data preprocessing, prompts, and model inference settings might all influence model performance on specific tasks. For example, empirical work has shown that different prompt templates and formatting can substantially influence LLM outputs, though advanced models (e.g., GPT-4 compared to GPT-3.5-turbo) may demonstrate more robustness to such variations.18 More importantly, because of the variability in assessment criteria and methods for evaluating adherence, researchers might use different prompts to ask subtly different questions for reporting guideline items (e.g., whether a guideline item is reported or whether it is reported adequately or fully). Additionally, even with identical prompts, fixed model parameters and a fixed random seed, models may occasionally generate different outputs across runs due to hardware-level randomness. This makes strict reproducibility difficult to achieve. The “black-box” nature of these models also limits transparency in the process of decision-making, and model hallucinations, although an area of active improvement, may also challenge reliability in high-stakes fields such as health-related research.

While numerous AI systems and prototypes have been developed to automate assessment of reporting guideline adherence,11-13,15-17 no comprehensive synthesis has been undertaken to investigate what they assess, how they are implemented and perform, and their potential utility in research and publication workflows.

Objective

This systematic review aims to summarise and synthesise the characteristics and findings of studies evaluating AI tools developed to assist or automate assessments of reporting guideline adherence.

Methods

We have reported this protocol in accordance with the Preferred Reporting Items for Systematic Review and Meta-Analysis Protocols (PRISMA-P) statement19 and with consideration of the methods items in the more recent PRISMA 2020 statement.2 We have not registered the review.

Eligibility criteria

  • Study designs

    We will include studies of any design that evaluate the performance of AI tools developed to assess adherence of health-related research papers to reporting guidelines. Eligible study designs include diagnostic accuracy studies, validation studies, and trials comparing AI tool and human performance, as well as methodological studies comparing different AI approaches. Studies will be included regardless of language, publication date, or publication type (e.g., journal article, conference proceeding).

  • Reporting guidelines

    We will include studies regardless of the reporting guideline evaluated, such as PRISMA, CONSORT, TRIPOD, STROBE, and STARD, and any of their extensions. By “reporting guideline”, we mean any document presenting reporting items that should appear in a research paper (regardless of whether presented as a checklist or structured text) and in which the authors explain how the items were developed.20

  • AI tools and comparator

    We will include any AI application, tool, or algorithm that (i) makes judgements about reporting guideline adherence, or (ii) identifies relevant text about reporting guideline adherence in a paper without making a judgement about adherence. Eligible systems could include any models that learn patterns from text with/without imaging data in the research papers, such as traditional natural language processing models (e.g., rule-based and BERT-like models) as well as LLMs and VLMs (e.g., GPT-5.2 and Gemini 3). We will include studies that compare AI tools with human assessment and studies that compare multiple AI tools with each other. Studies without an explicit comparator will also be eligible.

  • Outcomes

    We will include studies regardless of the outcomes assessed or reported. Outcomes of interest to this review include: (i) agreement (overall and for each item/recommendation) between the AI tool and human assessors using raw and chance corrected agreement metrics (e.g., Cohen’s kappa); (ii) classification performance (overall and/or for each item/recommendation) as determined using metrics such as accuracy, F1 score, sensitivity, specificity, positive and negative predictive values, and c-statistic; and (iii) utility indicators (e.g., task completion time, computational/API cost, and token usage across papers).
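As a rough illustration of outcome categories (i) and (ii), the sketch below computes raw agreement, Cohen's kappa, and common classification metrics from paired item-level judgements. It assumes binary judgements coded 1 (reported) and 0 (not reported), with the human assessment treated as the reference standard; the function names and coding scheme are hypothetical and are not taken from the protocol or its data extraction form. The c-statistic is omitted because it requires predicted probabilities rather than binary judgements.

```python
from collections import Counter


def cohen_kappa(ai, human):
    """Chance-corrected agreement between two sets of judgements on the same items."""
    if len(ai) != len(human) or not ai:
        raise ValueError("need two non-empty, equal-length lists")
    n = len(ai)
    observed = sum(a == h for a, h in zip(ai, human)) / n
    ai_counts, human_counts = Counter(ai), Counter(human)
    labels = set(ai) | set(human)
    # Agreement expected by chance, from each rater's marginal label frequencies.
    expected = sum(ai_counts[c] * human_counts[c] for c in labels) / (n * n)
    if expected == 1.0:
        return None  # undefined when both raters use one identical label throughout
    return (observed - expected) / (1 - expected)


def _ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def classification_metrics(ai, human, positive=1):
    """Classification performance of AI judgements against the human reference."""
    pairs = list(zip(ai, human))
    tp = sum(a == positive and h == positive for a, h in pairs)
    fp = sum(a == positive and h != positive for a, h in pairs)
    fn = sum(a != positive and h == positive for a, h in pairs)
    tn = sum(a != positive and h != positive for a, h in pairs)
    return {
        "accuracy": _ratio(tp + tn, len(pairs)),  # equals raw agreement
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical judgements for ten checklist items in one paper.
human = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
ai = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
print(cohen_kappa(ai, human))
print(classification_metrics(ai, human))
```

Metrics with a zero denominator are returned as `None` rather than forced to 0 or 1, so that undefined values stay visible when results are tabulated across items.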

Search methods

We will search bibliographic databases and supplementary sources for eligible studies. Databases include MEDLINE (via Ovid), Embase (via Ovid), Scopus, Europe PMC, ACM Digital Library, and IEEE Xplore. We will not limit searches by date, language, publication status or publication format (except for Europe PMC, which will be restricted to preprints). Europe PMC will be used to search across several preprint servers (e.g., medRxiv, bioRxiv, preprints.org, SSRN) and we will also search the arXiv preprint server, as it is not comprehensively covered by Europe PMC. Additional sources include the abstracts of the Cochrane Colloquium. The final part of the search will involve manual backward citation tracking and forward citation tracking using LENS.org for all studies included in the review.

An experienced information specialist (SM) designed the search strategies with input from the review team. The search includes terms related to the concepts of AI, adherence, and reporting. Several seed articles (based on articles known to the review team)11,13,15-17,21-24 were used to develop the MEDLINE search. The MEDLINE search was then translated and adapted for use in the other sources. The search strategy was iteratively tested to achieve an optimal balance between recall and precision. Full search strategies are available as Extended data (see Data availability section).25

Study selection

All records will first be deduplicated using the built-in functions of the reference management tools we will use (i.e., EndNote and Covidence). Two reviewers (out of MZ, SL, DPQC, JM, ML, LJ, KN, JO) will then independently screen all titles and abstracts. Records that are considered eligible or uncertain by either reviewer will undergo full-text screening, where those reviewers will independently assess the full text of potentially eligible records. Any disagreements will be resolved by discussion or by consulting a third reviewer. Title and abstract screening of bibliographic database records will be conducted using Covidence. For arXiv and Cochrane Colloquium Abstracts, a screening form will be created in Microsoft Excel with the link for each record and the search date.
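The title and abstract screening rule described above can be sketched as a small function. This is an illustration only: the decision labels ("eligible", "uncertain", "ineligible") and the function name are assumptions made for the example, and the review itself will use Covidence and Microsoft Excel rather than custom code.

```python
# Hypothetical decision labels; the protocol does not prescribe a coding scheme.
VALID_DECISIONS = {"eligible", "uncertain", "ineligible"}
ADVANCING = {"eligible", "uncertain"}


def advances_to_full_text(decision_a, decision_b):
    """Return True if a record proceeds to full-text screening.

    A record advances when either of the two independent reviewers judged it
    eligible or uncertain at the title and abstract stage.
    """
    for decision in (decision_a, decision_b):
        if decision not in VALID_DECISIONS:
            raise ValueError(f"unknown screening decision: {decision!r}")
    return decision_a in ADVANCING or decision_b in ADVANCING


# Hypothetical records screened by two reviewers.
screening = {
    "record-1": ("eligible", "ineligible"),
    "record-2": ("ineligible", "ineligible"),
    "record-3": ("uncertain", "uncertain"),
}
full_text_queue = [rid for rid, (a, b) in screening.items() if advances_to_full_text(a, b)]
print(full_text_queue)
```

Under this rule a single "eligible" or "uncertain" vote is enough to advance a record, so disagreements at this stage are settled in favour of inclusion and examined at full text.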

Data extraction

Two reviewers (out of MZ, SL, DPQC) will independently conduct the data extraction using a data extraction form (available as Extended data; see Data availability section).25 The data extraction form will be piloted by reviewers on a sample of included studies prior to the full data extraction process. Any discrepancies in the data collected between the two reviewers will be resolved via discussion or by consulting with a third reviewer (MJP or JEM). Data extraction will be conducted using a data extraction tool (REDCap version 15.5.30).26 Where necessary and available, additional sources will be consulted to supplement information extracted from the included studies, such as published study protocols, registry entries, or primary dataset documentation. If information remains missing or unclear, we will contact the study authors for further information. The information that will be extracted from each included study is provided in Table 1 (available as Extended data; see Data availability section).25

Quality assessment of included studies

To evaluate the quality of the included studies, two reviewers (out of MZ, SL, DPQC) will independently apply a defined set of quality indicators. These indicators are informed by two established tools, PROBAST+AI27 and the tool used in a living systematic review of AI tools for risk of bias assessment,28 which offer relevant concepts for assessing AI tools. The quality indicators will cover the following domains:

  • AI tool development

    Whether the AI tool was developed rigorously (e.g., adequate model training and prompt engineering).

  • Reference standard

    Whether the reference standard assessment was conducted rigorously (e.g., performed by trained assessors, assessed by at least two assessors independently with consensus procedures in place).

  • Independence of assessments and risk of data leakage

    Whether the AI tool was applied to the studies without knowledge of the reference standard assessment and vice versa; whether the AI tool’s final performance was evaluated on an independent test set that was not used for model training or prompt development/refinement; and whether there was a low risk that the annotations of the test corpus were part of the AI model’s training data.

  • Study planning

    Whether the study was based on a publicly available protocol or registration record.

Each indicator will be judged as low quality, high quality, or unclear quality. The quality assessment form is available as Extended data (see Data availability section).25 A study will be deemed high quality overall if all quality indicators are deemed high quality, low quality overall if at least one indicator is deemed low quality, and unclear quality overall if at least one indicator is deemed unclear quality but none is deemed low quality. Disagreements between reviewers will be resolved through discussion or adjudication by a third reviewer.
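The rule for deriving a study's overall quality from its indicator judgements can be expressed as a short function. The sketch below is illustrative: the indicator keys and the "high"/"low"/"unclear" labels are hypothetical shorthand, not the wording of the quality assessment form.

```python
VALID_JUDGEMENTS = {"high", "low", "unclear"}


def overall_quality(indicators):
    """Combine indicator-level judgements into one overall judgement.

    Any "low" indicator makes the study low quality overall; otherwise any
    "unclear" indicator makes it unclear; only all-"high" yields high quality.
    """
    judgements = list(indicators.values())
    if not judgements:
        raise ValueError("at least one indicator judgement is required")
    unknown = set(judgements) - VALID_JUDGEMENTS
    if unknown:
        raise ValueError(f"unknown judgement(s): {sorted(unknown)}")
    if "low" in judgements:
        return "low"
    if "unclear" in judgements:
        return "unclear"
    return "high"


# Hypothetical study with one unclear indicator and no low-quality indicators.
example = {
    "ai_tool_development": "high",
    "reference_standard": "high",
    "independence_and_leakage": "unclear",
    "study_planning": "high",
}
print(overall_quality(example))  # prints "unclear"
```

Checking for "low" before "unclear" matters: a study with both a low and an unclear indicator is low quality overall, which matches the rule that unclear applies only when no indicator is judged low.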

Data syntheses and analyses

Given the anticipated diversity in AI tools, reporting guidelines, study designs, and outcome measures, formal meta-analysis is unlikely to be feasible across all outcomes. We will therefore present and summarise results of each of the included studies through structured tables and plots.

We will use structured tables to present study characteristics, reporting guidelines assessed and scope, dataset characteristics, dataset sources and formats, reference annotation for datasets, AI tool details, application of the AI tool, AI implementation details, and comparison details. Tables will be organised by reporting guideline evaluated, and then by the type of AI tool (traditional NLP models versus LLM-based/VLM-based models).

We will then present AI tool performance and utility findings in tables organised by the reporting guideline evaluated, stratified by the type of AI tool and each outcome category (i.e., classification performance metrics, agreement metrics and utility indicators). Where multiple metrics are reported within the same outcome category, we will extract pre-specified metrics as detailed in the Data extraction form (available as Extended data; see Data availability section).25 Where preferred metrics are unavailable, we will consider and note the alternative metrics reported by the study authors. We will summarise outcomes at the overall level using descriptive statistics (e.g., mean, median, range across items) and also present the overall results in forest plots, stratified by reporting guideline and AI model. When item-level/recommendation-level outcomes are also available (e.g., classification performance metrics of adherence for each PRISMA item), we will summarise specific item-level results to facilitate performance interpretation using pre-specified rules, including the items with high and low performance (e.g., top and bottom five items for agreement metrics, accuracy and F1 score). When there are multiple results available for the same outcome across training, validation and test datasets, we will extract and summarise results identified by study authors as primary and/or the results from the most representative evaluation setting. In this circumstance, we will note that multiple results are available, and our reason for selecting the reported result.
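The item-level summary described above (descriptive statistics across items, plus the highest- and lowest-performing items) might look like the following sketch. The function name, the item labels, and the example F1 scores are invented for illustration, and the default of five items simply mirrors the "top and bottom five" example in the text.

```python
from statistics import mean, median


def summarise_items(scores, k=5):
    """Summarise one metric across checklist items.

    `scores` maps an item label to its metric value (e.g., F1 score).
    Returns descriptive statistics plus the k best and k worst items.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if k < 1:
        raise ValueError("k must be at least 1")
    values = list(scores.values())
    ascending = sorted(scores.items(), key=lambda pair: pair[1])
    return {
        "mean": mean(values),
        "median": median(values),
        "range": (min(values), max(values)),
        "top": ascending[::-1][:k],  # best-performing items first
        "bottom": ascending[:k],     # worst-performing items first
    }


# Hypothetical item-level F1 scores for four checklist items.
f1_by_item = {"item 1": 0.9, "item 2": 0.5, "item 3": 0.7, "item 4": 0.6}
summary = summarise_items(f1_by_item, k=2)
print(summary["range"], summary["top"], summary["bottom"])
```

When a checklist has fewer than 2k items, the top and bottom lists overlap, so a smaller k would be needed for short checklists.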

We will finally present and summarise the overall quality of studies by the reporting guideline evaluated, stratified by the type of AI tool.

Dissemination plan

We plan to disseminate the findings of this systematic review through publication in a peer-reviewed scientific journal. The final manuscript will include all methods, results, and interpretations arising from the review to support transparency and reproducibility. In addition to journal publication, we will present the key findings at relevant academic conferences and seminars to reach researchers and developers working in AI and reporting guidelines. We will also make our data extraction forms, summary tables, and analytical code publicly accessible to facilitate future research in this area.

Study status

This study is currently at the study selection stage.

Discussion

Complete reporting of health-related research is important for the usability and trustworthiness of research evidence. Reporting guidelines have been widely used to support complete reporting. However, assessments of reporting guideline adherence remain inconsistent, time-consuming, and difficult to scale. AI tools have the potential to address these limitations. As the AI field continues to evolve rapidly, a rigorous evidence synthesis is timely. This systematic review will be the first to comprehensively summarise and synthesise what AI tools have been developed to automate assessments of reporting guideline adherence. It will provide interest holders with insights into what AI tools have been used, their implementation approaches, which AI tool types perform well, and any improvements that can be made to AI tools automating assessments of reporting guideline adherence in the future.

how to cite this article
Zeng M, Liu S, Clark DP et al. Artificial intelligence tools for automating assessments of reporting guideline adherence: a protocol for a systematic review [version 1; peer review: 2 approved with reservations]. F1000Research 2026, 15:626 (https://doi.org/10.12688/f1000research.179775.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Reviewer Report 09 Jun 2026
Manuel Marques-Cruz, University of Porto, Porto, Porto District, Portugal 
Approved with Reservations
Dear authors,
First of all, I would like to congratulate you on a (unsurprisingly) well-designed protocol for a systematic review.
I do not see any major flaws with the study design you presented. However, I do have some ...
HOW TO CITE THIS REPORT
Marques-Cruz M. Reviewer Report For: Artificial intelligence tools for automating assessments of reporting guideline adherence: a protocol for a systematic review [version 1; peer review: 2 approved with reservations]. F1000Research 2026, 15:626 (https://doi.org/10.5256/f1000research.198323.r480816)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Reviewer Report 25 May 2026
Mbonigaba Celestin, Brainae University, Delaware, USA 
Approved with Reservations
1. Abstract
1.1: Absence of protocol registration creates transparency concern
The abstract omits the fact that the review was not prospectively registered. This omission weakens methodological transparency.
The manuscript later states:

“We have not ...
HOW TO CITE THIS REPORT
Celestin M. Reviewer Report For: Artificial intelligence tools for automating assessments of reporting guideline adherence: a protocol for a systematic review [version 1; peer review: 2 approved with reservations]. F1000Research 2026, 15:626 (https://doi.org/10.5256/f1000research.198323.r482643)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
