Evaluation of automated assessments of systematic review adherence to the PRISMA 2020 statement: study protocol

Matthew J Page; Evan Mayo-Wilson; Minyan Zeng; David PQ Clark; Daniel G Hamilton; Phi-Yen Nguyen; Barbara Nussbaumer-Streit; Xiangji Ying; Halil Kilicoglu; Joanne E McKenzie

doi:10.12688/f1000research.180216.1

Study Protocol

[version 1; peer review: 1 approved with reservations]

Matthew J Page¹, Evan Mayo-Wilson², Minyan Zeng¹, David PQ Clark¹, Daniel G Hamilton¹, Phi-Yen Nguyen¹, Barbara Nussbaumer-Streit³, Xiangji Ying², Halil Kilicoglu⁴, Joanne E McKenzie¹

PUBLISHED 04 May 2026

¹ Methods in Evidence Synthesis Unit, School of Public Health and Preventive Medicine, Monash University, Melbourne, Victoria, 3002, Australia
² Department of Epidemiology, University of North Carolina Gillings School of Global Public Health, Chapel Hill, NC, 27514, USA
³ Department for Evidence-based Medicine and Evaluation, University for Continuing Education Krems, Krems an der Donau, 3500, Austria
⁴ School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL, 61820, USA

Matthew J Page
Roles: Conceptualization, Funding Acquisition, Methodology, Project Administration, Resources, Supervision, Writing – Original Draft Preparation, Writing – Review & Editing

Evan Mayo-Wilson
Roles: Methodology, Writing – Review & Editing

Minyan Zeng
Roles: Methodology, Writing – Review & Editing

David PQ Clark
Roles: Methodology, Writing – Review & Editing

Daniel G Hamilton
Roles: Methodology, Writing – Review & Editing

Phi-Yen Nguyen
Roles: Methodology, Writing – Review & Editing

Barbara Nussbaumer-Streit
Roles: Methodology, Writing – Review & Editing

Xiangji Ying
Roles: Methodology, Writing – Review & Editing

Halil Kilicoglu
Roles: Methodology, Writing – Review & Editing

Joanne E McKenzie
Roles: Conceptualization, Methodology, Writing – Review & Editing

This article is included in the Research on Research, Policy & Culture gateway.

Abstract

Background

The Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) 2020 statement aims to help users write complete and accurate systematic review reports. Peer-reviewers and meta-researchers often assess adherence of systematic reviews to PRISMA 2020. This process can be time consuming, particularly when evaluating many systematic reviews. Automated approaches using large language models (LLMs) have the potential to accelerate this process and produce more comprehensive assessments.

Objective

To evaluate the performance of LLMs prompted to undertake comprehensive assessments of adherence to PRISMA 2020.

Methods

We will conduct a validation study using a diagnostic test accuracy analysis framework. We will assemble a sample of 200 published systematic reviews assessing the effects of interventions on human health, which will be divided into a training set and a test set. To assess adherence, we will reframe 95 reporting elements in PRISMA 2020 (which provide granular reporting recommendations for 41 checklist items) into one or more questions. We will iteratively develop few-shot prompts for use in several LLMs and refine them using systematic reviews included in the training set. Final LLM prompts will be applied to all systematic reviews in the test set and LLM responses compared with consensus responses of two humans (our reference standard). We will estimate the performance of each LLM assessment against the reference standard by calculating percentage agreement, Gwet’s Agreement Coefficient, sensitivity, specificity, positive predictive value, negative predictive value, and the F1 score.

Conclusion

We will determine which questions in our comprehensive tool for assessing adherence to PRISMA 2020 can be accurately automated by LLMs. This knowledge will help inform which questions need the most human oversight by meta-researchers, peer reviewers and other interest holders seeking to assess adherence.

Keywords

artificial intelligence; systematic review; reporting quality; PRISMA

Corresponding author: Matthew J Page

Competing interests: MJP, EMW and JEM are authors of the PRISMA 2020 statement but have no commercial interest in the use of the guideline. The remaining authors declare no competing interests.

Grant information: Monash University Early Career Research Excellence Program (ECREP) grant; National Health and Medical Research Council Investigator Grant GNT2033917; National Health and Medical Research Council Investigator Grant GNT2009612.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2026 Page MJ et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Page MJ, Mayo-Wilson E, Zeng M et al. Evaluation of automated assessments of systematic review adherence to the PRISMA 2020 statement: study protocol [version 1; peer review: 1 approved with reservations]. F1000Research 2026, 15:665 (https://doi.org/10.12688/f1000research.180216.1) First published: 04 May 2026, 15:665 (https://doi.org/10.12688/f1000research.180216.1) Latest published: 04 May 2026, 15:665 (https://doi.org/10.12688/f1000research.180216.1)

Introduction

Systematic reviews are used to synthesise evidence addressing particular research questions, which can inform health care decision making and policy, as well as guide future research priorities. However, the effort and resources underlying a systematic review are wasted if authors do not report completely and accurately what methods they used, what they found and to what populations and settings the findings apply.¹ For example, incomplete reporting of the characteristics of included studies can prevent clinicians from judging whether the review findings apply to patients they see, hindering the formulation of appropriate recommendations. Incomplete reporting can also impede efforts by researchers to assess the rigour of the methods employed, replicate the methods used, or verify and update the review.²

Reporting guidelines are designed to help authors ensure that research reports are complete and accurate. They typically comprise a checklist providing recommendations on what to report in a specific type of article, along with explanatory text and exemplars of reporting.³ The most widely used reporting guideline for systematic reviews is the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) statement, published originally in 2009⁴ and updated in 2020.⁵,⁶ PRISMA 2020 includes 42 items (counting the sub-items, such as 10a and 10b, separately), each of which provides a synopsis of what should be reported for a particular aspect of the review. Under each item sit one or more elements (183 in total), which provide detailed reporting recommendations within the item.

Various interest holders have used PRISMA 2020 as a tool to evaluate the completeness of reporting of published systematic reviews. Meta-researchers have undertaken studies to assess whether published systematic reviews adhere to PRISMA 2020, including investigating which recommendations are frequently not followed and factors that may predict adherence (e.g. time, clinical discipline, journal characteristics).⁷ Journal editors and peer reviewers perform assessments of PRISMA 2020 adherence to flag missing and unclear details that require rectification in manuscripts prior to publication.⁸

To date, assessments of adherence to PRISMA 2020 recommendations have relied on manual coding of systematic reviews, which is time consuming and resource intensive. Consequently, prior meta-research studies have been limited in size and scope (e.g. restricted to a particular intervention or condition).⁷ Furthermore, almost all studies have assessed adherence to PRISMA 2020 at the item-level, rather than at the more detailed element-level. Automated assessments of PRISMA 2020 adherence have the potential to scale up research about completeness of reporting, enabling evaluation of large numbers of reviews across interventions, conditions and time. Furthermore, automation has the potential to enable more comprehensive and granular assessments of adherence, facilitate rapid assessment during the peer review process, and provide insights into modifiable factors that might enhance reporting.

We are aware of three studies in which investigators used large language models (LLMs) to automate the assessment of systematic review adherence to PRISMA 2020⁹⁻¹¹; however, these studies have limitations. In all studies, the tool used to assess adherence was the PRISMA 2020 item-level checklist. However, the checklist is designed to guide authors on what to report in a systematic review, not on how readers of a systematic review should assess adherence, and so assessments across the studies using this instrument – by both humans and LLMs – were not standardised. Furthermore, the items provide a high-level summary of various reporting elements, so the LLM assessments in these studies were themselves high-level and did not capture adherence to the more detailed recommendations. In all studies, zero-shot prompting was performed (i.e. no explanations of relevant concepts or examples of optimal reporting were provided), which might have hampered performance. Finally, the samples of systematic reviews evaluated were restricted to particular health fields (acupuncture, emergency medicine, rehabilitation and ophthalmology).

To address these limitations, we aim to evaluate the performance of several LLMs prompted to undertake comprehensive assessments of adherence to PRISMA 2020 using an adherence tool, in a sample of systematic reviews selected regardless of the type of intervention or health field. We will perform few-shot prompting (i.e. include in the prompts examples of optimal reporting for each PRISMA recommendation) to guide model outputs.¹²

Methods

Overview

We will follow the RAISE (Responsible use of AI in evidence SynthEsis) guidance on building and evaluating AI evidence synthesis tools.¹³ To evaluate whether LLMs can accurately assess the adherence of systematic reviews to PRISMA 2020, we will conduct a validation study using a diagnostic test accuracy analysis framework. We will assemble a sample of published systematic reviews assessing the effects of interventions on human health, which will be divided into a training set and a test set (also known as a held-out set). We will design few-shot prompts for use in several LLMs and evaluate and refine the prompts using systematic reviews included in the training set. Once we are satisfied with the prompts, we will make no further changes to them and will apply them to all systematic reviews in the test set. Two human assessors will independently assess all reviews in the test set manually using our PRISMA 2020 adherence tool (PRISMA-Check), which will include the same guidance and examples as those provided in the LLM prompts. We will compare performance of LLM assessments with consensus human assessments (our reference standard).

Identification and selection of systematic reviews

We will assemble a random sample of published systematic reviews that:

• Meet the PRISMA 2020 definition of a systematic review, that is, a review in which explicit, systematic methods are used to collate and synthesise studies that address a clearly formulated question.⁵ Systematic reviews will be eligible irrespective of whether they present a synthesis (e.g. meta-analysis), a structured summary (e.g. tabular or narrative description) of study findings, or both;
• Include studies evaluating the effects of one or more interventions on human health, irrespective of the type of study design (e.g. randomized trial, cohort study), outcome (e.g. continuous, binary) and effect measure used to quantify the intervention effect (e.g. mean difference, risk ratio);
• Were written in English and indexed in PubMed Central’s Open Access Subset¹⁴ in September 2025 (the month prior to study initiation).

To identify eligible systematic reviews, we will run the following search in PubMed: systematic [sb] AND pubmed pmc open access [filter] AND 2025/09/01:2025/09/30[EDAT]. The “systematic [sb]” component runs a search strategy designed to retrieve citations to systematic reviews in PubMed.¹⁵ All citations retrieved from the search will be exported into EndNote reference management software¹⁶ and sorted by record number (a unique identifier that EndNote automatically assigns to each record as it is added to a library). The first 500 sorted records will be imported into Covidence¹⁷ and one author (DPQC) will screen the titles and abstracts of each record. The same author will retrieve any potentially relevant full text reports (manually or via the automated article retrieval feature in EndNote) and screen each report. This step will be repeated as many times as needed until a target of 200 eligible systematic reviews is identified. Any uncertainties about eligibility will be discussed with the principal investigator (MJP). Prior to commencing each screening stage, two authors (DPQC and MJP) will independently pilot the screening process on 50 abstracts and 20 full text reports, respectively.

For each of the 200 included reviews, we will save the full-text HTML file and retrieve all associated supplementary materials in any format available. Each of the included reviews will be assigned a unique identifier based on its PubMed Central ID (PMCID). We will then use the =RAND() function in Microsoft Excel to assign a random number to each review and sort them by random number, with the first 100 being set aside for potential inclusion in the training set and the second 100 being included in the test set. We have selected a sample size of 100 systematic reviews for the test set to balance feasibility and precision. This sample size allows us to restrict the width of a 95% two-sided Wald-type normal confidence interval around the estimated percentage agreement between human and LLM responses to a maximum of 20% (i.e. precision ±10%), assuming a percentage agreement of 50%. For a percentage agreement of less (or greater) than 50%, the absolute width will be smaller; for example, an estimated percentage agreement of 90% would yield a confidence interval width of 12%.
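The precision figures quoted above can be checked with a few lines of Python. This is a minimal sketch, assuming the standard Wald formula for a proportion (full width = 2 × z × √(p(1 − p)/n)); the function name is ours and is not part of the protocol.

```python
import math


def wald_ci_width(p: float, n: int, z: float = 1.959964) -> float:
    """Full width of a two-sided Wald-type normal CI for a proportion p with n observations."""
    return 2 * z * math.sqrt(p * (1 - p) / n)


# Widths for the two scenarios described in the text (test set of 100 reviews).
width_at_50 = wald_ci_width(0.5, 100)  # about 0.196, i.e. roughly 20%
width_at_90 = wald_ci_width(0.9, 100)  # about 0.118, i.e. roughly 12%
print(f"{width_at_50:.3f} {width_at_90:.3f}")
```

The width is largest at a percentage agreement of 50% because p(1 − p) peaks there, which is why that value gives the worst-case precision.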

To identify examples that could be included in few-shot prompts, we will assemble a sample of Cochrane reviews. These reviews will be chosen because they are more likely to adhere to PRISMA 2020 than non-Cochrane reviews for two primary reasons. Cochrane reviews are undertaken using the software Review Manager (RevMan), which includes a standard manuscript template (i.e. with section headings such as “Criteria for considering studies for this review” and “Synthesis methods”), and, additionally, guidance drawn from PRISMA 2020 on what to report in each of these sections. Also, Cochrane editors assess manuscripts to ensure they adhere to PRISMA 2020 items. We will identify Cochrane reviews indexed in PubMed Central’s Open Access Subset by running the following search in PubMed: (pubmed pmc open access [filter] AND cochrane database syst rev [so]) NOT protocol. One author (MJP) will screen records directly from the PubMed interface and retrieve the full text of the 50 most recently published Cochrane intervention reviews.

Development of tool used to assess adherence to PRISMA 2020 (PRISMA-Check)

We are currently developing PRISMA-Check, a tool to assess adherence to PRISMA 2020, which reframes each reporting element into one or more questions. For example, the following element for the item on the data collection process, “Report how many reviewers collected data from each report, whether multiple reviewers worked independently or not, and any processes used to resolve disagreements between data collectors”, has been reframed into four questions:

• 9.1a. Did the authors report how many reviewers collected data from each report?
• 9.1b. If yes to 9.1a, did they report that multiple reviewers collected data from each report?
• 9.1c. If yes to 9.1b, did they report whether reviewers worked independently or not?
• 9.1d. If yes to 9.1b, did they report any processes used to resolve disagreements between data collectors?

All questions have “Yes/No” response options, with some also having a “Not applicable” option. Responses to questions map to element- and item-level responses (specifically, if the answer to all applicable questions is “Yes”, the element will be rated as “reported”, and if all applicable elements are rated as “reported”, the item will be rated as “reported”). Each question will be accompanied by guidance on when to select each response option, along with examples of complete reporting. Examples will be sourced from the exemplars presented in the supplement to the PRISMA 2020 explanation and elaboration paper⁶ and the 50 Cochrane reviews retrieved. We will modify real examples if they are missing some details or lack clarity, and, if necessary, will use an LLM (Gemini 3) to invent examples when none exist in our sources. One investigator (DPQC) will source examples, and each will be verified for appropriateness by the principal investigator (MJP).
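The mapping from question responses to element- and item-level ratings could be sketched as follows. The function names are hypothetical, and one detail is our assumption rather than something the protocol states: an element (or item) with no applicable questions (or elements) is treated as not applicable and returned as None.

```python
from typing import List, Optional


def rate_element(answers: List[str]) -> Optional[str]:
    """Rate one element from its question answers ("Yes", "No" or "Not applicable")."""
    applicable = [a for a in answers if a != "Not applicable"]
    if not applicable:
        return None  # assumption: nothing applicable means the element is not applicable
    return "reported" if all(a == "Yes" for a in applicable) else "not reported"


def rate_item(element_ratings: List[Optional[str]]) -> Optional[str]:
    """Rate one item from its element ratings, ignoring not-applicable elements (None)."""
    applicable = [r for r in element_ratings if r is not None]
    if not applicable:
        return None  # assumption: same convention as for elements
    return "reported" if all(r == "reported" for r in applicable) else "not reported"


# Questions 9.1a to 9.1d: everything reported except the process for resolving disagreements.
element_9_1 = rate_element(["Yes", "Yes", "Yes", "No"])
print(element_9_1)
```

Under this rule a single "No" on an applicable question is enough to rate the element, and hence the item, as not reported.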

PRISMA-Check currently includes 315 questions addressing all elements (n = 171) corresponding to 41 of the 42 PRISMA 2020 items; the item focusing on the systematic review abstract is excluded. For reasons of feasibility, in this project we will assess 200 questions addressing a subset of the elements (n = 95) that correspond to the 41 items, with at least one element per item being assessed. The omitted subset of elements consists of those assumed by the principal investigator (MJP) to be difficult to assess by both humans and LLMs because, for example, there is more subjectivity in the judgement. In future, we will develop methods to automate the assessment of all questions in PRISMA-Check.

Text and non-text preprocessing of systematic reviews

Rather than directly uploading heterogeneous source files (text and figure) into an LLM interface, we will pre-process all source files for the systematic reviews into a structured format before submitting materials to the LLM via an API. The main text of the articles in HTML format will be parsed and standardised into a unified JavaScript Object Notation (JSON) structure with embedded figures encoded as URL objects. The main Python packages used in this pipeline will include “beautifulsoup4” and “html-to-markdown”. Supplementary files available in various file formats, such as PDF, DOCX, XLSX/CSV, and standalone figure files (TIFF, PNG, JPG), will be converted through format-specific pipelines into the same unified JSON structure. If all supplementary materials appear in a single PDF, we will use the Adobe PDF Services API (“adobe-pdfservices-sdk”) to extract each element and then the text will be standardised into the unified JSON format with figures encoded as base64 data. Text in the DOCX and XLSX/CSV files will be processed through format-specific pipelines with figures similarly encoded as base64 data. The main Python packages used in this pipeline will include “python-docx”, “openpyxl” and “base64”. Standalone supplementary figure files will also be encoded as base64 data. For figure files in unsupported formats (e.g. TIFF), the files will be first converted to PNG using the Python package “Pillow” and then encoded as base64 data. Following individual document conversion, the main article JSON file and all associated supplementary JSON files will be consolidated into the multimodal prompt format with textual and visual components indicated as explicit “input_text” and “input_image” elements presented in the order in which they appear in the documents. 
To assess the performance of the pipeline, JSON files will be generated for 15 systematic reviews randomly selected from the training set and will be reviewed and cross-checked against the source files by one investigator (MYZ).
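The final consolidation step might look something like the sketch below. The intermediate block schema ("kind", "text", "url"), the function names and the example URL are all hypothetical; only the "input_text" and "input_image" element names and the use of base64 come from the description above, and the exact payload fields expected by a given LLM API are an assumption.

```python
import base64
from typing import Dict, List


def encode_image(image_bytes: bytes, mime: str = "image/png") -> str:
    """Return a base64 data URL for raw image bytes."""
    payload = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{payload}"


def consolidate(documents: List[List[Dict[str, str]]]) -> List[Dict[str, str]]:
    """Flatten per-document blocks into one ordered list of prompt elements."""
    parts: List[Dict[str, str]] = []
    for blocks in documents:  # main article first, then each supplementary file
        for block in blocks:  # blocks stay in the order they appear in the document
            if block["kind"] == "text":
                parts.append({"type": "input_text", "text": block["text"]})
            elif block["kind"] == "figure":
                parts.append({"type": "input_image", "image_url": block["url"]})
            else:
                raise ValueError(f"Unknown block kind: {block['kind']}")
    return parts


main_article = [
    {"kind": "text", "text": "Methods: two reviewers independently collected data."},
    {"kind": "figure", "url": "https://example.org/figure1.png"},  # hypothetical URL
]
supplement = [
    {"kind": "text", "text": "Supplementary Table 1: search strategy."},
    {"kind": "figure", "url": encode_image(b"placeholder image bytes")},
]
prompt_parts = consolidate([main_article, supplement])
```

Keeping text and figures in document order preserves the link between a figure and the text that refers to it, which matters for questions about items such as the flow diagram.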

Prompt engineering

Following OpenAI’s best practice guidance,¹⁸ we will develop few-shot prompts to guide the LLMs to answer each question selected from the PRISMA-Check tool. Each prompt will include a task description and the questions, guidance on when to select each response option and examples of complete reporting that appear in PRISMA-Check. We will instruct the model to provide for each question a response, relevant quotes appearing in the systematic review, and a rationale for the response. Prompts will be developed and optimised using GPT-5.4 first, and then further tested on an open-weight model (Qwen3.6 Plus).
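As an illustration of how such a prompt could be assembled from PRISMA-Check content, consider the sketch below. The field names, the guidance wording and the example sentence are invented for illustration; the actual prompts will be developed iteratively and may be structured quite differently.

```python
from typing import Any, Dict, List


def build_prompt(task_description: str, questions: List[Dict[str, Any]]) -> str:
    """Assemble a few-shot prompt: task, then each question with guidance and examples."""
    lines = [task_description, ""]
    for q in questions:
        lines.append(f"Question {q['id']}: {q['text']}")
        lines.append(f"Response options: {', '.join(q['options'])}")
        lines.append(f"Guidance: {q['guidance']}")
        for i, example in enumerate(q["examples"], start=1):
            lines.append(f"Example {i} of complete reporting: {example}")
        lines.append("")
    lines.append(
        "For each question, provide a response, relevant quotes appearing in the "
        "systematic review, and a rationale for the response."
    )
    return "\n".join(lines)


question_9_1a = {
    "id": "9.1a",
    "text": "Did the authors report how many reviewers collected data from each report?",
    "options": ["Yes", "No"],
    # Hypothetical guidance and example, not taken from PRISMA-Check.
    "guidance": "Select Yes if the number of reviewers who collected data is stated.",
    "examples": ["Two reviewers collected data from each report."],
}
prompt = build_prompt("Assess adherence to PRISMA 2020.", [question_9_1a])
```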

We will use an iterative approach to prompt development. One investigator (MYZ) will draft prompts and evaluate them on 15 systematic reviews in the training set. Two investigators (DPQC, MYZ) will then independently validate the responses made by the LLMs. Each investigator will consider the LLM’s output for each question asked and cross-check the output against the systematic review report and evaluate whether each response by the LLM was correct or not. Discrepancies will be resolved via discussion or, if necessary, via consultation with another investigator (MJP). The investigators will then identify key areas for improvement to the prompts and the PRISMA-Check tool, including to the wording of the questions, guidance on when to select each response option and the examples of complete reporting. We will revise the prompts accordingly and evaluate them on the 15 reviews previously evaluated and another 15 reviews in the training set. Part of this process will involve exploring how to ask questions most efficiently, for example by comparing accuracy and cost when using one lengthy prompt including all questions versus using 41 prompts asking the questions corresponding to each item. We will continue this revision-evaluation cycle on as many systematic reviews in the training set as is necessary for us to be confident that additional prompt refinements are unlikely to meaningfully improve response accuracy.

Manual assessment of systematic reviews

Each of the 100 systematic reviews in the test set will be assessed independently by two human assessors (DPQC will assess all 100 reviews while MYZ, DGH, BNS, PYN and MJP will each assess 20 reviews). Assessors will record several characteristics of the systematic reviews, including the journal that published the review, country of corresponding author, source of funding for the review, type of population evaluated (ICD-11 category), and type of intervention evaluated (drug, non-drug, or both). Assessors will then complete the PRISMA-Check tool, which will be administered via an online data collection tool (REDCap version 15.5.30¹⁹) and include the same questions, guidance and examples that appear in the final version of the LLM prompts. Investigators will extract relevant quotes or record the relevant table or figure title as supporting evidence for all questions. Investigators will be asked to consult only the systematic review article and supplementary file(s) provided to them and not to consult the review protocol (if cited) or any external website cited as hosting additional materials. Any discrepancies in responses will be resolved via discussion or, if necessary, via consultation with another investigator (MJP or JEM). Prior to undertaking manual assessments, investigators will independently pilot the assessment process on five systematic reviews to familiarise investigators with the PRISMA-Check tool.

LLM assessment of systematic reviews

We will evaluate our prompting approaches using the 100 systematic reviews included in the test set on three proprietary LLMs (GPT-5.4, Gemini 3.1 Pro Preview, and Claude Opus 4.6, which are the best performing LLMs as of April 2026 according to Artificial Analysis’ LLM Leaderboard²⁰) and one of the best performing open-weight LLMs (Qwen3.6 Plus). Each of the LLMs will assess PRISMA 2020 adherence for each systematic review using the few-shot prompts developed during the training phase and will be instructed to generate a JSON file providing for each question a response, relevant quotes appearing in the systematic review, and a rationale for the response. We will adjust the default LLM parameters to minimise randomness (e.g. set the temperature to 0) to ensure the model adheres closely to the instructions and minimises improvisation. To measure resource use of each LLM we will record the number of tokens, cost and generation time for each systematic review assessed.
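Because the LLM output is requested as JSON, it can be checked mechanically before analysis. The sketch below assumes a hypothetical schema in which each question identifier maps to an object with "response", "quotes" and "rationale" fields; the protocol does not specify the schema, so the field names and the function are illustrative only.

```python
import json
from typing import Dict, List

ALLOWED_RESPONSES = {"Yes", "No", "Not applicable"}
REQUIRED_FIELDS = ("response", "quotes", "rationale")


def parse_assessment(raw: str, question_ids: List[str]) -> Dict[str, dict]:
    """Parse one LLM JSON output and check that every question has a valid entry."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("Top-level JSON value must be an object")
    problems = []
    for qid in question_ids:
        entry = data.get(qid)
        if not isinstance(entry, dict):
            problems.append(f"{qid}: missing entry")
            continue
        for field in REQUIRED_FIELDS:
            if field not in entry:
                problems.append(f"{qid}: missing field '{field}'")
        if entry.get("response") not in ALLOWED_RESPONSES:
            problems.append(f"{qid}: invalid response {entry.get('response')!r}")
    if problems:
        raise ValueError("; ".join(problems))
    return {qid: data[qid] for qid in question_ids}


raw_output = json.dumps({
    "9.1a": {
        "response": "Yes",
        "quotes": ["Two reviewers collected data from each report."],
        "rationale": "The number of reviewers is stated.",
    }
})
assessment = parse_assessment(raw_output, ["9.1a"])
```

Rejecting malformed outputs at this stage keeps missing or out-of-vocabulary responses from being silently counted as disagreements with the reference standard.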

Data analysis

All analyses will be conducted using Python.²¹ We will calculate frequencies and percentages to summarise systematic review characteristics and median and interquartile range to summarise token counts, costs and generation time of each LLM. We will employ a diagnostic test accuracy analysis framework using the consensus response between the two human assessors as the reference standard. We will estimate the performance of each of the LLM assessments against the reference standard by calculating the following metrics: percentage agreement, Gwet’s Agreement Coefficient (AC),²² sensitivity, specificity, positive predictive value, negative predictive value, and F1 score¹³ (see Tables 1 and 2 for terminology and descriptions of each metric). These performance metrics will be calculated at the element level, at the item level, and as overall summaries across all elements and across all items. The overall performance estimates will be used to inform our conclusions on which LLMs (or combinations of LLM judgements) are acceptable. Prior to data analysis, we will specify thresholds that define acceptable performance for each metric. These thresholds will be determined through discussion with the authorship team. The performance estimates at the element level and item level will be used to identify elements and items where more prompting or examples may be required in future. We will calculate 95% confidence intervals for each of the above metrics. Confidence intervals for performance metrics calculated across all elements and across all items will be obtained using the bootstrapping percentile method with 5000 replications. We will allow for clustering of observations within systematic reviews by resampling the entire review.
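The clustered bootstrap described above can be sketched with the Python standard library. The data layout (a dictionary mapping each review identifier to a list of human/LLM judgement pairs), the function names, the fixed seed and the simple index-based percentile rule are our assumptions; the planned analysis may use a different implementation.

```python
import random
from typing import Callable, Dict, List, Tuple

Pair = Tuple[bool, bool]  # (human consensus says reported, LLM says reported)


def percentage_agreement(pairs: List[Pair]) -> float:
    """Proportion of judgements on which the LLM matches the human consensus."""
    return sum(human == llm for human, llm in pairs) / len(pairs)


def cluster_bootstrap_ci(
    by_review: Dict[str, List[Pair]],
    metric: Callable[[List[Pair]], float],
    n_reps: int = 5000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI, resampling whole reviews to respect clustering."""
    rng = random.Random(seed)
    review_ids = list(by_review)
    estimates = []
    for _ in range(n_reps):
        # Draw reviews with replacement; all judgements of a drawn review stay together.
        sampled = rng.choices(review_ids, k=len(review_ids))
        pooled = [pair for rid in sampled for pair in by_review[rid]]
        estimates.append(metric(pooled))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_reps)]
    upper = estimates[min(n_reps - 1, int((1 - alpha / 2) * n_reps))]
    return lower, upper


toy_data = {
    "PMC1": [(True, True), (True, False), (False, False)],
    "PMC2": [(True, True), (False, False)],
    "PMC3": [(False, True), (True, True)],
}
ci = cluster_bootstrap_ci(toy_data, percentage_agreement, n_reps=1000)
```

Metrics with a conditional denominator, such as sensitivity, can be undefined in a resample that happens to contain no "reported" judgements, so a fuller implementation would need a rule for those replicates.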

Table 1. 2x2 (confusion) table defining terminology used in calculating performance metrics (see Table 2).

Reference standard (human consensus judgement)	LLM: Reported	LLM: Not reported
Reported	TP	FN
Not reported	FP	TN

Table 2. Interpretation of performance metrics (adapted from Thomas et al.¹³).

Term	Interpretation in the context of current study
Percentage agreement	Measures the proportion of elements/items that are correctly classified by the LLM as reported or not reported relative to the human consensus judgements. Higher percentage agreement indicates that humans and LLMs are consistently producing the same judgement. The formula is: (TP + TN)/(TP + TN + FP + FN) (see Table 1 for definitions)
Gwet’s Agreement Coefficient (AC)	Measures the proportion of elements/items that are correctly classified by the LLM as reported or not reported relative to the human consensus judgements (corrected for chance)
Sensitivity (recall)	Measures the proportion of elements/items judged as reported by the human consensus that are correctly classified by the LLM as reported. Higher sensitivity indicates that fewer elements/items are incorrectly identified by the LLM as not reported. The formula is: TP/(TP + FN)
Specificity	Measures the proportion of elements/items judged as not reported by the human consensus that are correctly classified by the LLM as not reported. Higher specificity indicates that fewer elements/items are incorrectly identified by the LLM as reported. The formula is: TN/(TN + FP)
Positive predictive value (precision)	Measures the proportion of elements/items that an LLM identified as reported that are actually reported. Higher positive predictive value indicates that when the LLM identifies an element/item as reported, it is likely to be correct. The formula is: TP/(TP + FP)
Negative predictive value	Measures the proportion of elements/items that an LLM identified as not reported that are actually not reported. Higher negative predictive value indicates that when the LLM identifies an element/item as not reported, it is likely to be correct. The formula is: TN/(TN + FN)
F1 score	Harmonic mean of sensitivity and precision. Higher F1 scores indicate better overall accuracy in identifying true positives while minimising false positives and false negatives. The formula is: 2 × (Sensitivity × Precision) / (Sensitivity + Precision)
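The formulas in Table 2 translate directly into code. In the sketch below, the function name and the example counts are ours. For Gwet's coefficient we assume the unweighted two-rater, two-category form (AC1), in which chance agreement is 2π(1 − π) and π is the mean proportion of "reported" classifications across the human consensus and the LLM; the protocol does not state which variant will be used.

```python
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Divide, returning NaN when the denominator is zero (metric undefined)."""
    return numerator / denominator if denominator else float("nan")


def performance_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute the Table 2 metrics from the cells of the 2x2 table in Table 1."""
    n = tp + fp + fn + tn
    agreement = _ratio(tp + tn, n)
    sensitivity = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    # Gwet's AC1 (assumed variant): chance agreement from the mean "reported" prevalence.
    pi = ((tp + fn) / n + (tp + fp) / n) / 2
    chance = 2 * pi * (1 - pi)  # at most 0.5, so 1 - chance is never zero
    return {
        "percentage_agreement": agreement,
        "gwet_ac1": (agreement - chance) / (1 - chance),
        "sensitivity": sensitivity,
        "specificity": _ratio(tn, tn + fp),
        "ppv": precision,
        "npv": _ratio(tn, tn + fn),
        "f1": _ratio(2 * sensitivity * precision, sensitivity + precision),
    }


# Illustrative counts only, not study data.
metrics = performance_metrics(tp=60, fp=5, fn=10, tn=25)
```

With these illustrative counts the percentage agreement is 0.85, while the chance-corrected coefficient is lower, which shows why both are reported.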

To generate each of the above metrics we will perform the following comparisons:

• GPT-5.4 response versus human consensus response;
• Gemini 3.1 Pro Preview response versus human consensus response;
• Claude Opus 4.6 response versus human consensus response;
• Qwen3.6 Plus response versus human consensus response;
• Response selected by at least three of the LLMs versus human consensus response;
• Response selected by all four of the LLMs versus human consensus response.
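The last two comparisons require a combined LLM response for each question. A minimal sketch follows, assuming that a question has no combined response when the vote threshold is not met (for example, a two-versus-two split); how such cases will be handled is not specified in the protocol, and the function name is hypothetical.

```python
from collections import Counter
from typing import List, Optional


def llm_consensus(responses: List[str], min_votes: int) -> Optional[str]:
    """Return the response chosen by at least min_votes LLMs, or None if there is none."""
    label, votes = Counter(responses).most_common(1)[0]
    return label if votes >= min_votes else None


# Responses from the four LLMs to one question.
votes = ["Yes", "Yes", "Yes", "No"]
at_least_three = llm_consensus(votes, min_votes=3)
all_four = llm_consensus(votes, min_votes=4)
```

With four models and a threshold of three or four, at most one response can meet the threshold, so taking the most common response is sufficient.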

We will also investigate discrepancies between LLM responses and the human reference standard by systematically comparing the supporting evidence extracted by both LLMs and humans. This analysis will explore potential sources of disagreement, including differences in reasoning approaches or use of different supporting evidence. We acknowledge the inherent limitations of treating the human consensus responses as our reference standard and recognise that they do not constitute a perfect gold standard. Consequently, we will critically examine cases where LLMs diverge from human responses to determine whether the model outputs may, in some instances, be equally or more accurate.

Dissemination plan

We will disseminate the findings of this study via publication in a peer-reviewed scientific journal and at relevant academic conferences and seminars to reach researchers, journal editors, and developers of reporting guidelines and LLMs. We will make LLM prompts, the clean dataset, and analytic code publicly accessible to facilitate verification of our work and future research in this area.

Study status

As of 14 April 2026, we have run the searches and screened and identified 200 systematic reviews for inclusion in the training and test sets, commenced drafting of the PRISMA-Check tool, tested the Python pipeline to pre-process all source materials for 15 systematic reviews, and commenced initial drafting of prompts.

Discussion

We believe the detailed assessments of PRISMA 2020 adherence performed in our study will be more informative to various interest holders than the item-level assessments performed in other studies. The PRISMA 2020 items were designed to be broad, with the elements providing the specific recommendations. For example, item 20c reads, “Present results of all investigations of possible causes of heterogeneity among study results”. Without consulting the elements for this item, humans and LLMs may interpret the term “results” differently; for example, does a “result” mean a P value for a test for interaction for a subgroup analysis, or does it mean the summary estimate and a measure of precision for each subgroup, or all of these? Furthermore, humans and LLMs might provide the same response to a question about adherence to this item yet be drawing on different supporting evidence, depending on how they have interpreted the question. By reframing the items into questions and providing guidance for when to select each response option and examples of optimal reporting to guide human and LLM responses, we hypothesise that this approach will yield improved performance estimates as compared with those observed in previous studies.

Our study will determine which questions in the PRISMA-Check tool can be accurately automated by LLMs. This knowledge will help inform which questions need the most human oversight by meta-researchers, peer reviewers and other interest holders seeking to assess adherence. We recognise that the number of questions in our adherence tool is large and that authors and editors of a systematic review might not want a report providing many Yes/No judgements. Future research could build on our planned work to identify how best to consolidate the adherence assessments and to develop tools that provide constructive feedback to authors for improving their systematic reviews.

Data availability

This is a study protocol and so there are no data.

Software availability

This is a study protocol and so there is no analytic code.

References

1. Glasziou P, Altman DG, Bossuyt P, et al.: Reducing waste from incomplete or unusable reports of biomedical research. Lancet. 2014; 383(9913): 267–276.
2. Hamilton DG, McKenzie JE, Nguyen P-Y, et al.: Evaluation of the replicability of systematic reviews with meta-analyses of the effects of health interventions. Res. Synth. Methods. 2026; 1–19.
3. Moher D, Schulz KF, Simera I, et al.: Guidance for developers of health research reporting guidelines. PLoS Med. 2010; 7(2): e1000217.
4. Moher D, Liberati A, Tetzlaff J, et al.: Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med. 2009; 6(7): e1000097.
5. Page MJ, McKenzie JE, Bossuyt PM, et al.: The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021; 372: n71.
6. Page MJ, Moher D, Bossuyt PM, et al.: PRISMA 2020 explanation and elaboration: updated guidance and exemplars for reporting systematic reviews. BMJ. 2021; 372: n160.
7. Hamilton DG, McKenzie JE, Nejstgaard CH, et al.: Evaluation of tools used to assess adherence to PRISMA 2020 reveals inconsistent methods and poor tool implementability: part I of a systematic review. J. Clin. Epidemiol. 2026; 192: 112133.
8. Puljak L, Pintur S, Rombey T, et al.: Use of structured tools by peer reviewers of systematic reviews: a cross-sectional study reveals high familiarity with Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) but limited use of other tools. J. Clin. Epidemiol. 2026; 190: 112084.
9. Forero DA, Abreu SE, Tovar BE, et al.: Large Language Models and the Analyses of Adherence to Reporting Guidelines in Systematic Reviews and Overviews of Reviews (PRISMA 2020 and PRIOR). J. Med. Syst. 2025; 49(1): 80.
10. Kataoka Y, So R, Banno M, et al.: Large language models for automated PRISMA 2020 adherence checking. 2025; arXiv:2511.16707.
11. Lee SY, Hong JS, Lee SH, et al.: Compliance of systematic reviews and meta-analyses in ophthalmology with the PRISMA statement: an AI-based assessment and longitudinal comparison with 2017 data. BMC Med. Res. Methodol. 2026.
12. Brown TB, Mann B, Ryder N, et al.: Language Models are Few-Shot Learners. 2020; arXiv:2005.14165.
13. Thomas J, Hair K, Noel-Storr A, et al.: Responsible use of AI in evidence SynthEsis (RAISE 2026): building and evaluating AI evidence synthesis tools (version 3; updated 13 March 2026). Open Science Framework. Washington DC: Center for Open Science.
14. PubMed Central’s Open Access Subset. http
15. National Library of Medicine: Search Strategy Used to Create the PubMed Systematic Reviews Filter. 2019.
16. The EndNote Team: EndNote. EndNote 2025 (v22.0.0.19000) ed. Philadelphia, PA: Clarivate; 2013.
17. Covidence systematic review software, Veritas Health Innovation, Melbourne, Australia. http
18. OpenAI: Best practices for prompt engineering with the OpenAI API. OpenAI Help Center; 2026.
19. Harris PA, Taylor R, Thielke R, et al.: Research electronic data capture (REDCap)—a metadata-driven methodology and workflow process for providing translational research informatics support. J. Biomed. Inform. 2009; 42(2): 377–381.
20. Artificial Analysis LLM Leaderboard. http
21. Python Software Foundation: Python (Version 3.14.3) [Computer software]. 2026.
22. Landis JR, Koch GG: The measurement of observer agreement for categorical data. Biometrics. 1977; 33(1): 159–174.

Competing interests

MJP, EMW and JEM are authors of the PRISMA 2020 statement but have no commercial interest in the use of the guideline. The remaining authors declare no competing interests.

Grant information

Monash University Early Career Research Excellence Program (ECREP) grant
National Health and Medical Research Council Investigator Grant (GNT2033917)
National Health and Medical Research Council Investigator Grant (GNT2009612)
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Published: 04 May 2026, 15:665

https://doi.org/10.12688/f1000research.180216.1

Copyright

© 2026 Page MJ et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

how to cite this article

Page MJ, Mayo-Wilson E, Zeng M et al. Evaluation of automated assessments of systematic review adherence to the PRISMA 2020 statement: study protocol [version 1; peer review: 1 approved with reservations]. F1000Research 2026, 15:665 (https://doi.org/10.12688/f1000research.180216.1)

NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Reviewer Report 22 May 2026

Ahmed S. BaHammam, King Saud University, Riyadh, Riyadh Province, Saudi Arabia

Approved with Reservations

https://doi.org/10.5256/f1000research.198806.r482712

Overall
Manual PRISMA 2020 checking is slow work, and the number of systematic reviews published each month makes it harder each year. So the question this protocol asks is worth asking. The authors push the assessment down from items to elements, run four LLMs in parallel, and add few-shot prompting, a step beyond what earlier studies attempted. The diagnostic accuracy framing is the right framing. The problem is not with the idea. It is that several decisions in the protocol are still loose enough that the final numbers will be hard to interpret. I recommend a major revision.
1. Originality
Incremental, in my view. The earlier LLM-PRISMA studies remained at the item level, used zero-shot prompts, and operated within narrow clinical areas. This one is finer-grained (95 of 171 elements), adds worked examples to the prompts, includes an open-weight model, and samples across health fields rather than within a single specialty. PRISMA-Check is a useful piece of infrastructure, but it is being built at the same time as the validation runs on it, which means it sits awkwardly as both instrument and object of study. The authors should say more directly what is new here beyond going broader and deeper.
2. Major methodological concerns
The sampling is not random in the way the authors describe it. They sort PubMed records by EndNote record number, screen the first 500, and repeat until 200 reviews are eligible. EndNote record numbers come from the import order. The RAND() function only appears later, when the 200 reviews are already chosen, and is used to split training from test. So selection from the eligible pool is not actually randomized. Randomization needs to happen before screening, across the full set of records returned by the PubMed search.
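As an illustration of randomizing before screening, the following minimal Python sketch shuffles the full set of search records once with a fixed seed and only then cuts it into screening batches of 500. The function name, the seed value, and the toy record identifiers are hypothetical and are not taken from the protocol.

```python
import random


def randomized_screening_batches(record_ids, batch_size=500, seed=20260414):
    """Shuffle the full search result once, then yield screening batches."""
    rng = random.Random(seed)    # fixed, reportable seed (hypothetical value)
    shuffled = list(record_ids)  # copy so the import order is left untouched
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


# Toy example with 1,200 made-up record identifiers.
records = [f"rec{i:04d}" for i in range(1200)]
batches = list(randomized_screening_batches(records))
print([len(batch) for batch in batches])  # [500, 500, 200]
```

Because the shuffle covers every record before any screening decision is made, the order in which records were imported into the reference manager no longer influences which eligible reviews are reached first.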
The reference standard needs more work. Two assessors plus consensus is fine in principle. What I cannot find in the protocol is how the assessors will be trained, how calibration will be checked, what the pre-consensus agreement looks like, or how disagreements get resolved beyond "discussion." On top of that, one assessor will rate all 100 test reviews while five others rate 20 each. That is a recipe for the dominant assessor's interpretation quietly setting the standard. A five-review pilot is not enough to align six people on 200 questions.
PRISMA-Check is being validated at the same time as the LLMs. Humans and LLMs receive the same guidance and the same examples. So if agreement turns out high, you cannot tell whether the LLMs are tracking PRISMA 2020 adherence or just learning to echo the wording of PRISMA-Check. I would like the authors to say what they would accept as evidence for the former.
The 76 excluded elements are a bigger issue than the protocol treats them as. They were chosen because one investigator thought they were difficult. Difficulty is exactly where users will want LLM help, so cutting those out makes the validation easier on the models. The abstract, keywords, and conclusion should all explicitly state that this is a subset, not all of PRISMA 2020, and a table of the excluded elements with short reasons should appear in the paper.
Thresholds belong in the protocol, not in a later team meeting. Saying that acceptable performance will be defined "before analysis" still leaves room for those thresholds to drift after the team has seen some preliminary numbers, even unintentionally. Put them in now, and tie them to use cases: a peer-review tool that misses things is worse than one that flags too much, so sensitivity matters more there; a meta-research tool can trade differently.
Not Applicable responses are not handled. Some questions allow NA, but the 2x2 table is binary. The protocol needs to say what happens to NA at the question level, at the element level, at the item level, and in the overall summaries. Drop them, keep them, treat them as a third category? The choice will change prevalence, predictive values, and F1, especially for elements where applicability is genuinely variable (registration, certainty assessment, synthesis methods).
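To make the consequence of the NA choice concrete, here is a minimal Python sketch that builds the 2x2 counts under two possible policies: dropping any pair with an NA rating, or recoding NA as "no". The policy names, the function, and the toy ratings are assumptions for illustration only; the protocol does not specify either policy.

```python
def two_by_two(pairs, na_policy="drop"):
    """Count TP/FP/FN/TN from (human, llm) pairs rated 'yes', 'no' or 'na'.

    na_policy='drop'  : discard any pair in which either rating is 'na'
    na_policy='as_no' : recode 'na' as 'no' before counting
    """
    if na_policy not in ("drop", "as_no"):
        raise ValueError(f"unknown na_policy: {na_policy}")
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0, "dropped": 0}
    for human, llm in pairs:
        if "na" in (human, llm):
            if na_policy == "drop":
                counts["dropped"] += 1
                continue
            human = "no" if human == "na" else human
            llm = "no" if llm == "na" else llm
        if human == "yes":  # human consensus is the reference standard
            counts["tp" if llm == "yes" else "fn"] += 1
        else:
            counts["fp" if llm == "yes" else "tn"] += 1
    return counts


# Made-up ratings for one question across six reviews.
pairs = [("yes", "yes"), ("yes", "no"), ("no", "no"),
         ("na", "na"), ("na", "yes"), ("no", "yes")]
print(two_by_two(pairs, "drop"))   # 4 pairs counted, 2 dropped
print(two_by_two(pairs, "as_no"))  # all 6 pairs counted
```

In this toy data the prevalence of "reported" is 2/4 when NA pairs are dropped and 2/6 when NA is recoded, which is the kind of shift that carries through to predictive values and F1.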
Reproducibility of the LLM runs is fragile as written. "GPT-5.4, Gemini 3.1 Pro Preview, Claude Opus 4.6, Qwen3.6 Plus" plus a leaderboard snapshot is not enough. The word "Preview" alone tells you the model could shift mid-study. The authors should commit, in this protocol, to recording exact API version strings, access dates, parameters used, seeds where available, the full prompts, and to archiving raw model outputs with the data release.
Preprocessing has been validated lightly. The pipeline handles HTML, PDF, DOCX, XLSX/CSV, and several image formats, with figures encoded inline. One investigator cross-checked 15 training reviews. Given the format zoo, and the reliance on the Adobe PDF Services API for compound supplementary PDFs, that is not much. I would want to see a definition of what counts as a successful extraction, a way to detect parsing failures during the test phase, and a rule for what to do when supplementary material cannot be converted cleanly. If a preprocessing failure quietly becomes a model error in the analysis, performance gets understated in ways nobody can audit.
The stopping rule for prompt development is vibes-based. "We will continue until further refinement is unlikely to help" is honest, but it is not a rule. Pick one: a maximum number of refinement cycles, a minimum improvement increment, or a plateau on a held-back slice of the training set. The authors should also state plainly that no test-set review will be opened, viewed, or skimmed by anyone during prompt development.
Majority-vote logic is unfinished. For the "all four LLMs agree" comparison, what happens when one or more disagree? Are those cases dropped, counted as wrong by default, or treated as missing? The decision shapes apparent performance and favors different ensembles. Sort this out before the runs.
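One way to pin this down is to have the ensemble rule return an explicit "no ensemble answer" value, so that the analysis must state what happens to those cases. The Python sketch below is a hypothetical illustration, assuming binary answers and a non-empty list of votes; the rule names are not from the protocol.

```python
from collections import Counter


def ensemble_answer(votes, rule="majority"):
    """Combine per-model answers ('yes'/'no') into one answer or None.

    rule='unanimous': return the shared answer only if all models agree.
    rule='majority' : return the most common answer; ties return None.
    None marks a case that still needs an explicit analysis rule
    (dropped, counted as wrong, or treated as missing).
    """
    tally = Counter(votes).most_common()
    top_answer, top_count = tally[0]
    if rule == "unanimous":
        return top_answer if top_count == len(votes) else None
    if rule == "majority":
        tied = len(tally) > 1 and tally[1][1] == top_count
        return None if tied else top_answer
    raise ValueError(f"unknown rule: {rule}")


print(ensemble_answer(["yes", "yes", "yes", "no"], "unanimous"))  # None
print(ensemble_answer(["yes", "yes", "yes", "no"], "majority"))   # yes
print(ensemble_answer(["yes", "yes", "no", "no"], "majority"))    # None
```

With four models a 2-2 split is possible, so even a majority rule needs a prespecified tie policy.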
Weighting in the overall summaries needs stating. Items vary in how many elements they contain; elements vary in how many questions. Item 13 has six sub-items. Item 1 has one. If overall agreement just pools questions, item 13 quietly dominates the result. The authors should say whether they will weight by question, by element, by item, or by review, and ideally report more than one weighting.
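The difference between pooling questions and averaging per item can be shown with a small Python sketch. The question counts and agreement values below are invented purely to show the arithmetic and do not describe PRISMA-Check; the function name is hypothetical.

```python
from collections import defaultdict


def pooled_and_item_level_agreement(rows):
    """rows: iterable of (item_id, agree), one True/False entry per question.

    Returns (pooled, macro): pooled weights every question equally,
    macro averages the per-item agreement so each item counts once.
    """
    by_item = defaultdict(list)
    for item_id, agree in rows:
        by_item[item_id].append(agree)
    total_questions = sum(len(v) for v in by_item.values())
    pooled = sum(sum(v) for v in by_item.values()) / total_questions
    macro = sum(sum(v) / len(v) for v in by_item.values()) / len(by_item)
    return pooled, macro


# Invented data: ten questions under item 13 (nine agree), one under item 1 (disagree).
rows = [("13", True)] * 9 + [("13", False)] + [("1", False)]
pooled, macro = pooled_and_item_level_agreement(rows)
print(round(pooled, 3), round(macro, 3))  # 0.818 0.45
```

The same ratings give 82% agreement when questions are pooled and 45% when each item counts once, which is why the weighting scheme should be fixed in advance.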
3. Data analysis
The metric set is sensible, and including Gwet's AC alongside percentage agreement is the right call given the prevalence imbalance you would expect across elements. Predictive values will vary with the prevalence of "reported," which varies by element, so PPV and NPV should be reported with their respective prevalences and interpreted with that in mind. Sparse cells will appear (some elements are nearly always reported, some almost never), and the protocol should say in advance how zero cells, undefined estimates, and missing LLM outputs will be handled.
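A minimal Python sketch of how undefined estimates could be surfaced rather than silently replaced is given below. It assumes the human consensus is the reference standard and returns None whenever a denominator is zero; the function names and the example counts are hypothetical.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def accuracy_metrics(tp, fp, fn, tn):
    """Diagnostic accuracy estimates from one 2x2 table, reported with prevalence."""
    total = tp + fp + fn + tn
    return {
        "prevalence": safe_ratio(tp + fn, total),
        "sensitivity": safe_ratio(tp, tp + fn),
        "specificity": safe_ratio(tn, tn + fp),
        "ppv": safe_ratio(tp, tp + fp),
        "npv": safe_ratio(tn, tn + fn),
    }


# Invented counts for an element that is almost always reported:
# the model never answers "no", so NPV has a zero denominator.
metrics = accuracy_metrics(tp=95, fp=5, fn=0, tn=0)
print(metrics["prevalence"], metrics["ppv"], metrics["npv"])  # 0.95 0.95 None
```

Reporting None (or "not estimable") alongside the prevalence keeps sparse elements visible in the results instead of letting a continuity correction or a dropped row hide them.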
The plan to compare the supporting evidence pulled out by humans and LLMs is, for me, the most useful part of the analysis. It would be stronger with a predefined taxonomy of error types: retrieval failure, misreading of the criterion, hallucinated quotes, mishandling of tables or supplementary material, and genuinely ambiguous reporting. Otherwise, the discrepancy analysis stays descriptive when it could be diagnostic.
4. Interpretation and discussion
The authors do note that human consensus is not a true gold standard. Good. That caveat needs to survive into the conclusion, which currently reads as if the study will show which questions LLMs can "accurately" automate. What it can actually show is which questions four LLMs answer in agreement with two trained humans using PRISMA-Check, on PubMed Central Open Access intervention reviews from one month in 2025. The boundaries should be stated cleanly in the discussion: English only, open access only, one indexing month, and intervention reviews of human health only. Paywalled reviews, non-English reviews, diagnostic accuracy reviews, prognostic reviews, scoping reviews, and non-health reviews are not covered.
5. Validation of measurements and biomarkers
Not relevant here. No biological measurements are involved. The instruments that need validating are PRISMA-Check, the preprocessing pipeline, and the human reference standard, all dealt with above.
6. Presentation
The writing is clear. Two things would help readers: a flow diagram covering review identification, training and test allocation, prompt development, dual human assessment, LLM assessment, and analysis; and a table listing the 95 elements that are in and the 76 that are out, with short reasons for the exclusions. The abstract and keywords should make the subset framing visible rather than leaving it buried in the methods.
One small inconsistency worth fixing. The abstract refers to 95 elements covering 41 items. The introduction says PRISMA 2020 has 42 items and 183 elements. The methods say PRISMA-Check covers 171 elements across 41 items. A single sentence relating the three numbers (183 in PRISMA 2020, 171 in PRISMA-Check after dropping the abstract item, 95 evaluated here) would clear it up.
Recommendation
Major revision. The idea is good and the team has the right people on it. The weaknesses cluster in a few places: how the sample is drawn, how solid the human reference standard is, what gets prespecified (thresholds, NA handling, weighting, stopping rules), how transparently the exclusions are presented, and how reproducible the LLM runs will be a year from now. None of these is hard to fix individually. Together, they decide how much weight the results will carry.

Is the rationale for, and objectives of, the study clearly described?
Yes

Is the study design appropriate for the research question?
Partly

Are sufficient details of the methods provided to allow replication by others?
Partly

Are the datasets clearly presented in a useable and accessible format?
Not applicable

References

1. Page M, Mayo-Wilson E, Zeng M, Clark D, et al.: Evaluation of automated assessments of systematic review adherence to the PRISMA 2020 statement: study protocol. F1000Research. 2026; 15.

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Interest in AI applications in Medicine, Scientific Writing, Peer Review, and Editorialship

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard; however, I have significant reservations, as outlined above.

[1] 1. Glasziou P, Altman DG, Bossuyt P, et al.: Reducing waste from incomplete or unusable reports of biomedical research. Lancet. 2014; 383(9913): 267–276. Publisher Full Text

[2] 2. Hamilton DG, McKenzie JE, Nguyen P-Y, et al.: Evaluation of the replicability of systematic reviews with meta-analyses of the effects of health interventions. Res. Synth. Methods. 2026; 1–19.

[3] 3. Moher D, Schulz KF, Simera I, et al.: Guidance for developers of health research reporting guidelines. PLoS Med. 2010; 7(2): e1000217. PubMed Abstract | Publisher Full Text | Free Full Text

[4] 4. Moher D, Liberati A, Tetzlaff J, et al.: Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med. 2009; 6(7): e1000097. PubMed Abstract | Publisher Full Text | Free Full Text

[5] 5. Page MJ, McKenzie JE, Bossuyt PM, et al.: The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021; 372: n71. Publisher Full Text

[6] 6. Page MJ, Moher D, Bossuyt PM, et al.: PRISMA 2020 explanation and elaboration: updated guidance and exemplars for reporting systematic reviews. BMJ. 2021; 372: n160. Publisher Full Text

[7] 7. Hamilton DG, McKenzie JE, Nejstgaard CH, et al.: Evaluation of tools used to assess adherence to PRISMA 2020 reveals inconsistent methods and poor tool implementability: part I of a systematic review. J. Clin. Epidemiol. 2026; 192: 112133.

[8] 8. Puljak L, Pintur S, Rombey T, et al.: Use of structured tools by peer reviewers of systematic reviews: a cross-sectional study reveals high familiarity with Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) but limited use of other tools. J. Clin. Epidemiol. 2026; 190: 112084. PubMed Abstract | Publisher Full Text

[9] 9. Forero DA, Abreu SE, Tovar BE, et al.: Large Language Models and the Analyses of Adherence to Reporting Guidelines in Systematic Reviews and Overviews of Reviews (PRISMA 2020 and PRIOR). J. Med. Syst. 2025; 49(1): 80. PubMed Abstract | Publisher Full Text | Free Full Text

10. Kataoka Y, So R, Banno M, et al.: Large language models for automated PRISMA 2020 adherence checking. arXiv:2511.16707. 2025 November 01.

11. Lee SY, Hong JS, Lee SH, et al.: Compliance of systematic reviews and meta-analyses in ophthalmology with the PRISMA statement: an AI-based assessment and longitudinal comparison with 2017 data. BMC Med. Res. Methodol. 2026.

12. Brown TB, Mann B, Ryder N, et al.: Language Models are Few-Shot Learners. arXiv:2005.14165. 2020 May 01.

13. Thomas J, Hair K, Noel-Storr A, et al.: Responsible use of AI in evidence SynthEsis (RAISE 2026): building and evaluating AI evidence synthesis tools (version 3; updated 13 March 2026). Open Science Framework. Washington DC: Center for Open Science.

14. PubMed Central’s Open Access Subset. http

15. National Library of Medicine. Search Strategy Used to Create the PubMed Systematic Reviews Filter. 2019.

16. The EndNote Team: EndNote. EndNote 2025 (v22.0.0.19000) ed. Philadelphia, PA: Clarivate; 2013.

17. Covidence systematic review software, Veritas Health Innovation, Melbourne, Australia. http

18. OpenAI: Best practices for prompt engineering with the OpenAI API. OpenAI Help Center; 2026.

19. Harris PA, Taylor R, Thielke R, et al.: Research electronic data capture (REDCap): a metadata-driven methodology and workflow process for providing translational research informatics support. J. Biomed. Inform. 2009; 42(2): 377–381.

20. Artificial Analysis LLM Leaderboard. http

21. Python Software Foundation: Python (Version 3.14.3) [Computer software]. 2026.

22. Landis JR, Koch GG: The measurement of observer agreement for categorical data. Biometrics. 1977; 33(1): 159–174.

Table 1. 2x2 (confusion) table defining terminology used in calculating performance metrics (see Table 2).

Table 2. Interpretation of performance metrics (adapted from Thomas et al.¹³).
