Keywords
artificial intelligence; systematic review; reporting quality; PRISMA
This article is included in the Research on Research, Policy & Culture gateway.
The Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) 2020 statement aims to help users write complete and accurate systematic review reports. Peer-reviewers and meta-researchers often assess adherence of systematic reviews to PRISMA 2020. This process can be time consuming, particularly when evaluating many systematic reviews. Automated approaches using large language models (LLMs) have the potential to accelerate this process and produce more comprehensive assessments.
To evaluate the performance of LLMs prompted to undertake comprehensive assessments of adherence to PRISMA 2020.
We will conduct a validation study using a diagnostic test accuracy analysis framework. We will assemble a sample of 200 published systematic reviews assessing the effects of interventions on human health, which will be divided into a training set and a test set. To assess adherence, we will reframe 95 reporting elements in PRISMA 2020 (which provide granular reporting recommendations for 41 checklist items) into one or more questions. We will iteratively develop few-shot prompts for use in several LLMs and refine them using systematic reviews included in the training set. Final LLM prompts will be applied to all systematic reviews in the test set and LLM responses compared with consensus responses of two humans (our reference standard). We will estimate the performance of each LLM assessment against the reference standard by calculating percentage agreement, Gwet’s Agreement Coefficient, sensitivity, specificity, positive predictive value, negative predictive value, and the F1 score.
We will determine which questions in our comprehensive tool for assessing adherence to PRISMA 2020 can be accurately automated by LLMs. This knowledge will help inform which questions need the most human oversight by meta-researchers, peer reviewers and other interest holders seeking to assess adherence.
Systematic reviews are used to synthesise evidence addressing particular research questions, which can inform health care decision making and policy, as well as guide future research priorities. However, the effort and resources underlying a systematic review are wasted if authors do not report completely and accurately what methods they used, what they found and to what populations and settings the findings apply.1 For example, incomplete reporting of the characteristics of included studies can prevent clinicians from judging whether the review findings apply to patients they see, hindering the formulation of appropriate recommendations. Incomplete reporting can also impede efforts by researchers to assess the rigour of the methods employed, replicate the methods used, or verify and update the review.2
Reporting guidelines are designed to help authors ensure that research reports are complete and accurate. They typically comprise a checklist providing recommendations on what to report in a specific type of article, along with explanatory text and exemplars of reporting [3]. The most widely used reporting guideline for systematic reviews is the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) statement, published originally in 2009 [4] and updated in 2020 [5,6]. PRISMA 2020 includes 42 items (counting the sub-items, such as 10a and 10b, separately), each of which provides a synopsis of what should be reported for a particular aspect of the review. Under each item sit one or more elements (183 in total), which provide detailed reporting recommendations within the item.
Various interest holders have used PRISMA 2020 as a tool to evaluate the completeness of reporting of published systematic reviews. Meta-researchers have undertaken studies to assess whether published systematic reviews adhere to PRISMA 2020, including investigating which recommendations are frequently not followed and factors that may predict adherence (e.g. time, clinical discipline, journal characteristics).7 Journal editors and peer reviewers perform assessments of PRISMA 2020 adherence to flag missing and unclear details that require rectification in manuscripts prior to publication.8
To date, assessments of adherence to PRISMA 2020 recommendations have relied on manual coding of systematic reviews, which is time consuming and resource intensive. Consequently, prior meta-research studies have been limited in size and scope (e.g. restricted to a particular intervention or condition).7 Furthermore, almost all studies have assessed adherence to PRISMA 2020 at the item-level, rather than at the more detailed element-level. Automated assessments of PRISMA 2020 adherence have the potential to scale up research about completeness of reporting, enabling evaluation of large numbers of reviews across interventions, conditions and time. Furthermore, automation has the potential to enable more comprehensive and granular assessments of adherence, facilitate rapid assessment during the peer review process, and provide insights into modifiable factors that might enhance reporting.
We are aware of three studies in which investigators used large language models (LLMs) to automate the assessment of systematic review adherence to PRISMA 2020 [9–11]; however, these studies have limitations. In all studies, the tool used to assess adherence was the PRISMA 2020 item-level checklist. However, the checklist is designed to guide authors on what to report in a systematic review, not on how readers of a systematic review should assess adherence, and so assessments across the studies using this instrument – by both humans and LLMs – were not standardised. Furthermore, the items provide a high-level summary of various reporting elements, so the LLM assessments in these studies were themselves high-level and did not capture adherence to the more detailed recommendations. In all studies, zero-shot prompting was used (i.e. no explanations of relevant concepts or examples of optimal reporting were provided), which might have hampered performance. Finally, the samples of systematic reviews evaluated were restricted to particular health fields (acupuncture, emergency medicine, rehabilitation and ophthalmology).
To address these limitations, we aim to evaluate the performance of several LLMs prompted to undertake comprehensive assessments of adherence to PRISMA 2020 using an adherence tool, in a sample of systematic reviews selected regardless of the type of intervention or health field. We will perform few-shot prompting (i.e. include in the prompts examples of optimal reporting for each PRISMA recommendation) to guide model outputs.12
We will follow the RAISE (Responsible use of AI in evidence SynthEsis) guidance on building and evaluating AI evidence synthesis tools.13 To evaluate whether LLMs can accurately assess the adherence of systematic reviews to PRISMA 2020, we will conduct a validation study using a diagnostic test accuracy analysis framework. We will assemble a sample of published systematic reviews assessing the effects of interventions on human health, which will be divided into a training set and a test set (also known as a held-out set). We will design few-shot prompts for use in several LLMs and evaluate and refine the prompts using systematic reviews included in the training set. Once satisfied with the prompts, no further changes will be made to them, and we will apply the prompts to all systematic reviews in the test set. Two human assessors will independently assess all reviews in the test set manually using our PRISMA 2020 adherence tool (PRISMA-Check), which will include the same guidance and examples as those provided in the LLM prompts. We will compare performance of LLM assessments with consensus human assessments (our reference standard).
We will assemble a random sample of published systematic reviews that:
• Meet the PRISMA 2020 definition of a systematic review, that is, a review in which explicit, systematic methods are used to collate and synthesise studies that address a clearly formulated question.5 Systematic reviews will be eligible irrespective of whether they present a synthesis (e.g. meta-analysis), a structured summary (e.g. tabular or narrative description) of study findings, or both;
• Include studies evaluating the effects of one or more interventions on human health, irrespective of the type of study design (e.g. randomized trial, cohort study), outcome (e.g. continuous, binary) and effect measure used to quantify the intervention effect (e.g. mean difference, risk ratio);
• Were written in English and indexed in PubMed Central’s Open Access Subset14 in September 2025 (the month prior to study initiation).
To identify eligible systematic reviews, we will run the following search in PubMed: systematic [sb] AND pubmed pmc open access [filter] AND 2025/09/01:2025/09/30[EDAT]. The “systematic [sb]” component runs a search strategy designed to retrieve citations to systematic reviews in PubMed.15 All citations retrieved from the search will be exported into EndNote reference management software16 and sorted by record number (a unique identifier that EndNote automatically assigns to each record as it is added to a library). The first 500 sorted records will be imported into Covidence17 and one author (DPQC) will screen the titles and abstracts of each record. The same author will retrieve any potentially relevant full text reports (manually or via the automated article retrieval feature in EndNote) and screen each report. This step will be repeated as many times as needed until a target of 200 eligible systematic reviews is identified. Any uncertainties about eligibility will be discussed with the principal investigator (MJP). Prior to commencing each screening stage, two authors (DPQC and MJP) will independently pilot the screening process on 50 abstracts and 20 full text reports, respectively.
For each of the 200 included reviews, we will save the full-text HTML file and retrieve all associated supplementary materials in any format available. Each of the included reviews will be assigned a unique identifier based on its PubMed Central ID (PMCID). We will then use the =RAND() function in Microsoft Excel to assign a random number to each review and sort them by random number, with the first 100 being set aside for potential inclusion in the training set and the second 100 being included in the test set. We have selected a sample size of 100 systematic reviews for the test set to balance feasibility and precision. This sample size allows us to restrict the width of a 95% two-sided Wald-type normal confidence interval around the estimated percentage agreement between human and LLM responses to a maximum of 20% (i.e. precision ±10%), assuming a percentage agreement of 50%. For a percentage agreement of less (or greater) than 50%, the absolute width will be smaller; for example, an estimated percentage agreement of 90% would yield a confidence interval width of 12%.
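The precision figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the standard Wald formula with a critical value of 1.96; the function name `wald_ci_width` is illustrative and is not part of the study code.

```python
import math


def wald_ci_width(p: float, n: int, z: float = 1.96) -> float:
    """Full width of a two-sided Wald-type normal CI for a proportion p with n observations."""
    return 2 * z * math.sqrt(p * (1 - p) / n)


# Width of the 95% CI for a test set of 100 reviews at two assumed agreement levels.
for agreement in (0.5, 0.9):
    width = wald_ci_width(agreement, 100)
    print(f"agreement={agreement:.0%}, n=100, CI width={width:.1%}")
```

With 100 reviews the width is about 19.6% at 50% agreement and about 11.8% at 90% agreement, consistent with the rounded values of 20% and 12% given in the text.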
To identify examples that could be included in few-shot prompts, we will assemble a sample of Cochrane reviews. These reviews will be chosen because they are more likely to adhere to PRISMA 2020 than non-Cochrane reviews for two primary reasons. Cochrane reviews are undertaken using the software Review Manager (RevMan), which includes a standard manuscript template (i.e. with section headings such as “Criteria for considering studies for this review” and “Synthesis methods”), and, additionally, guidance drawn from PRISMA 2020 on what to report in each of these sections. Also, Cochrane editors assess manuscripts to ensure they adhere to PRISMA 2020 items. We will identify Cochrane reviews indexed in PubMed Central’s Open Access Subset by running the following search in PubMed: (pubmed pmc open access [filter] AND cochrane database syst rev [so]) NOT protocol. One author (MJP) will screen records directly from the PubMed interface and retrieve the full text of the 50 most recently published Cochrane intervention reviews.
We are currently developing PRISMA-Check, a tool to assess adherence to PRISMA 2020, which reframes each reporting element into one or more questions. For example, the following element for the item on the data collection process, “Report how many reviewers collected data from each report, whether multiple reviewers worked independently or not, and any processes used to resolve disagreements between data collectors”, has been reframed into four questions:
• 9.1a. Did the authors report how many reviewers collected data from each report?
• 9.1b. If yes to 9.1a, did they report that multiple reviewers collected data from each report?
• 9.1c. If yes to 9.1b, did they report whether reviewers worked independently or not?
• 9.1d. If yes to 9.1b, did they report any processes used to resolve disagreements between data collectors?
All questions have “Yes/No” response options, with some also having a “Not applicable” option. Responses to questions map to element- and item-level responses (specifically, if the answer to all applicable questions is “Yes”, the element will be rated as “reported”, and if all applicable elements are rated as “reported”, the item will be rated as “reported”). Each question will be accompanied by guidance on when to select each response option, along with examples of complete reporting. Examples will be sourced from the exemplars presented in the supplement to the PRISMA 2020 explanation and elaboration paper6 and the 50 Cochrane reviews retrieved. We will modify real examples if they are missing some details or lack clarity, and, if necessary, will use an LLM (Gemini 3) to invent examples when none exist in our sources. One investigator (DPQC) will source examples, and each will be verified for appropriateness by the principal investigator (MJP).
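The roll-up from question responses to element and item ratings described above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and returning `None` when nothing is applicable is an assumption, since the text does not say how that case is handled.

```python
from typing import List, Optional

YES, NO, NA = "Yes", "No", "Not applicable"


def rate_element(answers: List[str]) -> Optional[str]:
    """Rate an element as reported only if every applicable question is answered Yes."""
    applicable = [a for a in answers if a != NA]
    if not applicable:
        return None  # assumption: no applicable questions means no rating
    return "reported" if all(a == YES for a in applicable) else "not reported"


def rate_item(element_ratings: List[Optional[str]]) -> Optional[str]:
    """Rate an item as reported only if every applicable element is rated reported."""
    applicable = [r for r in element_ratings if r is not None]
    if not applicable:
        return None  # assumption: no applicable elements means no rating
    return "reported" if all(r == "reported" for r in applicable) else "not reported"


# Questions 9.1a to 9.1d for one hypothetical review: the authors stated that
# two reviewers worked independently but gave no process for disagreements.
element_9_1 = rate_element([YES, YES, YES, NO])
print(element_9_1, rate_item([element_9_1]))
```

A single "No" on an applicable question is enough to rate the element, and therefore the item, as not reported.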
PRISMA-Check currently includes 315 questions addressing all elements (n = 171) corresponding to 41 of the 42 PRISMA 2020 items; the item focusing on the systematic review abstract is excluded. For reasons of feasibility, in this project we will assess 200 questions addressing a subset of the elements (n = 95) that correspond to the 41 items, with at least one element per item being assessed. The omitted subset of elements consists of those assumed by the principal investigator (MJP) to be difficult to assess by both humans and LLMs because, for example, there is more subjectivity in the judgement. In future, we will develop methods to automate the assessment of all questions in PRISMA-Check.
Rather than directly uploading heterogeneous source files (text and figure) into an LLM interface, we will pre-process all source files for the systematic reviews into a structured format before submitting materials to the LLM via an API. The main text of the articles in HTML format will be parsed and standardised into a unified JavaScript Object Notation (JSON) structure with embedded figures encoded as URL objects. The main Python packages used in this pipeline will include “beautifulsoup4” and “html-to-markdown”. Supplementary files available in various file formats, such as PDF, DOCX, XLSX/CSV, and standalone figure files (TIFF, PNG, JPG), will be converted through format-specific pipelines into the same unified JSON structure. If all supplementary materials appear in a single PDF, we will use the Adobe PDF Services API (“adobe-pdfservices-sdk”) to extract each element and then the text will be standardised into the unified JSON format with figures encoded as base64 data. Text in the DOCX and XLSX/CSV files will be processed through format-specific pipelines with figures similarly encoded as base64 data. The main Python packages used in this pipeline will include “python-docx” and “openpyxl”, along with the standard-library “base64” module. Standalone supplementary figure files will also be encoded as base64 data. For figure files in unsupported formats (e.g. TIFF), the files will be first converted to PNG using the Python package “Pillow” and then encoded as base64 data. Following individual document conversion, the main article JSON file and all associated supplementary JSON files will be consolidated into the multimodal prompt format with textual and visual components indicated as explicit “input_text” and “input_image” elements presented in the order in which they appear in the documents.
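To make the final consolidation step concrete, the sketch below assembles text and figures into an ordered list of “input_text” and “input_image” elements using only the Python standard library. The dictionary keys (`type`, `text`, `image_url`), the `kind`/`data` block layout and the helper names are assumptions made for illustration; the actual JSON schema of the pipeline is not specified in this protocol.

```python
import base64
from typing import Dict, List


def text_part(text: str) -> Dict[str, str]:
    """Wrap a passage of text as an input_text element (key names are assumed)."""
    return {"type": "input_text", "text": text}


def image_part(image_bytes: bytes, mime_type: str = "image/png") -> Dict[str, str]:
    """Encode raw image bytes as a base64 data URL inside an input_image element."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "input_image", "image_url": f"data:{mime_type};base64,{encoded}"}


def build_content(blocks: List[dict]) -> List[Dict[str, str]]:
    """Convert parsed blocks into prompt elements, preserving document order."""
    parts = []
    for block in blocks:
        if block["kind"] == "text":
            parts.append(text_part(block["text"]))
        else:
            parts.append(image_part(block["data"]))
    return parts


# Placeholder bytes (the PNG file signature) stand in for a real figure.
placeholder_png = b"\x89PNG\r\n\x1a\n"
content = build_content([
    {"kind": "text", "text": "Figure 1 shows the flow of records."},
    {"kind": "figure", "data": placeholder_png},
    {"kind": "text", "text": "We included 12 trials."},
])
print([part["type"] for part in content])
```

Keeping the elements in a single ordered list means a figure stays adjacent to the text that refers to it when the prompt is submitted.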
To assess the performance of the pipeline, JSON files will be generated for 15 systematic reviews randomly selected from the training set and will be reviewed and cross-checked against the source files by one investigator (MYZ).
Following OpenAI’s best practice guidance,18 we will develop few-shot prompts to guide the LLMs to answer each question selected from the PRISMA-Check tool. Each prompt will include a task description and the questions, guidance on when to select each response option and examples of complete reporting that appear in PRISMA-Check. We will instruct the model to provide for each question a response, relevant quotes appearing in the systematic review, and a rationale for the response. Prompts will be developed and optimised using GPT-5.4 first, and then further tested on an open-weight model (Qwen3.6 Plus).
We will use an iterative approach to prompt development. One investigator (MYZ) will draft prompts and evaluate them on 15 systematic reviews in the training set. Two investigators (DPQC, MZ) will then independently validate the responses made by the LLMs. Each investigator will consider the LLM’s output for each question asked and cross-check the output against the systematic review report and evaluate whether each response by the LLM was correct or not. Discrepancies will be resolved via discussion or, if necessary, via consultation with another investigator (MJP). The investigators will then identify key areas for improvement to the prompts and the PRISMA-Check tool, including to the wording of the questions, guidance on when to select each response option and the examples of complete reporting. We will revise the prompts accordingly and evaluate them on the 15 reviews previously evaluated and another 15 reviews in the training set. Part of this process will involve exploring how to ask questions most efficiently, for example by comparing accuracy and cost when using one lengthy prompt including all questions versus using 41 prompts asking the questions corresponding to each item. We will continue this revision-evaluation cycle on as many systematic reviews in the training set as is necessary for us to be confident that additional prompt refinements are unlikely to meaningfully improve response accuracy.
Each of the 100 systematic reviews in the test set will be assessed independently by two human assessors (DPQC will assess all 100 reviews while MYZ, DGH, BNS, PYN and MJP will each assess 20 reviews). Assessors will record several characteristics of the systematic reviews, including the journal that published the review, country of corresponding author, source of funding for the review, type of population evaluated (ICD-11 category), and type of intervention evaluated (drug, non-drug, or both). Assessors will then complete the PRISMA-Check tool, which will be administered via an online data collection tool (REDCap version 15.5.30 [19]) and include the same questions, guidance and examples that appear in the final version of the LLM prompts. Investigators will extract relevant quotes or record the relevant table or figure title as supporting evidence for all questions. Investigators will be asked to consult only the systematic review article and supplementary file(s) provided to them and not to consult the review protocol (if cited) or any external website cited as hosting additional materials. Any discrepancies in responses will be resolved via discussion or, if necessary, via consultation with another investigator (MJP or JEM). Prior to undertaking manual assessments, investigators will independently pilot the assessment process on five systematic reviews to familiarise investigators with the PRISMA-Check tool.
We will evaluate our prompting approaches using the 100 systematic reviews included in the test set on three proprietary LLMs (GPT-5.4, Gemini 3.1 Pro Preview, and Claude Opus 4.6, which are the best performing LLMs as of April 2026 according to Artificial Analysis’ LLM Leaderboard20) and one of the best performing open-weight LLMs (Qwen3.6 Plus). Each of the LLMs will assess PRISMA 2020 adherence for each systematic review using the few-shot prompts developed during the training phase and will be instructed to generate a JSON file providing for each question a response, relevant quotes appearing in the systematic review, and a rationale for the response. We will adjust the default LLM parameters to minimise randomness (e.g. set the temperature to 0) to ensure the model adheres closely to the instructions and minimises improvisation. To measure resource use of each LLM we will record the number of tokens, cost and generation time for each systematic review assessed.
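Because each LLM is asked to return a JSON file with a response, quotes and a rationale per question, a basic structural check before analysis may help catch malformed outputs. The sketch below is one possible check, not the study's procedure; the key names (`question_id`, `response`, `quotes`, `rationale`) and the function name are hypothetical.

```python
import json
from typing import Any, Dict, List

ALLOWED_RESPONSES = {"Yes", "No", "Not applicable"}
REQUIRED_KEYS = ("question_id", "response", "quotes", "rationale")  # assumed key names


def parse_llm_output(raw: str) -> List[Dict[str, Any]]:
    """Parse an LLM's JSON output and check each question record has the expected shape."""
    records = json.loads(raw)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of question records")
    for record in records:
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"record is missing keys: {missing}")
        if record["response"] not in ALLOWED_RESPONSES:
            raise ValueError(f"unexpected response: {record['response']!r}")
        if not isinstance(record["quotes"], list):
            raise ValueError("quotes must be a list of strings")
    return records


example = json.dumps([{
    "question_id": "9.1a",
    "response": "Yes",
    "quotes": ["Two reviewers independently extracted data."],
    "rationale": "The number of data collectors is stated.",
}])
print(parse_llm_output(example)[0]["response"])
```

Rejecting a malformed file early keeps free-text answers such as "Partially" from being silently counted as either reported or not reported.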
All analyses will be conducted using Python.21 We will calculate frequencies and percentages to summarise systematic review characteristics and median and interquartile range to summarise token counts, costs and generation time of each LLM. We will employ a diagnostic test accuracy analysis framework using the consensus response between the two human assessors as the reference standard. We will estimate the performance of each of the LLM assessments against the reference standard by calculating the following metrics: percentage agreement, Gwet’s Agreement Coefficient (AC),22 sensitivity, specificity, positive predictive value, negative predictive value, and F1 score13 (see Tables 1 and 2 for terminology and descriptions of each metric). These performance metrics will be calculated at the element level, at the item level, and as overall summaries across all elements and across all items. The overall performance estimates will be used to inform our conclusions on which LLMs (or combinations of LLM judgements) are acceptable. Prior to data analysis, we will specify thresholds that define acceptable performance for each metric. These thresholds will be determined through discussion with the authorship team. The performance estimates at the element level and item level will be used to identify elements and items where more prompting or examples may be required in future. We will calculate 95% confidence intervals for each of the above metrics. Confidence intervals for performance metrics calculated across all elements and across all items will be obtained using the bootstrapping percentile method with 5000 replications. We will allow for clustering of observations within systematic reviews by resampling the entire review.
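The clustered percentile bootstrap described above can be sketched with the standard library alone. This is a minimal illustration for percentage agreement, assuming each review contributes a list of (LLM rating, human consensus rating) pairs; the function names, the toy data and the simple order-statistic percentiles are illustrative choices rather than the authors' analytic code.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (LLM rating, human consensus rating)


def percentage_agreement(pairs: List[Pair]) -> float:
    """Proportion of pairs in which the LLM and the human consensus agree."""
    return sum(llm == human for llm, human in pairs) / len(pairs)


def cluster_bootstrap_ci(reviews: List[List[Pair]], n_boot: int = 5000,
                         seed: int = 2026) -> Tuple[float, float]:
    """95% percentile bootstrap CI, resampling whole reviews with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resampled = rng.choices(reviews, k=len(reviews))  # clusters, not single pairs
        pooled = [pair for review in resampled for pair in review]
        estimates.append(percentage_agreement(pooled))
    estimates.sort()
    lower = estimates[n_boot * 25 // 1000]        # approx. 2.5th percentile
    upper = estimates[n_boot * 975 // 1000 - 1]   # approx. 97.5th percentile
    return lower, upper


# Toy data: three reviews with a handful of element-level judgements each.
R, N = "reported", "not reported"
toy_reviews = [
    [(R, R), (R, R), (N, R)],
    [(N, N), (R, N), (R, R)],
    [(R, R), (N, N), (N, N)],
]
print(cluster_bootstrap_ci(toy_reviews, n_boot=1000))
```

Resampling entire reviews keeps all judgements from the same review together in each replicate, which is how the within-review correlation is carried into the interval.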
| | | LLM | |
|---|---|---|---|
| | | Reported | Not reported |
| Reference standard (human consensus judgement) | Reported | TP | FN |
| | Not reported | FP | TN |

TP, true positive; FN, false negative; FP, false positive; TN, true negative.

| Term | Interpretation in the context of current study |
|---|---|
| Percentage agreement | Measures the proportion of elements/items that are correctly classified by the LLM as reported or not reported relative to the human consensus judgements. Higher percentage agreement indicates that humans and LLMs are consistently producing the same judgement. The formula is: (TP + TN)/(TP + TN + FP + FN) (see Table 1 for definitions) |
| Gwet’s Agreement Coefficient (AC) | Measures the proportion of elements/items that are correctly classified by the LLM as reported or not reported relative to the human consensus judgements (corrected for chance) |
| Sensitivity (recall) | Measures the proportion of elements/items rated as reported in the human consensus judgements that the LLM also classifies as reported. Higher sensitivity indicates that fewer elements/items are incorrectly identified by the LLM as not reported. The formula is: TP/(TP + FN) |
| Specificity | Measures the proportion of elements/items rated as not reported in the human consensus judgements that the LLM also classifies as not reported. Higher specificity indicates that fewer elements/items are incorrectly identified by the LLM as reported. The formula is: TN/(TN + FP) |
| Positive predictive value (precision) | Measures the proportion of elements/items that an LLM identified as reported that are actually reported. Higher positive predictive value indicates that when the LLM identifies an element/item as reported, it is likely to be correct. The formula is: TP/(TP + FP) |
| Negative predictive value | Measures the proportion of elements/items that an LLM identified as not reported that are actually not reported. Higher negative predictive value indicates that when the LLM identifies an element/item as not reported, it is likely to be correct. The formula is: TN/(TN + FN) |
| F1 score | Harmonic mean of sensitivity and precision. Higher F1 scores indicate better overall accuracy in identifying true positives while minimising false positives and false negatives. The formula is: 2 × (Sensitivity × Precision) / (Sensitivity + Precision) |
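The metrics in the table above can all be derived from the four cells of the 2 × 2 table. The sketch below is illustrative only: the function name and the example counts are hypothetical, and for Gwet's coefficient it assumes the two-rater, two-category AC1 form, in which chance agreement is 2π(1 − π) with π the mean proportion of "reported" ratings across the LLM and the reference standard.

```python
from typing import Dict


def accuracy_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Compute agreement and accuracy metrics from the cells of a 2 x 2 table."""

    def ratio(numerator: float, denominator: float) -> float:
        # Undefined ratios (empty denominators) are returned as NaN.
        return numerator / denominator if denominator else float("nan")

    n = tp + fn + fp + tn
    agreement = ratio(tp + tn, n)
    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    ppv = ratio(tp, tp + fp)
    npv = ratio(tn, tn + fn)
    f1 = ratio(2 * sensitivity * ppv, sensitivity + ppv)
    # Gwet's AC1 (assumed form): pi is the mean share of "reported" ratings
    # across the reference standard and the LLM.
    pi = (ratio(tp + fn, n) + ratio(tp + fp, n)) / 2
    chance = 2 * pi * (1 - pi)
    ac1 = ratio(agreement - chance, 1 - chance)
    return {
        "percentage_agreement": agreement,
        "gwet_ac1": ac1,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }


# Hypothetical counts for one element across 100 reviews.
metrics = accuracy_metrics(tp=40, fn=10, fp=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For these hypothetical counts, percentage agreement is 0.85 while the chance-corrected coefficient is about 0.70, which illustrates why both are reported.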
To generate each of the above metrics we will perform the following comparisons:
• GPT-5.4 response versus human consensus response;
• Gemini 3.1 Pro Preview response versus human consensus response;
• Claude Opus 4.6 response versus human consensus response;
• Qwen3.6 Plus response versus human consensus response;
• Response selected by at least three of the LLMs versus human consensus response;
• Response selected by all four of the LLMs versus human consensus response.
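The last two comparisons combine the four LLM responses into a single judgement. A minimal sketch of that combination follows; the function name is hypothetical, and returning `None` when the vote threshold is not met (for example, a two-against-two split) is an assumption, as the protocol does not state how such cases will be handled.

```python
from collections import Counter
from typing import List, Optional


def combined_response(responses: List[str], min_votes: int) -> Optional[str]:
    """Return the response chosen by at least min_votes models, or None if no response qualifies."""
    label, votes = Counter(responses).most_common(1)[0]
    return label if votes >= min_votes else None


# One question, four hypothetical model responses.
votes = ["Yes", "Yes", "No", "Yes"]
print(combined_response(votes, min_votes=3))  # agreement among at least three LLMs
print(combined_response(votes, min_votes=4))  # unanimity among all four LLMs
```

With four models and a threshold of three or four, at most one response can qualify, so the most common response is the only candidate that needs checking.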
We will also investigate discrepancies between LLM responses and the human reference standard by systematically comparing the supporting evidence extracted by both LLMs and humans. This analysis will explore potential sources of disagreement, including differences in reasoning approaches or use of different supporting evidence. We acknowledge the inherent limitations of treating the human consensus responses as our reference standard and recognise that they do not constitute a perfect gold standard. Consequently, we will critically examine cases where LLMs diverge from human responses to determine whether the model outputs may, in some instances, be equally or more accurate.
We will disseminate the findings of this study via publication in a peer-reviewed scientific journal and at relevant academic conferences and seminars to reach researchers, journal editors, and developers of reporting guidelines and LLMs. We will make LLM prompts, the clean dataset, and analytic code publicly accessible to facilitate verification of our work and future research in this area.
As of 14 April 2026, we have run the searches and screened and identified 200 systematic reviews for inclusion in the training and test sets, commenced drafting of the PRISMA-Check tool, tested the Python pipeline to pre-process all source materials for 15 systematic reviews, and commenced initial drafting of prompts.
We believe the detailed assessments of PRISMA 2020 adherence performed in our study will be more informative to various interest holders than the item-level assessments performed in other studies. The PRISMA 2020 items were designed to be broad, with the elements providing the specific recommendations. For example, item 20c reads, “Present results of all investigations of possible causes of heterogeneity among study results”. Without consulting the elements for this item, humans and LLMs may interpret the term “results” differently; for example, does a “result” mean a P value for a test for interaction for a subgroup analysis, or does it mean the summary estimate and a measure of precision for each subgroup, or all of these? Furthermore, humans and LLMs might provide the same response to a question about adherence to this item yet be drawing on different supporting evidence, depending on how they have interpreted the question. By reframing the items into questions and providing guidance for when to select each response option and examples of optimal reporting to guide human and LLM responses, we hypothesise that this approach will yield improved performance estimates as compared with those observed in previous studies.
Our study will determine which questions in the PRISMA-Check tool can be accurately automated by LLMs. This knowledge will help inform which questions need the most human oversight by meta-researchers, peer reviewers and other interest holders seeking to assess adherence. We recognise that the number of questions in our adherence tool is large and that authors and editors of a systematic review might not want a report providing many Yes/No judgements. Future research could build on our planned work to identify how best to consolidate the adherence assessments and to develop tools that provide constructive feedback to authors for improving their systematic reviews.
Is the rationale for, and objectives of, the study clearly described?
Yes
Is the study design appropriate for the research question?
Partly
Are sufficient details of the methods provided to allow replication by others?
Partly
Are the datasets clearly presented in a useable and accessible format?
Not applicable
References
1. Page M, Mayo-Wilson E, Zeng M, Clark D, et al.: Evaluation of automated assessments of systematic review adherence to the PRISMA 2020 statement: study protocol. F1000Research. 2026; 15.
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Interest in AI applications in Medicine, Scientific Writing, Peer Review, and Editorialship
Version 1: 04 May 26