Diagnosis of pelvic endometriosis: a systematic review and accuracy meta-analysis of non-invasive tests.

Background: Endometriosis is a chronic, often debilitating condition, currently with a significant delay from symptom onset to diagnosis. Objectives: To investigate the accuracy of symptoms, clinical history and non-invasive tests to predict pelvic endometriosis. Data sources: Medline, Embase, Web of Science and Scopus from inception to September 2022. Selection criteria: Primary test accuracy studies assessing selected non-invasive tests against a reference standard diagnosis for endometriosis. Data extraction and synthesis: Two authors independently conducted data extraction and study quality assessment. Grading of evidence was performed using a novel visual pentagon model. Test accuracy was meta-analysed using bivariate random effects models. Results: The 125 included studies (250,574 participants) showed mixed quality. Studies applying a non-surgical (database/self-reporting) reference standard had a greater risk of bias. In 98 studies applying a surgical reference standard, summary diagnostic odds ratios were: dysmenorrhoea 2.56 (95% confidence interval 1.99-3.29); pelvic pain 2.56 (1.73-3.74); dyschezia 2.05 (1.36-3.10); dyspareunia 2.45 (1.71-3.52); family history of endometriosis 6.79 (4.08-11.3); nulligravidity 2.01 (1.62-2.50); BMI ≥30 kg/m2 0.37 (0.19-0.68); TVUSS endometrioma 91.2 (44.0-189); TVUSS invasive endometriosis 26.1 (9.28-73.5); and CA-125 >35U/mL 16.0 (8.09-31.7). Sensitivity analysis excluding all high-risk studies found concordant results. Conclusions: This meta-analysis collated the performance of non-invasive tests for endometriosis across a comprehensive and geographically varied population. Study quality was mixed; however, results were consistent when high-risk studies were excluded. These findings will inform future prediction models for triage in primary care. Funding: This research received no specific funding.


Introduction:
[8][9] Reducing this has long been a research priority. 10,11 [13][14] A better understanding of the accuracy with which symptoms, clinical history and non-invasive tests diagnose pelvic endometriosis would aid triage and referral.
[16][17] These are often invasive or otherwise not applicable to primary care. 16,18 Previous meta-analyses have been restricted by narrow inclusion criteria yielding a small number of eligible studies, limiting findings. 19 Meta-analyses assessing imaging and biomarkers showed that no individual test met their criteria as a replacement or triage test alone, but findings on transvaginal ultrasound and serum CA-125 showed a high specificity for disease. 20,21 These studies did not include assessment of other clinical factors. A 2019 narrative review demonstrated the importance of clinical factors in prediction of disease, but primary studies were not assessed for quality and the absence of meta-analysis limited quantitative assessment of test performance. 22 A 2021 case-control study reporting on the accuracy of a simple patient-completed questionnaire identified those at high or low risk of disease with good accuracy, reflecting the utility of assessing patient-reported symptoms at a primary care level. 24 Research related to endometriosis is increasing in volume, with 75% of primary studies published in the last decade. 25,26 Given this, and the limitations of previous reviews, a new comprehensive systematic review and meta-analysis is required. We therefore determined the accuracy of symptoms, clinical history, a simple low-cost biomarker and first-line ultrasound for the diagnosis of pelvic endometriosis by means of a comprehensive systematic review and accuracy meta-analysis.

Methods:
The protocol was designed and registered with PROSPERO (Registration number: CRD42020187543). 27 Reporting followed the PRISMA guidelines. 28

Patient and public involvement
A patient and public involvement meeting was held following an open invitation on social media. The aim of reducing the diagnostic delay was well supported and resonated with attendees' personal experience.

Literature search and study selection
Medline, Embase, Web of Science and Scopus were searched from inception to September 2022. Search strategies are shown in Appendix S1. Review article references were also screened. Titles and abstracts were screened independently by two authors (TB; SJ) using EndNote-X9 and duplicate or irrelevant studies removed. 29 Full texts were screened and justification for inclusion or exclusion recorded; differences were resolved by discussion with the senior author (AR). We included published peer-reviewed studies reporting accuracy estimates to predict pelvic endometriosis (peritoneal; ovarian; or invasive disease with >5 mm retroperitoneal invasion) for one or more index tests in participants with reported presence/absence of endometriosis. Included tests were dysmenorrhoea; pelvic pain; dyschezia; dyspareunia; nulligravidity; BMI ≥30 kg/m2; family history of endometriosis; transvaginal ultrasound (TVUSS) finding of endometrioma; TVUSS finding of invasive disease; and serum CA-125 >35U/mL. The target population was women of reproductive age, excluding those who were pregnant or had systemic co-morbidities. Studies reporting non-reproductive age participants were included only where their data could be excluded from meta-analysis. Studies were included where a 2x2 contingency table for index test(s) could be constructed. We imposed no limits on language, setting, or number of participants. All non-English studies were translated by a medically trained native speaker.
Studies reporting laboratory tests and imaging were included only when these were performed prior to the reference standard and, for ultrasound, used only standard 2D protocols. Definitions for each test for the purposes of study selection are shown in Appendix Table 1.
We excluded reviews; case reports; studies where information on recruitment or study population was unavailable; letters; and abstracts. Studies reporting non-pelvic endometriosis were included only where pelvic endometriosis was reported separately. Authors were contacted only to obtain full texts; failing this, full texts were obtained through the British Library. Studies with incomplete data preventing determination of inclusion, exclusion or test accuracy were excluded.

Data extraction and quality assessment
For each included study, two authors (TB; SJ) independently recorded information on study characteristics and data was extracted to form 2x2 tables. Where data extraction from non-English language studies was unreliable, these studies were excluded.
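As an illustration of what each extracted 2x2 table yields, the sketch below derives the standard accuracy measures from the four cell counts. The class name `TwoByTwo` and the counts are hypothetical and are not taken from any included study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Cell counts for one index test against the reference standard."""
    tp: int  # index test positive, endometriosis present
    fp: int  # index test positive, endometriosis absent
    fn: int  # index test negative, endometriosis present
    tn: int  # index test negative, endometriosis absent

    @property
    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    @property
    def lr_positive(self) -> float:
        # How much a positive result raises the odds of disease
        return self.sensitivity / (1 - self.specificity)

    @property
    def lr_negative(self) -> float:
        # How much a negative result lowers the odds of disease
        return (1 - self.sensitivity) / self.specificity

    @property
    def diagnostic_odds_ratio(self) -> float:
        return (self.tp * self.tn) / (self.fp * self.fn)


# Hypothetical counts for illustration only
table = TwoByTwo(tp=40, fp=10, fn=60, tn=90)
print(f"sensitivity={table.sensitivity:.2f}, specificity={table.specificity:.2f}")
print(f"LR+={table.lr_positive:.2f}, LR-={table.lr_negative:.2f}")
print(f"DOR={table.diagnostic_odds_ratio:.2f}")
```

The sketch does not handle zero cells; a table with no false positives or no false negatives would need a continuity correction before the ratios can be computed.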
Risk of bias and applicability was assessed independently by two authors (TB; SJ) using the Quality Assessment of Diagnostic Test Accuracy Studies (QUADAS-2) tool. 30 For studies regarding serum CA-125 or TVUSS we included the additional signalling questions: 'was the index test performed by a single operator?' to assess inter-observer bias; and 'was timing in the participants' menstrual cycle controlled for?'. We adjusted the original question 'if a threshold was used, was it pre-specified?' to 'was there a clear definition of what was considered a positive test?'.

Data synthesis
Due to differences in design, studies were divided into groups according to application of the reference standard: 'Complete verification', all participants received visual inspection of the pelvis at surgery; 'Partial verification', all cases received surgical confirmation but controls did not; and 'Database/self-reporting', cases confirmed by healthcare coding or self-reporting and controls from healthy populations not known to have endometriosis.
Statistical analysis was conducted using Stata software (version 15) 31 to allow exploration of heterogeneity and statistical pooling, producing summary accuracy measures and summary receiver operating characteristic curves for each index test. A bivariate random effects model was applied for index tests with ≥5 contributing studies, and a univariate fixed effects model for index tests with ≤4.
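For readers unfamiliar with fixed effects pooling, the following is a minimal sketch of one common univariate approach: inverse-variance pooling of a proportion (for example sensitivity) on the logit scale. It is not the Stata routine used in this review. The function names, the 0.5 continuity correction and the study-count rule encoded in `choose_model` are illustrative assumptions, and the bivariate random effects model is not reproduced here.

```python
import math


def choose_model(n_studies: int) -> str:
    """Model choice by number of contributing studies, as stated in the text."""
    return "bivariate random effects" if n_studies >= 5 else "univariate fixed effects"


def _expit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def pool_logit_fixed(events: list[int], totals: list[int]) -> tuple[float, float, float]:
    """Inverse-variance fixed effects pooling of proportions on the logit scale.

    Returns (pooled proportion, lower 95% limit, upper 95% limit).
    """
    logits, weights = [], []
    for e, n in zip(events, totals):
        # Assumed 0.5 continuity correction keeps the logit finite for zero cells
        pos, neg = e + 0.5, (n - e) + 0.5
        logits.append(math.log(pos / neg))
        weights.append(1.0 / (1.0 / pos + 1.0 / neg))  # weight = 1 / variance
    pooled = sum(w * y for w, y in zip(weights, logits)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return _expit(pooled), _expit(pooled - 1.96 * se), _expit(pooled + 1.96 * se)


# Hypothetical true-positive counts and numbers of cases from three small studies
estimate, lower, upper = pool_logit_fixed(events=[30, 45, 12], totals=[100, 120, 40])
print(choose_model(3), f"{estimate:.3f} ({lower:.3f}-{upper:.3f})")
```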
Index tests were assessed for performance as a 'rule-in' or 'rule-out' tool with pre-specified threshold summary accuracy of 95% specificity/50% sensitivity or 95% sensitivity/50% specificity respectively.
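The threshold check can be sketched as below, taking 'rule-in' to require at least 95% specificity with at least 50% sensitivity and 'rule-out' the reverse, which is how the thresholds are applied in the main findings. The function name and the use of inclusive comparisons are assumptions; the example inputs are the summary estimates reported in the test accuracy results.

```python
def classify_test(sensitivity: float, specificity: float) -> str:
    """Classify summary accuracy (as proportions) against the pre-specified thresholds."""
    if specificity >= 0.95 and sensitivity >= 0.50:
        return "rule-in"
    if sensitivity >= 0.95 and specificity >= 0.50:
        return "rule-out"
    return "neither"


# Summary sensitivity and specificity as reported for the complete verification group
print(classify_test(0.772, 0.964))  # TVUSS finding of endometrioma
print(classify_test(0.558, 0.927))  # serum CA-125 >35U/mL
print(classify_test(0.363, 0.811))  # dyspareunia
```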

Study selection and characteristics
Of 22,016 studies identified, 125 met the inclusion criteria, involving 250,574 participants (Figure 1). Characteristics of included and excluded studies are shown in Table S1. Details of included studies by number of participants and index test(s) are shown in Appendix Table 1.
The mean number of index tests per study was 2 (range 1-6). A total of 241 index tests were assessed across all studies. Included studies were geographically varied: 45 from Europe, 34 from North America, 19 from Asia/Oceania, 13 from the Middle East/Africa, 12 from South America, and 2 transcontinental. Publication date ranged from 1986 to 2022 with 57% since 2010: 4 before 1989; 22, 1990-1999; 18, 2000-2009; and 71, 2010-2022. Most studies (75) were of 'single-gate' design, with 50 of 'two-gate' design, including all studies in the partial verification and database/self-reporting groups. The mean prevalence of endometriosis in studies of 'single-gate' design was 52% (range 9-93%); owing to the selection of matched controls, prevalence in 'two-gate' studies was not relevant. There was heterogeneity in population selection, with participants having surgery for a broad range of indications such as uterine fibroids or adnexal cysts as well as pelvic pain or sub-fertility.
In the 61 studies assessing symptom-based tests, 20 did so by self-administered questionnaire; 14 by structured interview; 12 by clinical history taking; and 15 were undefined.

Risk of bias
The assessment of study quality by QUADAS-2 is presented in Figure 2 and Supplementary Figures S2-5. Overall methodological quality was mixed, with 5 studies presenting a low risk of bias across all domains [32][33][34][35][36] and 64 presenting a high risk of bias or applicability in at least one domain.
In patient selection, 22 studies presented a low risk of bias, with 73 and 30 presenting an unclear or high risk respectively. Non-consecutive or non-random selection, two-gate selection for cases and controls, and having a highly selected group of participants (infertility cohort, surgery for a narrow indication etc.) were the main reasons for a high risk of bias.
Symptom-based index tests presented an unclear or high risk of bias due to a lack of definition of a positive test and of blinding. Just 9 studies presented a low risk in symptom-based tests across all groups. Index tests applicable to clinical history or investigations performed better, with 66 studies presenting a low risk. Reasons for an unclear or high risk of bias were a lack of pre-specified criteria for a positive test; no blinding to results of the reference standard; and inter-observer variability regarding imaging. Studies in the partial verification group assessed a proportionally higher number of the index tests nulligravidity and BMI ≥30 kg/m2, which, being less subject to interpretation, presented a lower risk of bias.
Risk of bias regarding the reference standard was lowest in the complete and partial verification groups, where 74 studies were at low risk. Those with an unclear or high risk lacked information on how likely the surgery was to correctly classify the target condition, or operators were not blinded to the result of the index test(s). In the database/self-reporting group, 5 studies assessed for probable surgical confirmation by means of additional codes at the time of recording and therefore presented a lower risk; all other studies were high risk.
In flow and timing, the complete verification group presented the lowest risk. An unclear or high risk of bias was attributable to a long (>12 months) or unclear time interval, and a high or unclear withdrawal of participants from analysis. All other studies presented a high risk as not all participants received the same reference standard.

Applicability
In patient selection, 40 studies gave low concern, with 63 and 22 giving unclear or high concern respectively. An unclear or high concern was attributable to the two-gate selection of controls, or to the study being likely to classify only a limited spectrum of disease (tertiary centres or infertility clinics).
With regard to the reference standard, 97 studies showed a low concern. Studies in the database/self-reporting group were deemed high/unclear depending on whether additional coding input for surgery was recorded.

Test accuracy
Due to heterogeneity in methodology and study quality, meta-analysis was performed on studies from each group separately.
The accuracy of index tests in predicting endometriosis was variable, although results across groups were consistent. Each index test increased the likelihood of pelvic endometriosis, apart from BMI ≥30 kg/m2, which decreased the likelihood of disease. The positive likelihood ratio (LR+) for disease was highest in investigation tests and there was a trend towards a greater specificity than sensitivity. The summary results of bi/univariate meta-analysis are shown in Figure 3. An assessment of confidence in the individual sensitivity and specificity of each test is displayed by a visual pentagon model; the methodology for this assessment is described in the discussion, with the legend shown in Figure 4.
Investigation category tests were the best performing overall and TVUSS finding of endometrioma gave the highest summary LR+ at 21.6, at sensitivity and specificity of 77.2% and 96.4% respectively. Serum CA-125 >35U/mL showed sensitivity and specificity of 55.8% and 92.7% respectively, with LR+ of 7.63. TVUSS finding of deep infiltrating endometriosis (DIE) showed sensitivity and specificity of 86.5% and 80.2% with LR+ of 4.39.
Symptom-based tests showed LR+ within a similar range: 1.47 (dysmenorrhoea) to 1.93 (dyspareunia). Symptoms showed a generally higher specificity than sensitivity. Dyspareunia showed the highest LR+, with a sensitivity and specificity of 36.3% and 81.1% respectively.
Family history of endometriosis showed an LR+ of 6.25 with a high specificity (98.5%) but low sensitivity (9.25%). The finding of BMI ≥30 kg/m2 showed a decreased likelihood of diagnosis of endometriosis (LR+ 0.44).
Hierarchical Summary Receiver Operating Characteristic (HSROC) curves for index tests in each group are shown in Figures S6-8. The HSROC curves show the greatest area under the curve (AUC) for investigation category tests.
In the partial verification group, symptom index tests showed a greater LR+ than in the complete verification group, range 2.47 (dysmenorrhoea) to 7.13 (dyschezia). Specificity was also higher, range 69% (dysmenorrhoea) to 92% (dyschezia).
In the database/self-reporting group, symptom-based index tests performed similarly to other groups. In the partial verification and database/self-reporting groups, BMI ≥30 kg/m2 showed no correlation with disease and had a 95% CI crossing 1.0. In all other index tests across all groups the 95% CI lay entirely above 1.0.
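To make the 'crossing 1.0' criterion concrete, the sketch below computes a diagnostic odds ratio with a 95% confidence interval from a single 2x2 table using the standard log (Woolf) method. It assumes the interval in question belongs to a ratio measure such as the diagnostic odds ratio; the function name and both sets of counts are hypothetical, and the pooled intervals in this review come from the meta-analytic models rather than from this single-table formula.

```python
import math


def dor_with_ci(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float, float]:
    """Diagnostic odds ratio and 95% CI by the log (Woolf) method.

    Assumes all four cells are non-zero.
    """
    dor = (tp * tn) / (fp * fn)
    se_log = math.sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn)
    lower = math.exp(math.log(dor) - 1.96 * se_log)
    upper = math.exp(math.log(dor) + 1.96 * se_log)
    return dor, lower, upper


def crosses_one(lower: float, upper: float) -> bool:
    """True when the interval includes 1.0, i.e. no clear association."""
    return lower <= 1.0 <= upper


# Hypothetical tables: one with a clear association, one without
for counts in [(40, 10, 60, 90), (20, 18, 80, 82)]:
    dor, lower, upper = dor_with_ci(*counts)
    print(f"DOR {dor:.2f} ({lower:.2f}-{upper:.2f}), crosses 1.0: {crosses_one(lower, upper)}")
```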
The greatest inter-study variability in confidence intervals was shown in forest plots for the symptom-based tests, notably pelvic pain. The inter-study variance for specificity was generally lower than that for sensitivity, as was the overall width of confidence intervals. Forest plots for each index test in each group are shown in Supplementary Figures S9-15.
Sensitivity analysis performed for studies without any high-risk features is shown in Table 1. All studies included are from the complete verification group. Summary accuracy measures are consistent with those in this group for the majority of index tests, although sensitivity for TVUSS finding of endometrioma and DIE reduced to 69.8% and 73.4% respectively.

Main findings
This meta-analysis presents an up-to-date, large, and geographically varied data set identifying predictive factors for diagnosis of pelvic endometriosis with a high degree of confidence. Index tests showed a positive association with endometriosis and trended towards a greater specificity than sensitivity, excluding elevated BMI, which demonstrated an inverse correlation. TVUSS finding of endometrioma reached the pre-specified threshold for use as a 'rule-in' test, and no test achieved a summary sensitivity of >95% for 'rule-out'. A family history of endometriosis, dyschezia and serum CA-125 >35U/mL showed summary specificity of >90%, although with low sensitivity. Sensitivity was poor for symptom and clinical history tests, where the best performing was dysmenorrhoea.

Strengths and limitations
We undertook a thorough search of the current literature, with analysis by two independent reviewers and strict quality assessment. Attempts were made to mitigate inter-study heterogeneity by division of studies into groups. All index tests are relevant to primary care and immediately available without novel techniques or additional training. There were, however, limitations. Due to difficulties in data extraction from some non-English journals, 15 studies were excluded from the analysis. Some studies, such as Chapron et al 2005, which was seminal in providing a clinical prediction model for moderate/severe endometriosis, could not be included because 2x2 tables could not be constructed. 37 We did not contact authors to obtain individual data that was not available in the published text.
Overall, there was significant methodological variance and population heterogeneity in age; presentation; and stage of disease. Variation in selection of cases and controls may not reflect a clinically representative population. Prevalence of disease was higher than seen in the general female population, which may reflect a high degree of surgical accuracy, but also indicates the selective nature of study populations.
There is the possibility of inappropriate assignment of cases and controls, occurring in both directions due to uneven application of the reference standard, although we attempted to account for this by assigning groups. [40] There was variation in the definition of positive symptom index tests. [43] Assessment of symptoms varied, with a self-administered questionnaire the most common method. Although the use of standardised validated tools would better allow for comparison across studies, the nuance and detail acquired through clinical history taking is likely to better grasp the nature and significance of a symptom and its implications.
It is likely that imaging and surgical techniques have developed over time. A trend towards recent studies may mitigate this.
Considering the balances of strengths and weaknesses, however, we believe that our data synthesis presents an objective summary of the current evidence.

Interpretation
An understanding of the degree of likelihood associated with various symptoms and features in the clinical history can help assessment of patients with possible endometriosis in primary care.
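One way to read a likelihood ratio at the bedside is to convert a pre-test probability into a post-test probability through odds. The sketch below applies that standard calculation to two summary LR+ values reported in the test accuracy results. The 10% pre-test probability is an assumed figure chosen purely for illustration, and the function name is hypothetical.

```python
def post_test_probability(pre_test_probability: float, likelihood_ratio: float) -> float:
    """Update a probability with a likelihood ratio via odds."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


ASSUMED_PRE_TEST = 0.10  # illustrative assumption, not an estimate from this review

# Summary LR+ values as reported for the complete verification group
for label, lr_pos in [("TVUSS endometrioma", 21.6), ("dyspareunia", 1.93)]:
    prob = post_test_probability(ASSUMED_PRE_TEST, lr_pos)
    print(f"{label}: post-test probability {prob:.1%}")
```

Under this assumed starting point, a positive ultrasound finding of endometrioma moves the probability of disease far more than a report of dyspareunia does, which is the practical meaning of the difference between the two likelihood ratios.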
The negative association between elevated BMI and endometriosis shown in the complete verification group is consistent with that demonstrated previously. 44 This was not replicated across other groups. This may reflect a stronger negative correlation with elevated BMI in the higher-risk populations of all-surgical cohorts, who may have more severe disease. This possibility is consistent with previous studies demonstrating a significantly lower BMI in those with severe compared to mild disease and a 12-14% decrease in the likelihood of endometriosis being diagnosed for each unit increase in BMI (kg/m2). 32,45 The interplay between BMI and endometriosis pathogenesis, however, remains poorly understood.
The tendency of the partial verification and database/self-reporting groups to show better accuracy measures was likely a reflection of the selection of controls. This effect seems to outweigh the possibility of an undiagnosed disease burden in those not exposed to a surgical reference standard. The accuracy of self-reported diagnosis of endometriosis has been assessed and performs well; 46 false attribution of disease in the self-reporting group may therefore present only a small source of bias.
A greater specificity than sensitivity of tests may be associated with their correlation to disease severity. Dyschezia and dyspareunia have been linked to severe disease due to the involvement of a precise anatomical location in invasive disease, for example, but are less often present in mild cases. 47,48 Tests showing a greater sensitivity, such as dysmenorrhoea, were also less specific and may only become specific for endometriosis in more severe forms.
Previous systematic reviews have similarly highlighted the heterogeneity and poor methodological quality of primary studies, limiting interpretation of findings. 17,49 As our methodology allowed wide inclusion criteria, we applied a novel grading protocol to assess limitations more quantitatively. Grading of evidence for index tests was performed for sensitivity and specificity by application of a visual pentagon model for grading of test accuracy studies described by Rogozinska and Khan. 50 This methodology is described in detail elsewhere but, briefly, studies were given a score of 0 to -2 in each of 5 domains: design (study design type); risk of bias (QUADAS-2 risk of bias); indirectness (QUADAS-2 applicability); inconsistency (visual assessment of inter-study variance in confidence intervals); and imprecision (width of confidence intervals). The complete verification group showed the fewest limitations, whilst the database/self-reporting studies showed very serious limitations. There was greater limitation in the investigation category tests due to more highly selective populations and a generally higher inter-study inconsistency and imprecision.

Research recommendations
The need for high-quality studies of predictive factors for endometriosis remains, particularly assessing populations attending primary care. Further multivariate analysis in well-powered primary observational studies assessing factors that can be immediately and readily assessed in primary care would be of great value, as we anticipate the index tests assessed in this study to provide a greater degree of accuracy when applied in combination. 4,37,51 We examined serum CA-125 at a cut-off of >35 U/mL, considered the upper limit of the normal range; a meta-analysis from 2016 found that a cut-off of 30 U/mL gave a sensitivity and specificity of 52% and 93% respectively, but sensitivity dropped to just 24% for detection of minimal disease. 53 Further research assessing the accuracy of CA-125 at different thresholds and in combination with other tests could help improve accuracy.
Two recent studies (Fauconnier et al 2021 and Chapron et al 2022) assessed the accuracy of a patient-completed questionnaire and epidemiological data for the early identification of endometriosis and found it could do so with high diagnostic accuracy. 24,55 Although these studies were conducted in a high-risk population undergoing surgery, the model maintained accuracy in a population with a lower endometriosis prevalence of 10%. We do not anticipate a clinical score replacing laparoscopy due to its added therapeutic advantages and the requirement to exclude other pathologies. If, however, disease can be predicted with a high degree of accuracy early on, medical therapy may be instigated, and referral made for definitive diagnosis and counselling regarding treatment, prognosis and fertility in a timely manner, with the aim of reducing the current extraordinary delay.