Audit of transvaginal sonography of normal postmenopausal ovaries by sonographers from the United Kingdom Collaborative Trial of Ovarian Cancer Screening (UKCTOCS)

Background: We report on a unique audit of seven sonographers self-reporting high visualization rates of normal postmenopausal ovaries in the United Kingdom Collaborative Trial of Ovarian Cancer Screening (UKCTOCS). This audit was ordered by the trial’s Ultrasound Management Subcommittee after an initiative taken in 2008 to improve the quality of scanning and the subsequent increase in the number of sonographers claiming very high ovary visualisation rates. Methods: Seven sonographers reporting high rates (>89%) of visualizing normal postmenopausal ovaries in examinations performed between 1st January and 31st December 2008 were identified. Eight experts in gynaecological scanning reviewed a random selection of exams performed by these sonographers and assessed whether visualization of both ovaries could be confirmed (cVR-Both) in the examinations. A random effects bivariate probit model was fitted to analyse the results. Results: The eight experts reviewed images from 357 examinations performed on 349 postmenopausal women (mean age 60.0 years, range 50.2-73.3) by the seven sonographers. The mean cVR-Both obtained from the model for these sonographers was 67.2% (95% CI 63.9-70.5%), with a range of 47.6-86.5%. The range of cVR-Both between the experts was 47.3-88.3% and the intra-class correlation coefficient (ICC) for left and right ovary confirmation was 0.39. Conclusions: The audit suggests that self-reported visualization of postmenopausal ovaries is unreliable, as visualisation of both ovaries could not be confirmed in almost a third of examinations. The agreement for visualization of both ovaries based on review of a static image, between experts and sonographers and between expert reviewers alone, was only moderate. Further research is needed to develop reliable Quality Control metrics for transvaginal ultrasound.


Introduction
The normal ovary of a postmenopausal woman is a small structure (mean volume 1.25ml 1 ) usually situated lateral to the uterine fundus and in close relation to the internal iliac vein. In as many as 40% of transvaginal ultrasound (TVS) examinations 2 the ovary may not be seen, as ovaries typically shrink with age and are sometimes very difficult to locate 3,4 . For this reason, in the United Kingdom Collaborative Trial of Ovarian Cancer Screening (UKCTOCS) and other screening trials 2,5,6 a pragmatic approach is taken whereby an annual screening examination may be judged satisfactory even if both ovaries are not seen, provided that a good view has been achieved of the iliac vessels in the pelvic side wall. However, the sonographer should always attempt to visualize both ovaries as this provides the maximum assurance that an early ovarian cancer has been excluded.
A metric commonly used in the quality control (QC) of TVS is self-reported visualisation rate (VR), defined as the number of examinations in which the ovaries were visualized as a proportion of all examinations performed by the sonographer 7 . In 2008, UKCTOCS implemented an accreditation programme which included the monitoring of individual sonographer VR over a 3-month period 8 . This revealed that some sonographers were self-reporting higher than expected VR. Therefore, in 2009, it was decided to audit the performance of these high-scoring sonographers to confirm independently whether it is possible to achieve high rates of ovary visualisation in postmenopausal women. We report on this audit and its outcome.

UKCTOCS trial
The TVS examinations in this study were performed as part of UKCTOCS, which is a multi-centre randomized controlled trial of 202,638 women volunteers from 13 trial centres throughout Northern Ireland, Wales and England (ISRCTN22488978). The inclusion criteria specified by the trial protocol were postmenopausal women aged 50-74 years. The women were randomised into three groups, with the ultrasound arm involving 50,639 women who underwent annual TVS examinations.
Sonographers performing the examinations were required to 1) record whether the ovary had been visualized, 2) measure the ovary in 3 orthogonal dimensions, and 3) comment on its morphology. These observations were stored centrally in the Trial Management System (TMS). The sonographer measured the dimensions of each ovary using digital callipers manually positioned on the extent of the ovary boundary in static images in two orthogonal planes during the examination; see Figure 1.
The distance between the calliper marks was displayed in millimetres at the bottom of the image and copied into the TMS exam record fields as D1, D2 and D3. D1 represents the longest ovarian distance in longitudinal section (LS) and D2 is the widest distance (anteroposterior, AP) that can be measured at 90° to the line used to measure D1. The largest diameter of the ovary in transverse section (TS) is measured as D3. These dimensions allow calculation of ovarian volume using the prolate ellipsoid formula: D1 × D2 × D3 × 0.523.
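As an illustration of the volume calculation above, the following sketch (assuming the standard prolate ellipsoid constant π/6 ≈ 0.523 and diameters recorded in millimetres; the function name is ours) converts the three calliper distances into a volume in millilitres:

```python
# Ovarian volume from three orthogonal diameters using the
# prolate ellipsoid formula V = (pi/6) * D1 * D2 * D3, where
# pi/6 is approximately 0.523. Diameters are in millimetres,
# so the result is divided by 1000 to convert mm^3 to ml.
import math

def ovarian_volume_ml(d1_mm: float, d2_mm: float, d3_mm: float) -> float:
    """Return ovarian volume in millilitres (1 ml = 1000 mm^3)."""
    volume_mm3 = (math.pi / 6.0) * d1_mm * d2_mm * d3_mm
    return volume_mm3 / 1000.0

# A hypothetical ovary measuring 20 x 12 x 10 mm:
print(round(ovarian_volume_ml(20, 12, 10), 2))  # 1.26
```

An ovary of roughly 20 × 12 × 10 mm therefore yields about 1.26 ml, close to the mean postmenopausal ovarian volume of 1.25 ml cited in the Introduction.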
The TVS images used to measure the ovaries for each patient were saved on the ultrasound machines at each of the 13 trial centres and periodically copied onto disks which were sent by courier to the trial coordinating centre in London where they were copied into a bespoke computer system called the Ultrasound Record Archive (URA). These archived static images allow independent confirmation as to whether the feature measured was an ovary, thus permitting a subsequent audit of the sonographer's self-reported VR.

Audit dataset
Sonographers who had performed >100 TVS exams between January 2008 and January 2009 and who had reported a high rate of ovary visualisation (>89%) over this period were identified. The audit dataset was created by assigning a random number to the annual exams performed by each of the sonographers during this same period and then making a random selection for each sonographer based on the value of these numbers. Inclusion criteria were both ovaries reported as visualized and the examination classified as having normal morphology. Examinations were excluded if the corresponding images were not stored in the URA. All exams audited were performed using a Medison Accuvix (model XQ, software v1.08.02, transvaginal probe type EC4-9IS 4-9 MHz).

Audit methodology
Eight members of the UKCTOCS Ultrasound Subcommittee who were highly experienced in gynaecological scanning undertook the review. They included three consultant gynaecologists, two gynaecological radiologists and three National Health Service (NHS) superintendent grade sonographers. Originally there were nine experts but it subsequently transpired that one of the reviewers was also one of the seven sonographers being audited. Therefore, it was decided to remove this reviewer's results from the study. Accordingly, though these experts were initially split into three groups of three, one group was reduced to two experts following the exclusion of reviewer nine.
The audit dataset was randomly split such that each group reviewed 119 exams (total 357 exams) and each expert was asked to assess 17 exams performed by each of the seven sonographers. In this way, each exam was judged by at least two separate experts. In order to avoid bias each expert was blinded as to the name of the sonographer being reviewed and the assessment of the other experts.
The primary aim of the audit was to confirm the self-reported visualisation of both ovaries (cVR-Both) in examinations by each of the seven sonographers, which by extension required each expert reviewer to identify the exact images used to measure both ovaries from all of the images captured during the exam (mean 5.4, range 1-30). A software tool called osImageManager was developed specifically for the reviewers (Figure 2). It facilitated display of the images associated with each of the examinations and also recorded the review results in the audit database.

Statistical analysis
The baseline characteristics of the women are reported by trial centre code, age, years since last period, body mass index (BMI), hysterectomy status, oral contraceptive pill (OCP) and hormone replacement therapy (HRT) use. Information from the UKCTOCS sonographer accreditation records was used to calculate the mean, range and standard deviation of their collective experience. Their level of training and qualifications was also compared. Raw confirmed VR for each sonographer, each expert and overall were calculated for left ovary (LO) and right ovary (RO) as well as jointly for both LO and RO in the same examination. However, for formal inference we calculated the confirmed VR based on a statistical model.

Statistical modelling
All modelling was performed in Stata v14.2.
Model description. The data was analysed using a bivariate probit random effects model. The bivariate outcome was the experts' binary judgement of whether they confirmed the ovary as seen or not seen, for both LO and RO. Each of the LO and RO equations included a scan-specific random intercept term representing the dependence of judgements within each scan, rated by three (or two) expert reviewers. The LO and RO random effects were allowed to covary, as were the LO and RO error terms. In addition, the model had categorical fixed effects for the original sonographer (n=7) and the expert (n=8). The details of the model can be found in Supplementary File 1. The model was fitted in Stata 14.2 with the user-written command cmp 9 . Two additional models were fitted. Firstly, one that included the factor 'qualification' (gynaecologist, radiologist, sonographer) instead of the factor 'expert'; because 'expert' was fully nested within 'qualification', both terms could not be included together. Secondly, one in which the factor 'expert' was simply taken out, for reasons described in 'Predictions and correlations'.
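In latent-variable form, a model of this kind can be sketched as follows (the notation is ours and illustrative only; the exact parameterisation used is given in Supplementary File 1). For scan i, expert j and side s (left or right):

```latex
% Latent propensity of expert j to confirm visualisation of ovary s on scan i:
y^{*}_{ijs} = \mu_{s} + \beta_{\mathrm{son}(i),s} + \gamma_{j,s} + u_{is} + \varepsilon_{ijs},
\qquad
y_{ijs} = \mathbf{1}\left[\, y^{*}_{ijs} > 0 \,\right],

% where \beta and \gamma are the sonographer and expert fixed effects, and the
% scan-specific random intercepts and the residuals each covary across sides,
% with residual variances fixed at 1 for probit identification:
(u_{iL}, u_{iR})^{\top} \sim N(\mathbf{0}, \Sigma_{u}),
\qquad
(\varepsilon_{ijL}, \varepsilon_{ijR})^{\top} \sim
N\!\left(\mathbf{0}, \begin{pmatrix} 1 & \rho_{\varepsilon} \\ \rho_{\varepsilon} & 1 \end{pmatrix}\right).
```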
The use of this statistical model allowed us to simultaneously analyse all the data despite some scans being judged by a different number of experts. This included instances when only the LO or RO of a scan had been reviewed. By making use of model-based predictions, the model allowed us to assess the impact of each sonographer (or reviewer) whilst generalizing over the sample of reviewers (or sonographers) and volunteers, separately for LO and RO, but also for both ovaries in a joint manner. The raw proportions, summed over either sonographer or reviewer, fail to take into account the within-volunteer correlation. All joint significance tests of the parameters were Wald tests.

Predictions and correlations.
Stata's post-estimation command margins was used to make predictions based on the probit model parameters. Specifically, marginal probability predictions were made over the whole sample, and for each sonographer and expert for both equations (LO and RO). In addition, the joint probability of a positive outcome for both LO and RO was calculated by incorporating the estimated correlation of both the random intercepts and error terms. All marginal predictions were 'population-averaged' in that they were integrated over the value range of the random effects. Individual random effects were calculated using empirical Bayes means. Separate intraclass correlation coefficients (ICC) for LO and RO were calculated using the variance component estimates (see Supplementary File 1). The ICCs estimate the dependence between the dichotomous outcomes within the same volunteer, after taking into account the fixed effects. The ICC was also calculated based on a model with no 'expert' term, as its inclusion would provide an ICC that reflects within-scan correlation after adjusting for each expert's general propensity to confirm visualisation. Supplementary File 1 also describes the calculation of the correlation between the left and right ovary result for a given volunteer on a given review occasion, which is necessary for the joint probability estimation. Note that the correlations from a probit model are 'tetrachoric'; that is, the correlation of two theorised normally distributed continuous latent variables which produce the observed binary outcomes.
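As an illustration of the two calculations just described, the following sketch (using made-up inputs, not the fitted UKCTOCS estimates; the function names are ours) shows the latent-scale ICC for a probit random-intercept model and the joint probability that both ovaries are confirmed, obtained from two marginal probabilities and a tetrachoric correlation:

```python
# Two quantities from the probit model, with hypothetical inputs:
# 1) the latent-scale ICC, sigma_u^2 / (sigma_u^2 + 1), since the probit
#    residual variance is fixed at 1;
# 2) the joint probability of confirming both ovaries, computed as a
#    bivariate normal CDF over the two latent-variable thresholds.
from scipy.stats import multivariate_normal, norm

def probit_icc(sigma_u_sq: float) -> float:
    """Latent-scale ICC for a probit model with unit residual variance."""
    return sigma_u_sq / (sigma_u_sq + 1.0)

def joint_confirm_prob(p_left: float, p_right: float, rho: float) -> float:
    """P(both ovaries confirmed), given marginal confirmation probabilities
    and the tetrachoric correlation rho between the latent variables."""
    z_left, z_right = norm.ppf(p_left), norm.ppf(p_right)
    cov = [[1.0, rho], [rho, 1.0]]
    return multivariate_normal(mean=[0.0, 0.0], cov=cov).cdf([z_left, z_right])

# Hypothetical numbers: random-intercept variance 0.67, marginal
# confirmation probabilities 0.78 (LO) and 0.80 (RO).
print(round(probit_icc(0.67), 3))                     # 0.401
print(round(joint_confirm_prob(0.78, 0.80, 0.0), 3))  # 0.624
```

With zero correlation the joint probability is simply the product of the marginals (0.78 × 0.80 = 0.624); a positive tetrachoric correlation pulls the joint probability up towards the smaller marginal, which is why the estimated correlations matter for cVR-Both.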

Results
An audit dataset of 357 annual TVS exams from 349 women was produced by making a random selection of 51 exams performed by each of the seven UKCTOCS sonographers who had reported ovary visualisation rates >89% for the exams they had performed during the study period (1/1/08 to 31/12/08), irrespective of outcome (normal, abnormal or unsatisfactory). However, only examinations reported as having normal morphology were reviewed. Fifteen reviews were ineligible for various reasons.
The eight expert reviewers performed the image review at locations in Derby, Manchester, Bristol and London. They collectively spent approximately 100 hours conducting their audit of the work of the seven UKCTOCS sonographers. The sonographers had a mean experience of 14.5 years (range 7-23, SD 7). They operated in five different trial centres with two pairs of sonographers working in the same centre. All sonographers were accredited by UKCTOCS during 2008.
The 349 women whose exams were included in the audit dataset had a mean age of 60.0 years (range 50.2-73.3, SD 5.85), mean age at last period of 49.3 years (range 27.9-70.0, SD 5.66), mean BMI of 26.2 (range 17.5-45.1, SD 4.17), use of HRT at recruitment of 24.9%, ever use of OCP of 64.7% and a history of hysterectomy in 12.4%.

Model results
In total the model fitted 1871 ultrasound scan assessments, formed from 940 LO scans and 931 RO scans, resulting in 945 scans in which at least one ovary was included. The fixed effects of both sonographer and expert were highly significant for either left or right ovary (joint p<0.0001 in all cases, Table 1). As expected, the fitted predictions for LO or RO separately were close to the raw proportions over the same sample (see Table 2) because the design was (largely) balanced and the predictions did not include an adjusting variable. The overall LO prediction was 0.78 (95% CI: 0.75-0.81), but by sonographer this ranged from 0.65 to 0.89. By reviewer, the range was from 0.59 to 0.93. For RO, predicted probabilities were typically higher; the overall prediction was 0.80 (95% CI: 0.77-0.83), sonographer predictions ranged from 0.62 to 0.97 and reviewer predictions ranged from 0.66 to 0.94. Not all sonographer or reviewer rank orderings were the same for LO and RO; for example, reviewer 7 was the lowest for LO and reviewer 5 for RO. This was in contrast to the raw proportions, where reviewer 7 gave the lowest percentage of confirmations for both LO and RO. In a separate model where expert was replaced by 'qualification', sonographers had significantly higher confirmed VR for both LO (β=0.74, 95% CI: 0.38-1.10) and RO (β=0.86, 95% CI: 0.40-1.32) compared to gynaecologists (Table 1). Radiologists also had higher confirmed VR than gynaecologists, but this was only significant at the 5% level for LO. The mean cVR-Both obtained using the model was 67.2% (95% CI: 63.9-70.5%), ranging from 47.6% to 86.5% (Table 2).

Discussion
Our audit suggests that sonographers' self-reported visualization rates of postmenopausal ovaries judged to have normal morphology are unreliable. Our study was facilitated by the unique TMS and URA systems employed in UKCTOCS, which permitted a retrospective review of the images and measurements recorded by the sonographer. It could be argued that the static images used for this audit represent a snapshot of a continuous pelvic examination, so might not truly represent what was seen by the sonographer. Nevertheless, these static images were used to measure the ovaries, so the structure marked by the callipers was definitely considered to be an ovary by the sonographer.
We analysed the data using a statistical model that accounted for the correlated structure of the data, between left and right ovary scans, and between the same scan viewed by different experts. Normality was assumed for the underlying latent variable ('propensity to confirm visualisation') and for the distribution of the ovary-specific volunteer random effects. The model gave predictions on the probability scale that differed only slightly from the raw proportions, due to the nature of the study design. One clear benefit of using a statistical model with random effects is that all the data could be analysed together, producing variance component estimates that allow the calculation of ICCs. The value of the ICC was higher for the right ovary than the left, though not significantly different, and both were modest: 0.40 for LO and 0.51 for RO when excluding the expert term from the fixed effects, the only variable that varied over each scan's repeated assessments. Hence the ICC is a measure of inter-rater (expert) agreement, and suggests that although there is moderate concordance, the experts cannot be relied upon to replicate each other's judgements. However, such lack of agreement in respect of each individual scan does not change the overall conclusion of the audit in terms of the unreliability of the sonographers' self-reported visualization rates.
We have previously reported on the Quality Control (QC) of UKCTOCS TVS scanning with similar exam selection criteria (ovaries were seen and normal) 7 . A single expert reviewed 1000 randomly chosen TVS examinations which had been performed by 96 sonographers. The expert's cVR-Both was 50%, compared to the 100% VR self-reported by the sonographers for these examinations. This result is broadly consistent with the results reported in this study for the group of seven sonographers, with a mean cVR-Both of 67.2%. The significant variation across sonographers in cVR-Both for normal postmenopausal ovaries is probably due to differences in sonographer ability and the subjective nature of this examination; a supposition supported by findings reported by Sharma et al. 8 .

Limitations of the study
Intra-observer reproducibility was not addressed, so the capability of individual experts to provide consistent results for the same exams was not measured. The study design was generally balanced, and potential confounders that might affect visualization should be expected to be evenly distributed across experts due to the randomization process. However, it is conceivable that these confounders may not be balanced across sonographers, due to potential geographical differences in their distribution. Although this was not a major concern, such factors could have been absorbed into the model to produce sonographer predictions conditional on an equal covariate distribution.

Conclusion
The results of this audit confirm that the visualization of normal postmenopausal ovaries by seven 'high performing' sonographers, as assessed by eight experts, could not be considered reliable, given that in almost a third of their examinations a structure other than an ovary had been mistakenly measured for at least one of the ovaries. However, individual sonographer performance varied significantly, from 47% to 87% cVR-Both. These results show that it is possible for some sonographers to correctly visualize both ovaries when scanning a range of postmenopausal women, raising the possibility that other sonographers might achieve similar results if supported by a suitable quality improvement programme.
This audit highlights the problem of sonographers routinely mistaking other features, such as bowel, for ovaries when scanning postmenopausal women. It also highlights the difficulties of providing effective Quality Control (QC) for such scans in a large-scale screening programme. Specifically, it shows that undertaking the type of expert review conducted in this study for a substantial number of sonographers on a regular basis would not be feasible without creating dedicated teams specializing in normal ovary identification from TVS images of postmenopausal women. Therefore there is a need for further research to explore how independent and reliable QC metrics for TVS might be obtained by other means, for example by the automated analysis of TVS scan images, both static and video. Recent advances in machine learning research, particularly in the area of deep neural networks, suggest it might soon be viable to construct a system able to determine sonographer VR from a collection of images captured during a series of TVS examinations. Indeed, the use of such deep learning techniques in the gathering of quality metrics from obstetric ultrasound images is already showing some promise 10 .
The work done by the UKCTOCS group on the QC of TVS scanning seeks to improve understanding of challenges associated with performing screening for ovarian cancer on a large scale and at multiple centres. All previous studies of ultrasound screening of postmenopausal ovaries for the early detection of cancer (excepting the recent QC study by our group) have accepted the self-reporting of ovarian visualisation rates as accurate. This is the first published audit of self-reporting of ovarian visualization rates and the results cause us to question the reliability of this metric, particularly for QC purposes.

Ethics approval

Referee report
A comprehensive quality assurance program for TVUS was undertaken. This current paper is the result of an audit, ordered by the trial's Ultrasound Management Subcommittee, of seven sonographers reporting rates of visualizing both ovaries of >89% after an accreditation program run by UKCTOCS in 2008. Eight experts reviewed 357 archived, centrally stored static images that also had measurement markers recording ovary dimensions. The mean visualization rate upon review for both ovaries fell to 67.2%, with a range of 47.6% to 86.5%. The range between the experts was 47.3% to 88.3%. Agreement among expert reviewers was also modest. The authors conclude that further research is needed to develop reliable quality control metrics for transvaginal ultrasound.
The trialists are to be commended for the design and conduct of this impressive ovarian cancer screening trial, which shows some evidence of stage-shift and mortality reduction, though only after many rounds and long follow-up in a large group of women. The quality assurance plans in place to train and monitor TVUS results are impressive. While the technology dates to 2008-2009 and highly selected experts might do better, the results of this audit strike a cautionary note about the rate of probably unavoidable non-visualization of ovaries with TVUS in postmenopausal women. They call for expanded efforts in automated analysis with machine learning. Improved technologies or biomarkers are also needed to realize the promise of lowering ovarian cancer mortality with early detection.
It is unclear how much expert reviewers knew about methods for assembling the study set and other details, which may have influenced interpretations. The study did not include a random sample of scans from the trial to serve as "distractors" or to provide a reference for comparison, and reliability data were not compared with external standards, such as measurements of ovaries that may have been removed later, CA 125 levels or rare cancer outcomes. External reviewers who were uninvolved with UKCTOCS may be of interest and possibly could be achieved via a web-based approach, at least for a subset.
This study in combination with prior reports from UKCTOCS (e.g. Stott et al and Sharma et al) provides a composite picture of the performance of ultrasound in the trial; however, the generalizability of the current study is difficult to assess, given the unusual method of scan selection and the engagement of reviewers who were intimately knowledgeable about the trial and perhaps aware of the design of this project. Irrespective of these concerns, the data from UKCTOCS suggest that ultrasound of normal ovaries among older women has limitations.
Given that reviewers were experienced and specifically trained for the task at hand, there are additional unknowns about whether and how performance could be improved, and how much of reviewers' performances reflect inherent limitations of ultrasound for assessment of ovaries and ovarian cancer screening. Bodelon et al reported a high frequency of non-visualization of ovaries in the Prostate, Lung, Colorectal, and Ovarian (PLCO) screening trial, with a tendency for individual women to have repeated non-visualization. Further, although non-visualization is likely a marker of smaller ovarian size on average, it is notable that non-visualization conferred at best a marginally reduced risk of developing ovarian cancer in PLCO. Analysis of serial ovarian volumes in PLCO suggested that enlargement occurs rapidly within one to two years of cancer detection, and therefore, would be unlikely to have meaningful impact on clinical outcomes.
In a narrow sense, if ultrasound is to be used for ovarian cancer screening, then a better quality control metric than the frequency with which ovaries are visualized is needed. In a broader sense, this study and related literature call into question whether ultrasound imaging is useful in ovarian cancer screening, especially for high-grade serous carcinomas. To date, ovarian cancer screening with ultrasound and CA-125 has failed to achieve a reduction in ovarian cancer mortality. Although unproven, growing evidence points to an origin of many high-grade serous carcinomas, the most frequent lethal type of ovarian cancer, in the distal fallopian tube (fimbria) rather than in the ovarian surface epithelium. In contrast, other ovarian cancers (i.e. endometrioid and clear cell) may arise from endometriosis in the ovary and tend to remain organ-confined for lengthier periods (presenting as stage I). Animal models of tubal cancer have shown that spread to the ovaries may accelerate disease progression (Perets et al.), but many questions remain about the pathogenesis of serous cancers among women, including the sojourn time of disease development and the role of the ovary in promoting metastatic spread. These fundamental questions raise larger issues about the role of assessing the ovary as part of cancer screening and the potential of ultrasound to identify cancers at an early, curable stage. Gaps in knowledge of the pathogenesis of ovarian cancer, especially high-grade serous cancers, pose a barrier to improved early detection. Larger issues could be addressed to place the results of the study in context.

Author response
We thank Prof Sherman for his review, but seek clarification so that we might improve our paper. We note that he does not consider our conclusions to be adequately supported by our results.
Does he believe our results show that reports from an individual sonographer about her own visualisation of the ovary produce more reliable quality control metrics than the combined judgement of a team of eight experts reviewing the images the sonographer used to measure the ovaries, taken from a random sample of her scans over a year? If so, we should be grateful if he would identify the data in our results that support such a conclusion.

Is the study design appropriate and is the work technically sound? Yes
Are sufficient details of methods and analysis provided to allow replication by others? Yes

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Yes
Competing Interests: No competing interests were disclosed.

We have read this submission. We believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.