Bayes Lines Tool (BLT): a SQL-script for analyzing diagnostic test results with an application to SARS-CoV-2-testing

The performance of diagnostic tests crucially depends on the disease prevalence, test sensitivity, and test specificity. However, these quantities are often not well known when tests are performed outside defined routine lab procedures, which makes the rating of the test results somewhat problematic. A current example is the mass testing taking place within the context of the world-wide SARS-CoV-2 crisis. Here, for the first time in history, laboratory test results have a dramatic impact on political decisions. Therefore, transparent, comprehensible, and reliable data are mandatory. It is in the nature of wet lab tests that their quality and outcome are influenced by multiple factors, such as handling procedures, underlying test protocols, and analytical reagents, which reduce their performance. These limitations in sensitivity and specificity have to be taken into account when calculating the real test results. As a resolution method, we have developed a Bayesian calculator, the Bayes Lines Tool (BLT), for analyzing disease prevalence, test sensitivity, test specificity, and, therefore, true positive, false positive, true negative, and false negative numbers from official test outcome reports. The calculator performs a simple SQL (Structured Query Language) query and can easily be implemented on any system supporting SQL. We provide an example of influenza test results from California, USA, as well as two examples of SARS-CoV-2 test results from official government reports from The Netherlands and Germany-Bavaria, to illustrate the possible parameter space of prevalence, sensitivity, and specificity consistent with the observed data. Finally, we discuss this tool's multiple applications, including its putative importance for informing policy decisions.


Introduction
In December 2019, a cluster of patients with pneumonia of unknown origin was associated with the emergence of a novel beta-coronavirus, 1 first named 2019-nCoV 2 and later specified as severe acute respiratory syndrome-coronavirus-2 (SARS-CoV-2). 3 This outbreak led to the rapid development of reverse transcriptase-quantitative polymerase chain reaction (RT-qPCR) tests to identify SARS-CoV-2 RNA in specimens obtained from patients. 2,4 After sporadic SARS-CoV-2 positive cases in January 5,6 to the end of February 2020, worldwide cases of the SARS-CoV-2-associated disease 'COVID-19' began to accumulate, causing policymakers in many countries to introduce countermeasures. These non-pharmaceutical interventions predominantly started worldwide around March 2020, and the outbreak was characterized as a pandemic on 11 March 2020. 6,7 As a result, for almost two years now, large parts of the world have been in a COVID-19 crisis-mode with daily reporting of SARS-CoV-2 cases in dashboards worldwide. 8 The definition of 'cases' and 'prevalence estimates' was based on RT-qPCR testing, independent of the clinical diagnosis. Thus, a person is considered a case (i.e., infected) once a test turns out positive. 9 Like all laboratory tests, however, the SARS-CoV-2 RT-qPCR tests are not flawless. This is because sensitivity and specificity depend on a multiplicity of confounding factors. These factors cover the test design, the lab application, and possible contaminations with substances/nucleic acids interfering with the reaction. 10,11 Consequently, both false-negative and false-positive results have been reported. 12,13 Nevertheless, the test system's limitations are rarely discussed in scientific publications and public health systems despite their crucial role for making inferences about the possible infection status of a tested person. 14
Many more or less defined commercial and laboratory 'in house' tests are now routinely being used, 15 often without standardised guidelines, which leads to entirely unknown test performance specifications. 16 The few studies aiming to estimate sensitivity and specificity of SARS-CoV-2 RT-qPCR tests have reported sensitivities and specificities in the ranges ≳30% and ≳80%, respectively; therefore, the communicated data can seldom offer precise distinctions. 14 Given the critical role that dashboards and graphs based on SARS-CoV-2 test results play for policymakers, health professionals, and the general public, 8 our objective was to develop a Bayesian calculator that could calculate test quantities and prevalence solely based on officially reported numbers of total and positive tests, i.e., without making any a priori assumptions. In this way, time trend estimates and country-to-country comparisons of these test performance measures as well as disease prevalence estimates become possible, producing in-depth insights, making projections/simulations possible, and providing a more holistic understanding of the daily incoming data in general.

General description of the calculator
The Bayes Lines Tool (BLT) calculator is based on Bayes' theorem and estimates the true and false positive, and true and false negative numbers at a given time point for which the total number of tests performed and the number of positive test results are known. These data are usually reported and published by official government bodies daily and/or weekly. Thus, the model uses the following information:
• Publishing date or report identifier of the test data
• Number of performed tests (#tests)
• Number of reported positive results (#positives)
The model takes this information as a given fact and uses it to make inferences about the test performance parameters (sensitivity and specificity) as well as the prevalence (also known as the base rate); these inferences are essential for estimating the number of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). It is assumed that there is no knowledge of either the prevalence or the sensitivity and specificity of the tests used. Instead, the model explores all possible combinations of two of these three parameters within reasonable ranges specified by the user; for each of these combinations, the third parameter can then be calculated using the dependencies through Bayes' theorem. Finally, all parameter combinations that result in TP+FP estimates consistent with the known number of positive tests are selected and stored as confusion matrices.

REVISED Amendments from Version 2
The new version includes some further explanations about the usage of the BLT calculator and how results should be interpreted.
Any further responses from the reviewers can be found at the end of the article.

A single confusion matrix contains TP, FP, TN, and FN in absolute numbers (Table 1). For a given prevalence, sensitivity, and specificity these are derived from Bayes' theorem:

$$P(I \mid T) = \frac{P(T \mid I)\,P(I)}{P(T)} \tag{1}$$

Here, $T$ denotes the hypothesis that a test comes out positive ($\neg T$ its denial) and $I$ the hypothesis that an individual is infected, so that $P(I)$ is the prevalence and $P(T \mid I)$ is the test sensitivity. $P(T)$ is the marginal probability of a positive test, which we estimate as the frequency of positive test results, whereas $P(I \mid T)$ is the probability of being infected given that the test came out positive. With the normalizing constant $P(T)$ estimated as $P(T) = \frac{\text{\#positives}}{\text{\#tests}}$ and $P(I \mid T)$ estimated as the proportion of infected individuals among those in which the test came out positive, i.e. $\frac{\mathrm{TP}}{\text{\#positives}}$, equation (1) becomes:

$$\mathrm{TP} = P(I)\,P(T \mid I)\cdot\text{\#tests} = \text{prevalence} \times \text{sensitivity} \times \text{\#tests} \tag{2}$$

Equation (2) thus shows that the number of TPs depends on the prevalence, test sensitivity and total number of tests performed. Using $P(\neg T \mid \neg I)$ = specificity and #negatives = #tests − #positives, an analogous derivation leads to

$$\mathrm{TN} = (1 - \text{prevalence}) \times \text{specificity} \times \text{\#tests} \tag{3}$$

From Equations (2) and (3), FP and FN follow as

$$\mathrm{FP} = \text{\#positives} - \mathrm{TP} \tag{4}$$

$$\mathrm{FN} = \text{\#tests} - \text{\#positives} - \mathrm{TN} \tag{5}$$
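As an illustration of how one parameter combination translates into a confusion matrix, the following minimal Python sketch applies the relations described above: TP as prevalence × sensitivity × #tests, TN as (1 − prevalence) × specificity × #tests, and FP and FN as the remainders of the reported positives and negatives. It is not the authors' SQL code; the names `ConfusionMatrix` and `confusion_matrix` are hypothetical, and the formulas are our reading of the derivation in this section.

```python
from dataclasses import dataclass


@dataclass
class ConfusionMatrix:
    tp: float
    fp: float
    tn: float
    fn: float


def confusion_matrix(prevalence, sensitivity, specificity, n_tests, n_positives):
    """Build one confusion matrix (absolute numbers) for one parameter set.

    Assumed reading of the text: TP and TN come from Bayes' theorem,
    FP and FN are whatever remains of the reported positives/negatives.
    """
    tp = prevalence * sensitivity * n_tests            # true positives
    tn = (1.0 - prevalence) * specificity * n_tests    # true negatives
    fp = n_positives - tp                              # rest of the positives
    fn = n_tests - n_positives - tn                    # rest of the negatives
    return ConfusionMatrix(tp=tp, fp=fp, tn=tn, fn=fn)


# Illustrative input only: 10,000 tests, 2,900 positives,
# prevalence 20%, sensitivity 85%, specificity 85%.
example = confusion_matrix(0.20, 0.85, 0.85, 10_000, 2_900)
print(example)
```

Note that an arbitrary parameter triple can produce negative FP or FN values; only combinations consistent with the reported numbers yield a meaningful matrix, which is why the selection step described in the text is needed.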

Implementation
For the implementation presented here, the two parameters that are varied are as follows:
• Sensitivity from 0.005 to 1 with 0.005 increments.
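For a given sensitivity and specificity as well as number of tests and positives, the prevalence can then be computed as

$$\text{prevalence} = \frac{\frac{\text{\#positives}}{\text{\#tests}} + \text{specificity} - 1}{\text{sensitivity} + \text{specificity} - 1} \tag{6}$$

This follows from requiring that the expected fraction of positive tests, prevalence × sensitivity + (1 − prevalence) × (1 − specificity), equals #positives/#tests. Calculations for combinations of sensitivity and specificity that add to ≤1 are omitted, and cases in which prevalence turns out negative or larger than 1 are discarded as unphysical.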
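The enumeration described here can be sketched in a few lines of Python. This is a hedged illustration, not the published SQL query: the function names are hypothetical, the prevalence formula is the one obtained by equating the expected positive fraction with #positives/#tests, and the specificity grid is assumed to use the same 0.005 increments that the text states for sensitivity.

```python
def implied_prevalence(sensitivity, specificity, n_tests, n_positives):
    """Return the prevalence implied by one sensitivity/specificity pair.

    Returns None when sensitivity + specificity <= 1 or when the result
    falls outside [0, 1], mirroring the exclusions described in the text.
    """
    denominator = sensitivity + specificity - 1.0
    if denominator <= 0.0:
        return None
    prevalence = (n_positives / n_tests + specificity - 1.0) / denominator
    if not 0.0 <= prevalence <= 1.0:
        return None
    return prevalence


def enumerate_confusion_matrices(n_tests, n_positives, steps=200):
    """Scan a sensitivity x specificity grid and keep the valid solutions.

    steps=200 gives increments of 0.005 from 0.005 to 1 (stated for
    sensitivity in the text; assumed here for specificity as well).
    """
    solutions = []
    for i in range(1, steps + 1):
        for j in range(1, steps + 1):
            if i + j <= steps:  # sensitivity + specificity <= 1, tested on integers
                continue
            sens, spec = i / steps, j / steps
            prev = implied_prevalence(sens, spec, n_tests, n_positives)
            if prev is None:
                continue
            tp = prev * sens * n_tests
            tn = (1.0 - prev) * spec * n_tests
            solutions.append({
                "prevalence": prev, "sensitivity": sens, "specificity": spec,
                "TP": tp, "FP": n_positives - tp,
                "TN": tn, "FN": n_tests - n_positives - tn,
            })
    return solutions


# Illustrative input only: 10,000 tests with 2,900 positives.
matrices = enumerate_confusion_matrices(10_000, 2_900)
print(len(matrices), "parameter combinations are consistent with the report")
```

Testing the exclusion `i + j <= steps` on integer grid indices rather than on floating-point sums is a design choice of this sketch; it avoids spurious near-zero denominators caused by rounding.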
We developed an SQL query that generates all possible Bayesian confusion matrices for a series of diagnostic test results, without making assumptions about prevalence, sensitivity, or specificity.
The code in PostgreSQL is given as follows (Code 1):

    with tests as (
      select :reg :: text as region_name,
             :rid :: text as report_id,

Given the test results published in the databases and given all generated permutations and consequently all possible confusion matrices, only those are returned that match the positive test results. With only the resulting confusion matrices for which TP+FP match the positives reported in the input data, we are able to identify patterns that provide additional insights for further investigation.
In order to produce confusion matrices for a series of reports, such as daily test result numbers, several approaches are possible. In this manuscript we describe a practical application using a batch/script approach. The script is run on Apple OSX; the example below uses COVID-19 data from the Netherlands (Code 2):

Results
In the following section examples are provided that demonstrate the application of our calculator for the data referenced in Section 2.3.

A hypothetical scenario
Consider the following hypothetical scenarios displayed in Table 2 that we used for a general check of BLT's performance. In scenarios 1 and 2, we consider a disease which has a prevalence of 20% in two different subpopulations (e.g. young and old people, respectively). The prevalence was chosen for illustrative purposes only; in most real-world situations, much lower disease prevalence values would be encountered. Each subpopulation has its own test characteristics: In subpopulation 1, test sensitivity is 95% and specificity 75%, while in subpopulation 2, sensitivity is 75% and specificity 95%. Consider that 10,000 tests have been performed in the total population. In scenario 1, the total population consists of an equal mix of both subpopulations, while in scenario 2 the total population consists of 75% subpopulation 1. The different mixture of subpopulations leads to a different number of positive test results, and hence a different input for BLT. The overall test performance measures (sensitivity and specificity) are a weighted average between the subpopulation test performance measures. This is called the spectrum effect. 17 Now consider a different scenario, in which the total population is a mix between two subpopulations with different susceptibility towards the disease, and hence different prevalence, but the test performs equally well in both subpopulations.
In scenario 3, each subpopulation contributes 50% to the overall population, while in scenario 4, the less susceptible population contributes 80% (8,000 tests). Now the overall prevalence is the weighted average of the subpopulation prevalence values, and overall test sensitivity and specificity are equal to those of the subpopulations.  Table 2 is also visible in Figure 1, as it translates into the percentages of TPs, TNs, FPs and FNs. What is critical is the fact that BLT, which only works with the total number of tests and positives obtained, would not be able to distinguish between scenarios 1, 3 and 4. All three are compatible with the output set of confusion matrices. One should thus keep in mind for the interpretation of BLT's output that the solution corresponding to reality is determined by the mix of subpopulations being tested, which in turn might have their own specific subpopulation prevalence, sensitivity and specificity values. In other words, one should be aware of the spectrum effect. 17,18 If possible, one should thus use knowledge about prevalence and test performance measures to filter out the confusion matrices consistent with what is known about "the reality".
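The spectrum effect in scenarios 1 and 2 can be checked with a short calculation. The sketch below is illustrative only: it pools expected counts over the two subpopulations using the parameters stated in the text (prevalence 20%, sensitivity/specificity of 95%/75% and 75%/95%, 10,000 tests in total). The function names are hypothetical, and the numbers are computed from those stated parameters rather than copied from Table 2.

```python
def pooled_counts(subpopulations):
    """Sum expected confusion-matrix counts over subpopulations.

    Each subpopulation is a tuple
    (n_tests, prevalence, sensitivity, specificity).
    """
    tp = fp = tn = fn = 0.0
    for n_tests, prevalence, sensitivity, specificity in subpopulations:
        infected = n_tests * prevalence
        not_infected = n_tests - infected
        tp += infected * sensitivity
        fn += infected * (1.0 - sensitivity)
        tn += not_infected * specificity
        fp += not_infected * (1.0 - specificity)
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


def summarize(counts):
    """Derive the overall quantities that a pooled report would reflect."""
    infected = counts["TP"] + counts["FN"]
    not_infected = counts["TN"] + counts["FP"]
    return {
        "positives": counts["TP"] + counts["FP"],
        "prevalence": infected / (infected + not_infected),
        "sensitivity": counts["TP"] / infected,
        "specificity": counts["TN"] / not_infected,
    }


# Scenario 1: equal mix of both subpopulations.
scenario_1 = summarize(pooled_counts([(5_000, 0.20, 0.95, 0.75),
                                      (5_000, 0.20, 0.75, 0.95)]))
# Scenario 2: 75% of the tests come from subpopulation 1.
scenario_2 = summarize(pooled_counts([(7_500, 0.20, 0.95, 0.75),
                                      (2_500, 0.20, 0.75, 0.95)]))
print(scenario_1)
print(scenario_2)
```

Under these assumptions the equal mix gives about 2,900 expected positives with an overall sensitivity and specificity of 85% each, while the 75/25 mix gives about 3,400 positives with 90% sensitivity and 80% specificity, i.e., the weighted averages of the subpopulation values.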
3.2 California/USA (diagnostic Influenza-testing)
Figure 2 shows the results of applying BLT to weekly influenza test data from the Californian Bay Area, USA. The upper panel displays the number of positive tests reported over time, where the estimated number of TPs is overlaid in small dots (confusion matrices) whose color represents the estimated prevalence (see legend on the right of Figure 2). Filters have been applied on specificity (95.0%-100.0%) and sensitivity (80.0%-100.0%). One can see that the number of TP tests is close to the number of positives reported, except for some deviations during the spring and summer months when prevalence was estimated correctly as low.
The lower panel shows the positive predictive value (PPV) for each confusion matrix, defined as $\mathrm{PPV} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$, which confirms a high accuracy of the tests: the median PPV of all confusion matrices over time was almost 90%.
It can be observed that in contrast to the influenza example (Figure 2), the PPVs are now much lower, with a median around 50%. For this estimation, no filters were applied on sensitivity, specificity or prevalence. When a posteriori knowledge is available about the diagnostic tests and/or the circumstances in which they were performed, different scenarios can be applied to the output. This is exemplarily visualized in Figure 4, in which some reasonable filters for a SARS-CoV-2 testing environment have been applied. Notice how the PPV started to increase sharply from a median around 50% before mid-September 2020 to 80-90% during the fall and winter.
Finally, Figure 5 shows the negative predictive value (NPV) for the Netherlands data with similar filters as in Figure 4, except for choosing a less optimistic sensitivity range of 60-80%, which is consistent with some clinical data. It can be noticed that NPV remains relatively high throughout the entire time range. Median NPV over time does not drop below 90%, even after reducing the range for sensitivity to as low as 60-80%. We also tested the impact of this lower sensitivity range on the PPV, but could not detect any visible impact, consistent with the finding that low-specificity tests cannot distinguish between the hypotheses that a positively tested individual is infected with SARS-CoV-2 or not, regardless of sensitivity. 14
Figure 6 shows the output of BLT applied to weekly SARS-CoV-2 testing data from Bavaria in Germany. The thick grey line displays the number of positive tests reported over time, while the colored batches show the solutions of BLT for the TP numbers according to prevalence. Note that in low prevalence scenarios, the TPs usually do not come close to the reported number of positives. At the end of the summer, the prevalence values compatible with the official test reports suggested low prevalence, but also a discrepancy between the number of positive tests and TPs, suggesting a large number of FPs.

Discussion
The developed Bayesian calculator tool allows the estimation of possible values for the essential variables prevalence, sensitivity, and specificity for a specific period of time (e.g., daily or weekly, depending on the input data the user supplies). The solutions provided by BLT are derived from Bayes' theorem (Equation 1) under the assumption that $P(T) = \frac{\text{\#positives}}{\text{\#tests}}$ and $P(I \mid T) = \frac{\mathrm{TP}}{\text{\#positives}}$. In cases of low total and positive test numbers, these assumptions might not hold exactly, but BLT should nevertheless find close solutions to the actual test performance measures. As our applied examples show, the strength of BLT lies in its application to mass testing scenarios such as those conducted during the SARS-CoV-2 crisis.
The BLT calculations are unbiased in the sense that they use all possible and sensible combinations of prevalence, sensitivity, and specificity, and let Bayes' theorem decide which combinations match the actually observed data. The result for a given matching combination of these three particular parameters is provided in the form of a confusion matrix which contains the TP, TN, FP, and FN numbers. In the case where more than one combination is compatible with the given input data, the user may start simulating different scenarios, e.g., by applying prior knowledge regarding the expected prevalence range on a given date and test sensitivity and specificity estimates. This enables the user to further constrain the combinatorial possibilities of the output variables. For example, if disease prevalence in our hypothetical examples given in Figure 1 would have been known to range around 20%, lower and upper bounds for the TP, TN, FP, and FN percentages could be readily obtained from this graph. Thus, one would learn that a positive test result should not be trusted with high probability, but a negative test result would be very reliable. It is important to emphasize that there is no "wrong" output of the BLT calculator, since the output logically follows from the laws of probability; it is the responsibility of the user to decide which output possibilities best apply to the real situation under which the test had been performed.
Prevalence is a crucial factor for any inferences based on diagnostic tests, even though it is often not taken into account in practice. This results in the so-called base-rate fallacy. 19 Our calculator may result in several possible prevalence values that are compatible with the observed data. In this case, knowledge about the population that has been tested should be used to constrain the possibilities. In 2020, for instance, prevalence values in the range 12-15% were estimated for German hotspot regions, 20,21 while prevalence was zero in an asymptomatic German mother-and-child population tested in April 2020. 22 In an early COVID-19 related publication which compared RT-qPCR to chest computed tomography (CT) in 1014 COVID-19 patients from the Tongji hospital in Wuhan, China, prevalence appeared to be very high: in total 830 patients were described to be confirmed or highly likely to have COVID-19, and of those 580 were diagnosed by chest CT and RT-qPCR and another 250 by CT and clinical decision. These results suggest a prevalence of 81.9% in these patients. A preprint publication 23 aimed at estimating the sensitivity and specificity of the Chinese RT-qPCR tests by a Bayesian model incorporating information from both chest CT and clinical decision classification. The author obtained a sensitivity of 0.707 (95% CI range: 0.668-0.749) and a specificity of 0.851 (95% CI range: 0.774-0.941). Applying BLT to these data and assuming that only the cases in which both chest CT and RT-qPCR came out positive are TPs (i.e., filtering on 580 TPs), our model reveals a sensitivity of 65.3% and a specificity ranging from 83.1% to 83.6%, not too different from the estimates of the more complex analysis. 23 During the SARS-CoV-2 crisis an unprecedented mass testing not only of symptomatic, but also asymptomatic cases emerged as a strategy. One would expect the prevalence to be substantially higher in the former than in the latter population.
As our scenarios 3 and 4 from section 3.1 show, if there is a mixture of two populations with very different prevalence values, the resulting overall prevalence is a weighted average, provided that the sensitivity and specificity of the tests are similar in both populations.
Our results display the known dependence of a test's predictive value on the disease prevalence. For example, the World Health Organization (WHO) stated "that disease prevalence alters the predictive value of test results; as disease prevalence decreases, the risk of false positive increases". 24 This means that the probability that a person who has a positive result (SARS-CoV-2 detected) is truly infected with SARS-CoV-2 decreases as prevalence decreases, irrespective of the claimed specificity of the test system. 24 This statement may be more accurately described as the number of TPs decreasing relative to a constant FP rate, so the 'risk of false positives' only increases relative to the TP numbers, while the FP frequency is assumed to remain constant across a given number of tests. However, multiple modes of error may be in play. We should not assume FPs are independent of contamination from TP samples. There are higher risks of contamination in rapidly growing laboratories. Contamination of samples in the low disease prevalence seasons (summer) will go unnoticed as they do not produce a qPCR signal. Contamination-prone methods may only become evident in the form of elevated and perhaps falsely assumed TPs once the disease prevalence increases in the winter.
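The dependence of the predictive values on prevalence discussed in this paragraph can be illustrated numerically. The following sketch is not part of BLT; it simply evaluates the standard PPV and NPV definitions for a fixed test at several prevalence values. The function name `predictive_values` and the chosen sensitivity (80%) and specificity (98%) are assumptions made for illustration only.

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Return (PPV, NPV) for one prevalence/sensitivity/specificity triple.

    Works with fractions of the tested population, so no test counts
    are needed. Returns NaN for a value whose denominator is zero.
    """
    tp = prevalence * sensitivity
    fn = prevalence * (1.0 - sensitivity)
    tn = (1.0 - prevalence) * specificity
    fp = (1.0 - prevalence) * (1.0 - specificity)
    ppv = tp / (tp + fp) if tp + fp > 0 else float("nan")
    npv = tn / (tn + fn) if tn + fn > 0 else float("nan")
    return ppv, npv


# Same hypothetical test, decreasing prevalence.
for prevalence in (0.20, 0.05, 0.01, 0.001):
    ppv, npv = predictive_values(prevalence, sensitivity=0.80, specificity=0.98)
    print(f"prevalence={prevalence:.3f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

With the false positive fraction among non-infected individuals held constant, the PPV falls as prevalence falls while the NPV rises, which is the arithmetic content of the WHO statement quoted above.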
In light of the above WHO statement, the rationale for mass testing strategies implemented during periods of low prevalence (e.g., summer) appears questionable. Furthermore, mass testing increases the risk of poor sample handling and laboratory contamination, which might partly explain the high FP numbers our calculator predicts. For example, Patrick et al. argued that besides intrinsic test performance, amplicon contamination due to high throughput processing of samples within a laboratory would be the best explanation for an increased rate of FP detections made during an outbreak of the human coronavirus HCoV-OC43 in a Canadian facility. 25 While much attention has been placed on population frequency of disease and its impact on false positives, it is critical to understand the role of false negatives and the impact these can have on track and trace systems. The nasal swabs are known to vary tremendously in RNaseP Ct values, suggesting highly variable sampling or limited RNA stability in the testing reagent chain. 26 Woloshin et al. demonstrate 27-40% FNs with nasopharyngeal and throat swabs, respectively, and underscore the importance of understanding pre-test probabilities when interpreting qPCR results. 27 These FN numbers are probably not due to the PCR itself, but are related to handling issues and the above discussed problems, as well as the time point within the course of infection at which the sample is taken. In a meta-analysis of clinical data, Kucirka et al. found that the probability of a FN test was 100% at day 1 of an infection with SARS-CoV-2 (prior to symptom onset), and then decreased to 38% (95% credible interval 18-65%) at the day of symptom onset down to its minimum of 20% (12-30%) three days after symptom onset, after which it rose again to 66% (54-77%) three weeks after the infection. 28
Hence, according to these numbers, even in infected individuals sensitivities below 30% are possible, a range that we excluded in our analysis consistent with Klement and Bandyopadhyay. 14 This points to additional problems when testing asymptomatic individuals, because if they are truly infected, a high number of FNs is going to result.
With the script presented here, we can think of many variations when it comes to the range of sensitivity and specificity, their step-sizes (granularity) and the 'where' clause as well as the strictness of matching TP+FP against the reported positives. For example, one could also increment prevalence on a log-scale to account for the fact that prevalence is very low in many disease settings. 14 We are aware that choices made in these areas have a significant impact on the number of matching confusion matrices. An impact/sensitivity analysis was not performed, although we suspect that such an analysis might reveal additional insights. However, we think that the number of matching confusion matrices per result that the above query produces delivers sufficient material to make useful observations. Future research with different data repositories, for instance ECDC/TESSy data, would be very beneficial to identify a solid balance between precision (step-size in the permutations), number of matching confusion matrices, and overall query performance.
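Two of the variations mentioned here, a log-scale prevalence grid and an adjustable strictness for matching TP+FP against the reported positives, could look roughly as follows. This is a sketch of one possible variant, not the published query: the function names, the `tolerance` parameter and the example grids are all hypothetical.

```python
def log_spaced(low, high, count):
    """Return `count` values from low to high, equally spaced on a log scale."""
    ratio = (high / low) ** (1.0 / (count - 1))
    return [low * ratio ** k for k in range(count)]


def matching_combinations(n_tests, n_positives, prevalences,
                          sensitivities, specificities, tolerance=0.5):
    """Keep combinations whose TP + FP lies within `tolerance` of #positives.

    Here all three parameters are scanned and FP is computed from the
    specificity, so the match against the reported positives is an
    explicit filter; a larger tolerance means a less strict match.
    """
    matches = []
    for prev in prevalences:
        for sens in sensitivities:
            for spec in specificities:
                tp = prev * sens * n_tests
                fp = (1.0 - prev) * (1.0 - spec) * n_tests
                if abs(tp + fp - n_positives) <= tolerance:
                    matches.append((prev, sens, spec))
    return matches


# Illustrative grids only: prevalence from 0.1% to 50% on a log scale.
prevalence_grid = log_spaced(0.001, 0.5, 40)
sens_grid = [k / 100 for k in range(30, 101)]
spec_grid = [k / 100 for k in range(80, 101)]
found = matching_combinations(10_000, 2_900, prevalence_grid, sens_grid, spec_grid,
                              tolerance=5.0)
print(len(found), "combinations within 5 positives of the report")
```

The number of matches depends strongly on the step sizes and on the tolerance, which is the trade-off between precision, number of matching confusion matrices and query performance raised in this paragraph.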

Conclusions
We have developed an easy-to-use Bayesian calculator (Bayes Lines Tool, BLT) to estimate prevalence, sensitivity, and specificity, and therefore TP, TN, FP, and FN numbers, from official test outcome numbers. With typical reports, especially as produced for SARS-CoV-2 tests, revealing just the number of positives and number of tests performed, the BLT SQL implementation generates confusion matrices that fit within the boundaries of a typical simplified report, based on permutations of sensitivity and specificity. Its implementation is thereby not limited to SQL but can be applied on any platform of choice.
The ability to assess posterior probability independent of the circumstances in which diagnostic tests are performed, reveals a wide spectrum of opportunities for new applications both for the scientific community as well as for health professionals and policy makers around the globe. This is especially relevant for the mass testing taking place within the containment strategies of worldwide governments against the SARS-CoV-2. The BLT SQL query for the first time allows one to display a real estimation of the SARS-CoV-2 situation against the background of testing volume and quality and thus will provide a valuable tool for decision makers to monitor the test strategy and the effect of interventional procedures.
This tool will not only allow official institutions to survey the test situation and obtain a better basis for planning their interventions, but also allows for individuals who got tested to use the confusion matrices as an aid for interpreting their test results in view of the population they were tested in.

Data availability
Underlying data
All data underlying the results are linked in section 2.3 of the article. The hypothetical example is given in Table 2. No additional source data is required.

Software availability
Zenodo: Bayes Lines Tool (BLT) -A SQL-script for analyzing diagnostic test results with an application to SARS-CoV-2-testing, http://doi.org/10.5281/zenodo.4594210. 29 Code is available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).
The SQL-code and an example implementation in Excel and a Tableau work-book file can be downloaded at https://bayeslines.org/.

Open Peer Review
professionals and policy makers, and even for individuals to interpret their own test results.
As I am not familiar with SQL-code and would have no idea how to implement this calculator on my own computer, I thought that the authors partly provided "sufficient details of the code, methods (…) to allow replication of the software development". So I have not checked for mistakes in the code, or whether the calculator actually works. It would be really helpful if someone could do that.
Technically speaking, the formulas and the explanations all seem to be in order. I have little to comment on that. However, there are a few semantics-issues that may be resolved, and the rationale of a separate calculator is not entirely clear to me.

Some more specific comments:
Using the phrase 'Bayesian calculator', implies to me that Bayesian statistics have been used, and that the factors the authors mention in their abstract and introduction (sample handling, underlying test protocols etc.) have been taken into account to go from a prior belief about sensitivity/specificity/prevalence to a posterior (after accounting for other factors) belief. However, when reading the manuscript, it turns out that the BLT is nothing more than a huge number of permutations given a starting value of positive and negative results. It provides a range of possible true values, without providing an indication of how realistic all these possibilities are. Therefore, I think the authors are overselling a relatively simple calculator and that the new thing of this BLT is actually only the way the data are presented. I can do these calculations in Excel as well, but that would give me a headache to provide the right figures and graphs. So maybe the whole article should tone down the novelty a bit.

1.
In my previous comment I mentioned the lack of information about how realistic some predictions may be. Would it be possible to add this information? 2.
If I were a policy maker and I would get 3.
I find abbreviations and acronyms confusing, although that may be a personal thing. For example, CM for confusion matrix is only one word less in the word count every time CM is being used. But using the full term is much more informative and easier to read.

4.
I am not sure whether the examples given (20% prevalence in the general population) are realistic. I think the true prevalence of SARS-CoV-2 infections has been much lower at any given point in time.

5.
The authors stated that: "Our results confirm the recent World Health Organization (WHO) statement "that disease prevalence alters the predictive value of test results; as disease prevalence decreases, the risk of false positive increases"." However, this is an inherent given for predictive values. It is in their calculation. So this statement follows from logic, while the way it was written here, it implies that the authors have proven this WHO statement to be correct. And it implies that this is a recent finding. Both are not true. So

6.
for which sensitivity is almost 100% (https://www.finddx.org/covid-19-old/sarscov2-evalmolecular/), but a matter of handling issues and the above-discussed problems." This is incorrect. The false negative rate of the PCR test is documented as a function of time (see ref 1), 1 and is mostly related to low early virus shedding initially. The middle phase, when the test has the lowest false negative rate, is as the authors describe, but then the lack of virus particles as the patient recovers becomes the major factor. Laboratory-based validation of PCR testing, as cited by the authors, differs from the clinical FN as seen in the present study due to potential flaws in the entire sample collection, handling, and processing chain. It is important to consider the whole chain in evaluating error rates. This is perhaps the most important aspect of test errors, since FP results in some inconvenience and/or worry whilst FN provides a dangerous false sense of security.
The analysis of Bavaria (figure 5) is roughly consistent with the numbers reported by Kucirka, but those of the Netherlands (figure 4) suggest a sensitivity range that is highly optimistic. The specificity on the other hand could be very good depending on the regional expertise.

Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Physics, data analysis, simulations, computer modelling, game theory, strategy.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
30% are possible, a range that we excluded in our analysis consistent with Klement and Bandyopadhyay. This points to additional problems when testing asymptomatic individuals, because in case that they are truly infected, a high number of FNs is going to result." We try to avoid judgements such as yours ("since FP results in some inconvenience and/or worry whilst FN provides a dangerous false sense of security"), which may invoke subjective arguments regarding "inconveniences". We acknowledge that this is the general perception about FN versus FP and will elaborate further on this, with your last comment / point. We appreciate that you point this out and will add an additional figure to prevent any suggestion of bias in the article.