Keywords
pooled testing, individual-level risk, machine learning, COVID19
This article is included in the Artificial Intelligence and Machine Learning gateway.
The coronavirus disease 2019 (COVID-19) pandemic has created demand for massive, rapid, and accurate diagnostic testing. A number of reports recommend pooled testing to help increase testing capacity. For example, a recent report provides specific estimates of cost savings in a pooled testing setting.1 Biochemically, multiple reports demonstrate that RT-qPCR (reverse transcription quantitative real-time polymerase chain reaction) tests are amenable to pooled testing strategies.2–4
Pooled testing has a long legacy of quantitative methodologies and some practical implementation successes. A review of pooled testing can be found in Wiley StatsRef: Statistics Reference Online.5 Among pooled testing methodologies, the hierarchical two-step approach is the oldest and the simplest. The approach involves splitting the subjects to be tested into equal-size groups (pools) and testing each pool first. If a pool tests negative, every individual in it is recorded as negative. If the pool tests positive, each individual in it is tested individually. Many variations of this approach have been proposed over the years. The most relevant set of techniques uses individual-level risks in conjunction with pooled testing to determine appropriate pool sizes for more efficient testing. These methods are typically concerned with optimization over a given collection of specimens (e.g. Ref. 6). Unfortunately, such algorithms do not always fit the workflow of pathology and laboratory medicine. It is desirable to be able to make online (at the time of encounter) decisions that assign a specimen to a pool at the point when the specimen is first received in the laboratory. Many labs have limited ability to manipulate and rearrange the specimens multiple times to establish optimal pools. Therefore, an online algorithm that provides an immediate pooling recommendation for each specimen is of practical importance for successful implementation of pooled testing strategies. For this reason, fixed-size pooling, which cannot account for individual-level risk even when available, is by far the most popular approach in practice.
Informatics and artificial intelligence tools have been mobilized to help with the pandemic by allowing for infection risk prediction at the individual level. These risk predictions can be leveraged in workflows to prioritize valuable clinical resources.7 In this report I demonstrate that a greedy online algorithm for specimen assignment based on individual risk predictions can increase COVID-19 testing capacity in a way suitable for providing pooling recommendations for specimens as they come in for testing. Figure 1 presents an overview of this approach.
Figure 1. A. The algorithm relies on a predictive-model-informed risk, $p_i$, that a given specimen to be tested is positive. Population prevalence rates and arbitrary predictive models can be used based on the available predictors, such as basic demographics, risk factors, symptoms, natural language processing derived features from clinical notes, etc. The probabilities are used to group a stream of specimens to be tested into pools that will be tested together. Should a pool test negative, all of the specimens in the pool are recorded as negative, resulting in increased capacity for testing. For a positive pool, additional ascertainment of each individual specimen will be required for a final result. B. The online algorithm makes a decision for any new specimen either to add it to the pool that is being formed or to stop forming that pool and start a new one with this specimen. The decision is made based on the expected capacity gain from pooled testing, calculated from the $p_i$ of the specimen. EHR: Electronic health record.
Suppose the positivity rate is $p$ among the $n$ individuals to be tested. The probability that the pool of these $n$ individuals tests positive is $1-(1-p)^n$, which is one minus the probability that all subjects are negative. Two-step pooled testing requires individual retesting of everyone in a positive pool (Figure 1A). Thus, the expected number of tests is $1+n\left(1-(1-p)^n\right)$. The capacity gain is the ratio of the number of subjects tested to the number of physical units of test performed, $n/\left(1+n\left(1-(1-p)^n\right)\right)$. Capacity gains with pooled testing are achieved when a pool of specimens tests negative (Figure 1A).
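As a worked illustration of the fixed-size two-step quantities described above, the following minimal Python sketch computes the expected number of tests and the capacity gain for a pool of $n$ specimens at a uniform positivity rate $p$, and searches for the pool size with the largest gain. The function names and the search bound `max_size` are assumptions made for this example and are not taken from the author's software.

```python
def pool_positive_probability(p: float, n: int) -> float:
    """Probability that a pool of n specimens, each positive with rate p, tests positive."""
    return 1.0 - (1.0 - p) ** n


def expected_tests(p: float, n: int) -> float:
    """One pooled test plus n individual retests whenever the pool is positive."""
    return 1.0 + n * pool_positive_probability(p, n)


def capacity_gain(p: float, n: int) -> float:
    """Subjects tested per physical test performed."""
    return n / expected_tests(p, n)


def best_pool_size(p: float, max_size: int = 30) -> int:
    """Pool size in 2..max_size with the largest capacity gain (max_size is an arbitrary bound)."""
    return max(range(2, max_size + 1), key=lambda n: capacity_gain(p, n))


if __name__ == "__main__":
    prevalence = 0.0455  # positivity rate in the testing data (Table 1)
    size = best_pool_size(prevalence)
    print(size, round(capacity_gain(prevalence, size), 2))
```

At the 4.55% positivity rate of the testing data, this search selects pools of 5, in line with the fixed pool sizes used for comparison in this report.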
Suppose individual (a priori) estimates of being positive, $p_1, \ldots, p_n$, are available for each individual. Given these estimates, the probability that a pool tests positive is $1-\prod_{i=1}^{n}(1-p_i)$, and the expected capacity gain is $n/\left(1+n\left(1-\prod_{i=1}^{n}(1-p_i)\right)\right)$. The key to deriving a greedy online algorithm is that both of these quantities can be expressed as recurrence relationships, which depend on the quantities already computed for a pool of smaller size. This allows one to maintain an online estimate of the capacity gain of a collection of already processed specimens when making a decision about a new specimen.
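One way to write these recurrences explicitly is sketched below. The symbols $P_n$ (the probability that a pool formed from the first $n$ specimens tests positive) and $G_n$ (the expected capacity gain of that pool) are notation introduced here for illustration and may differ from the author's.

```latex
\begin{align*}
P_0 &= 0, \\
P_n &= 1 - \prod_{i=1}^{n} (1 - p_i) = 1 - (1 - P_{n-1})(1 - p_n), \\
G_n &= \frac{n}{1 + n P_n}.
\end{align*}
```

In this form, updating the pool with a new specimen requires only the previous value $P_{n-1}$, the current pool size, and the risk $p_n$ of the new specimen.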
The estimates of a priori individual risk were obtained using logistic regression. The evaluation data were collected incidental to the Living μBiome Bank™ study8,9 of the microbiomes associated with infectious disease testing, approved by the Medical University of South Carolina (MUSC) IRB (Pro00079660). The data consisted of COVID-19 test results (the response) for adult subjects (age ≥18) with a conclusive test result (“Positive” or “Negative”) who underwent testing at the MUSC Molecular Pathology Lab between March 12 and June 6, 2020 (32,851 cases in total). The design of this study was based on convenience sampling in a relatively short time interval. Cases obtained between May 21 and June 6, 2020 were not used for model fitting and constituted the testing data. Predictors included subject age and indicator variables for whether (i) the test is a follow-up; (ii) the immediately preceding test was positive; (iii) the patient is hospitalized; and (iv) the test was ordered from a hospital location; as well as an interaction term between age and (ii). A logistic regression model followed by stepwise backward feature elimination based on the Akaike Information Criterion (AIC) was used for model selection, using the training data only. The performance of the model was evaluated in the training and the testing data separately. The analyses were conducted using the R statistical programming environment, version 3.6.1.
The number of physical tests needed and the capacity gains of the online algorithm were compared with those of optimal uniform fixed pool sizes based on population prevalence rates.1 For the observations on each day in the testing data, averages over 1,000 permutations of the order of the specimens provided order-independent estimates of the number of tests. A one-sided Wilcoxon signed-rank test was used to evaluate the hypothesis that online recommendations resulted in fewer tests.
The online pooling algorithm (Figure 1B) has been specifically designed to make pooling decisions about each specimen as it arrives for testing. The individual risk information and the estimates of the capacity gains from already processed specimens allow the algorithm to decide whether to add the specimen to the pool that is currently being filled or to close that pool and start a new one with the current specimen.
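A greedy decision rule of this kind can be sketched in Python as shown below. The sketch assumes one specific rule, which is an interpretation rather than a confirmed detail of the author's software: a new specimen joins the current pool unless doing so would lower the pool's expected capacity gain, in which case the pool is closed and a new one is started. The class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class Pool:
    """Running state of the pool being filled."""
    risks: List[float] = field(default_factory=list)
    prob_all_negative: float = 1.0  # product of (1 - p_i) over current members

    def gain(self) -> float:
        """Expected capacity gain n / (1 + n * P(pool positive)) of the current pool."""
        n = len(self.risks)
        return n / (1.0 + n * (1.0 - self.prob_all_negative))

    def gain_if_added(self, p: float) -> float:
        """Expected capacity gain if a specimen with risk p joined the pool."""
        n = len(self.risks) + 1
        prob_positive = 1.0 - self.prob_all_negative * (1.0 - p)
        return n / (1.0 + n * prob_positive)

    def add(self, p: float) -> None:
        """Update the pool in constant time using the recurrence on P(all negative)."""
        self.risks.append(p)
        self.prob_all_negative *= 1.0 - p


def assign_pools(risk_stream: Iterable[float]) -> List[List[float]]:
    """Assign specimens to pools in arrival order (assumed rule: never lower the gain)."""
    closed: List[List[float]] = []
    current = Pool()
    for p in risk_stream:
        if current.risks and current.gain_if_added(p) < current.gain():
            closed.append(current.risks)  # close the pool being filled
            current = Pool()              # start a new pool with this specimen
        current.add(p)
    if current.risks:
        closed.append(current.risks)
    return closed


if __name__ == "__main__":
    print([len(pool) for pool in assign_pools([0.0455] * 20)])
    print(assign_pools([0.01, 0.01, 0.01, 0.9, 0.01]))
```

Under this assumed rule, feeding every specimen the same prevalence-based risk yields pools of a single fixed size, while a high-risk specimen arriving after several low-risk ones closes the current pool and starts a new one.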
The logistic regression model was meant to provide simplistic estimates of individual risk to demonstrate the feasibility and utility of the approach. The variables included in the data showed statistically significant differences between the training and testing data (Table 1), indicating the potential for suboptimal predictive performance.11 The predictive model achieves moderate predictive accuracy, with an estimated area under the receiver operating characteristic curve of 0.62 in the testing data.
Variable (%) | Overall (n = 32,851) | Training data subset (n = 25,714) | Testing data subset (n = 7,137) | Difference in training vs. testing data, χ² test, P value (degrees of freedom) | Included in the final predictive model |
---|---|---|---|---|---|
Positive | 4.68 | 4.71 | 4.55 | 0.59 (1) | Response |
Repeat visit test | 6.07 | 5.09 | 9.58 | <10⁻¹⁶ (1) | Yes |
Previously tested positive | 0.88 | 0.85 | 0.99 | 0.27 (1) | Yes |
Hospitalized | 13 | 13 | 12 | 0.066 (1) | Yes |
Tests ordered from hospitalᵃ | 14 | 15 | 13 | <10⁻⁵ (1) | Noᵇ |
Age group (years) | | | | | Yes |
19–40 | 28 | 29 | 25 | <10⁻¹² (2) | |
40–70 | 53 | 52 | 54 | | |
>70 | 19 | 19 | 21 | | |
Age group (years) within subjects previously tested positive | | | | | Yesᶜ |
Total (n) | 289 | 218 | 71 | | |
19–40 | 27 | 23 | 38 | 0.0025 (2) | |
40–70 | 43 | 48 | 25 | | |
>70 | 30 | 28 | 37 | | |
Parameter estimates and other details of the model fit are shown in Figure 2. The model demonstrates that age, hospitalization status, and whether the individual has been previously tested and/or tested positive are all good high-level predictors of the risk of a positive test. The model is hampered by the imbalance between low- and high-risk patients, which reflects the relatively low population-level risk. Nonetheless, the predicted and empirical risks correlate well, albeit with large variability in the high-risk group.
Figure 2. A. The R generalized linear regression function call and its output following backward stepwise elimination. The features included in the final model were an indicator of whether this was a follow-up test (Follow_up), an indicator of whether the individual had tested positive at any point previously (Previous_positive), age group (18–40, 40–70, >70), and an indicator of whether the individual is hospitalized. An interaction of age and previous positivity is likewise retained in the model. B. Model performance evaluation included comparison of the empirical and predicted probabilities for groups of individuals with matching predictor values (follow-up testing indicator, previous positivity, age, and hospitalization status). The model shows good concordance between the empirical and predicted risk in the lower risk range (inset), and large variability within the higher risk groups.
As is already known from the recent literature, two-step pooling can provide capacity gains over testing everyone individually. This is also demonstrated in our testing data using fixed pools of 5 or 6 specimens (Table 2). These fixed pool sizes were chosen for comparison because of their optimality given prevalence rates. The evaluation of the online algorithm shows that the implementation of this approach may result in a doubling of the testing capacity over testing individually (Table 2). Moreover, on 12 out of 17 days in the testing data the online approach resulted in fewer tests than either fixed pool size. These differences were statistically significant for both fixed pool sizes (P value 0.003 and 0.002, respectively).
Total individual testsᵃ | Number of positive subjects | Expected number of testsᵇ, pools of 5 | Expected number of testsᵇ, pools of 6 | Expected number of testsᵇ, online | Expected capacity increaseᶜ, pools of 5 | Expected capacity increaseᶜ, pools of 6 | Expected capacity increaseᶜ, online |
---|---|---|---|---|---|---|---|
534 | 15 | 178.3 | 173.4 | 167.0 | 2.99 | 3.08 | 3.20 |
1,125 | 49 | 449.4 | 452.5 | 433.0 | 2.50 | 2.49 | 2.60 |
250 | 7 | 83.4 | 81.7 | 73.5 | 3.00 | 3.06 | 3.40 |
38 | 7 | 33.3 | 35.3 | 31.4 | 1.14 | 1.08 | 1.21 |
68 | 0 | 14.0 | 12.0 | 15.0 | 4.86 | 5.67 | 4.55 |
554 | 17 | 191.3 | 188.0 | 175.4 | 2.90 | 2.95 | 3.16 |
826 | 28 | 296.9 | 292.8 | 297.2 | 2.78 | 2.82 | 2.78 |
389 | 20 | 169.0 | 171.7 | 157.8 | 2.30 | 2.27 | 2.46 |
836 | 47 | 378.5 | 386.2 | 374.9 | 2.21 | 2.16 | 2.23 |
223 | 6 | 73.7 | 72.1 | 68.9 | 3.03 | 3.09 | 3.24 |
17 | 1 | 9.0 | 9.0 | 5.7 | 1.89 | 1.89 | 2.98 |
81 | 5 | 39.6 | 40.5 | 41.5 | 2.04 | 2.00 | 1.95 |
398 | 21 | 174.8 | 178.3 | 170.6 | 2.28 | 2.23 | 2.33 |
718 | 36 | 307.7 | 311.1 | 302.1 | 2.33 | 2.31 | 2.38 |
392 | 24 | 185.4 | 190.4 | 191.2 | 2.11 | 2.06 | 2.05 |
619 | 42 | 307.7 | 318.3 | 304.8 | 2.01 | 1.94 | 2.03 |
69 | 0 | 14.0 | 12.0 | 12.8 | 4.93 | 5.75 | 5.41 |
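The summary statements drawn from Table 2 can be reproduced with simple arithmetic. The sketch below transcribes the per-day test counts from the table and recomputes the number of days on which the online approach needed fewer tests than both fixed pool sizes, along with the overall capacity gain. The variable and function names are chosen for this example only.

```python
# Per-day values transcribed from Table 2:
# (total individual tests, tests with pools of 5, tests with pools of 6, tests with online pooling)
TABLE_2 = [
    (534, 178.3, 173.4, 167.0),
    (1125, 449.4, 452.5, 433.0),
    (250, 83.4, 81.7, 73.5),
    (38, 33.3, 35.3, 31.4),
    (68, 14.0, 12.0, 15.0),
    (554, 191.3, 188.0, 175.4),
    (826, 296.9, 292.8, 297.2),
    (389, 169.0, 171.7, 157.8),
    (836, 378.5, 386.2, 374.9),
    (223, 73.7, 72.1, 68.9),
    (17, 9.0, 9.0, 5.7),
    (81, 39.6, 40.5, 41.5),
    (398, 174.8, 178.3, 170.6),
    (718, 307.7, 311.1, 302.1),
    (392, 185.4, 190.4, 191.2),
    (619, 307.7, 318.3, 304.8),
    (69, 14.0, 12.0, 12.8),
]


def capacity_gain(subjects: float, tests: float) -> float:
    """Subjects tested per physical test performed."""
    return subjects / tests


# Days on which online pooling used fewer tests than both fixed pool sizes.
days_online_best = sum(1 for _, p5, p6, online in TABLE_2 if online < min(p5, p6))

total_specimens = sum(row[0] for row in TABLE_2)
overall_online_gain = capacity_gain(total_specimens, sum(row[3] for row in TABLE_2))

if __name__ == "__main__":
    print(days_online_best, total_specimens, round(overall_online_gain, 2))
```

The daily totals sum to the 7,137 specimens in the testing data, and the overall gain computed this way is roughly 2.5, which matches the reported doubling of capacity.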
The online nature of the presented algorithm allows for its easy implementation in many existing laboratory medicine workflows. For example, it may be used to provide pooling recommendations as the specimens are scanned upon receipt in the lab for testing. This feature is unique and important for a feasible and practical solution that fits the existing laboratory medicine workflow. Alternative approaches that optimize the pools globally for a collection of specimens (e.g. Ref. 6) may offer better performance in terms of capacity gains, but require additional manipulation of the specimens to form the pools, which may be feasible in some, but not all, workflows. Implementations of these globally optimizing approaches may be feasible when pool assignments can happen off-line, for example during transport of a batch of specimens from a collection site to a testing facility. In that respect, the online approach trades potential global suboptimality for simplicity and appeal to laboratory management.
The exact implementation of the online pooling approach may need to meet specific operational constraints to be practical. For example, some laboratories may only be capable of testing pools up to a fixed maximum size. These constraints can be naturally incorporated into modified online pooling algorithms. More sophisticated versions of the algorithm are easy to imagine as well. For example, filling multiple pools in parallel can be accommodated with a straightforward extension.
Institutional implementation of any pooled testing approach that utilizes individual-level risks requires solutions to many informatics ecosystem problems. First, the models providing individual-level risk predictions need to be updated frequently to account for changes in prevalence by risk factors and for other contributors to model drift. In practice, a nightly model fit update may be feasible and necessary. Second, the model predictions need to be triggered at an appropriate time between specimen collection and the specimen's arrival in the laboratory for testing. Third, pooling recommendations have to be either pre-computed or involve only lightweight computations. In either case, these recommendations have to be easily available in the laboratory information system. Combined, these challenges point to a likely need for close intra-institutional cooperation between laboratory medicine, analytics, data science, and informatics operations.
The testing capacity increases from pooled testing rely on the quality of the predictive models. When predictors are not available, population prevalence rates can be input into the online algorithm, and the resulting two-step groupings will be equivalent to optimal pooling into pools of fixed size. In this paper, the evaluation of the algorithm involved a clearly suboptimal set of predictors and a simple risk prediction approach (multivariable logistic regression). Nonetheless, the online algorithm provides an improvement over simple two-step pooling. Better predictive models will result in even larger capacity gains. Improved predictive performance could result from employing the data that are readily available or computable at the time of diagnostic specimen collection. For example, many of the data elements from the Health and Human Services guidance on laboratory reporting10 could be included. Other structured data, such as questionnaires collecting evidence and degree of exposure, and telehealth-derived variables can prove useful as well. Likewise, unstructured text data processed by natural language processing techniques can be useful.7 Further, more sophisticated machine learning and artificial intelligence approaches may be used to combine all of the available data sources for superior risk estimates in online pooling recommendations.
The results reported herein are immediately translatable to laboratory medicine operations. Even without sophisticated predictive models, given the current state of the pandemic, the online pooling algorithm can double the COVID-19 testing capacity. Better risk prediction models may result in even better capacity improvements. In the longer term, similar strategies can be used for implementation of massive-scale testing for other diseases.
Access to the full dataset cannot be made available publicly since it contains elements of personal health information (PHI); however, access will be granted to readers and reviewers upon signing an MUSC IRB-approved data use agreement. Please contact the author to initiate the process (alekseye@musc.edu).
Summary data and software in support of this work are available at https://github.com/alekseyenko/AIIAT/ and https://doi.org/10.5281/zenodo.7541444.11 The file predict_positive_r2.pdf contains summary data.
Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 1.0 Public domain dedication).
The author would like to thank Katie Kirchoff, Bashir Hamidi, Jihad Obeid, Matthew Turner, Stephane Meystre, and Leslie A. Lenert for discussing the merits of the ideas presented in this brief report.
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Partly
If applicable, is the statistical analysis and its interpretation appropriate?
Partly
Are all the source data underlying the results available to ensure full reproducibility?
Partly
Are the conclusions drawn adequately supported by the results?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Data Science, Machine Learning, Artificial Intelligence, Bioinformatics, Image Processing
Is the work clearly and accurately presented and does it cite the current literature?
Partly
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
No
Are the conclusions drawn adequately supported by the results?
Yes
References
1. McMahan CS, Tebbs JM, Bilder CR: Informative Dorfman screening. Biometrics. 2012; 68 (1): 287-96.
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Statistics, biostatistics
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Partly
If applicable, is the statistical analysis and its interpretation appropriate?
I cannot comment. A qualified statistician is required.
Are all the source data underlying the results available to ensure full reproducibility?
Partly
Are the conclusions drawn adequately supported by the results?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Diagnostic virology, toxicologic pathology
Version 1: 23 Jan 23