Evidence-based Clinical Decision Support Systems for the prediction and detection of three disease states in critical care: A systematic literature review

Background: Clinical decision support (CDS) systems have emerged as tools providing intelligent decision making to address the challenges of critical care. CDS systems can be based on existing guidelines or best practices, and can also utilize machine learning to provide a diagnosis, recommendation, or therapy course. Methods: This research aimed to identify evidence-based study designs and outcome measures to determine the clinical effectiveness of clinical decision support systems in the detection and prediction of hemodynamic instability, respiratory distress, and infection within critical care settings. PubMed, ClinicalTrials.gov and the Cochrane Database of Systematic Reviews were systematically searched to identify primary research published in English between 2013 and 2018. Studies conducted in the USA, Canada, UK, Germany and France with more than 10 participants per arm were included. Results: In studies on hemodynamic instability, the prediction and management of septic shock were the most researched topics, followed by the early prediction of heart failure. For respiratory distress, the most popular topics were pneumonia detection and prediction, followed by pulmonary embolisms. Given the importance of imaging and clinical notes, this area combined machine learning with image analysis and natural language processing. In studies on infection, the most researched areas were the detection, prediction, and management of sepsis, surgical site infections, and acute kidney injury. Overall, a variety of machine learning algorithms were utilized frequently, particularly support vector machines, boosting techniques, random forest classifiers and neural networks. Sensitivity, specificity, and ROC AUC were the most frequently reported performance measures. Conclusion: This review showed an increasing use of machine learning for CDS in all three areas.
Large datasets are required for training these algorithms, making it imperative to appropriately address challenges such as class imbalance, correct labelling of data and missing data. Recommendations are formulated for the development and successful adoption of CDS systems.


Introduction
Critical care, including intensive and emergency care, is the most expensive and human-resource-intensive area of in-hospital care. Despite having the most technologically advanced devices, it is the area associated with the highest morbidity and mortality rates 1. Decision-making for clinical teams in this area is complex due to variability in procedures and data overload from the plethora of existing devices. In fact, misdiagnosis in the intensive care unit (ICU) is 50% more common than in other areas 2, and errors, especially medication errors, which account for 78% of serious errors 3, can have a long-lasting effect even after patients are discharged.
Computerized decision support (CDS) systems have emerged as tools providing intelligent decision making based on patient data to address many of the challenges of critical care. CDS systems can be based on existing guidelines or best practices, and can also utilize machine learning as a means of compiling several data inputs to provide a diagnosis, recommendation, or therapy course. CDS systems can improve medication safety by providing recommendations relating to dosing 4-6, administration frequencies 5, medication discontinuation 6 and medication avoidance 5. Moreover, these novel systems can improve the quality of prescribing decisions by triggering alerts or warning messages on drug duplication, contraindications, drug interaction errors 7, side-effects and inappropriate medication orders 5. CDS system notifications can be applied during the prescribing, administering or monitoring stages to detect and prevent medication errors 8. These systems can also target patients, facilitating shared decision-making to empower as well as motivate them 9-11. The need for such systems stems from hospitals having to deal with strict guidelines to improve outcomes, document care cycles (raising the need for administrative tasks) and reduce readmissions. This is combined with the need to cope with financial constraints, such as staff shortages and increased pressure to reduce the length of stay 12,13. Strategies for bringing CDS to clinics have been the topic of several workshops, conferences and focus groups 14. Factors for success in designing CDS include providing measurable value, producing actionable insights, delivering information to the user at the right time, and demonstrating good usability principles 14.
Early warning systems (EWS) are CDS systems designed for initial assessment and identification of patients at risk of deterioration in in-patient ward areas 15-17. These systems have shown that they can enable caregivers and rapid response teams to respond earlier, in time to make a difference 18. By alerting clinicians to higher-risk patients, treatments can be administered early or harmful medications can be stopped, potentially leading to improved outcomes. Early recognition and timely intervention are also critical steps for the successful management of shock 19, cardiorespiratory instability 20 and severe sepsis. In sepsis management, adequate timing of antibiotic administration is directly associated with survival rates 21, as well as with the incidence, severity and duration of infections.
According to the Society of Critical Care Medicine (SCCM) 22 , the five primary ICU admission diagnoses for adults are respiratory insufficiency/failure with ventilator support, acute myocardial infarction, intracranial hemorrhage or cerebral infarction, percutaneous cardiovascular procedures, and septicemia or severe sepsis without mechanical ventilation. SCCM also highlights other conditions involving high ICU demand such as poisoning and toxic effects of drugs, pulmonary edema and respiratory failure, heart failure and shock, cardiac arrhythmia and renal failure. Given the above, three high-impact areas were selected for the current research where early detection and treatment could impact outcomes for patients in the ICU. The first is that of hemodynamic instability, where early detection could help patients prevent deterioration into shock. The second is that of respiratory distress, affecting many ventilated patients (up to 40% are ventilated according to SCCM) 22 . The third area selected is that of infection, with a focus on sepsis. Sepsis is the most common cause of death among critically ill patients, with occurrence rates varying from 13.6% to 39.3% 23,24 . All three areas are major areas of concern with relatively high prevalence in critical care having long term effects on patients.
The study focuses on both detection, which alerts the clinician to the presence of these specific conditions, as well as prediction of deterioration by alerting the clinician in advance that a patient will deteriorate into one of these disease states. The aims of this study were to perform and report a systematic review of the utilization of CDS systems in the three selected disease areas and summarize the methodological aspects of identified studies.

Search strategy
A systematic literature review was carried out to identify evidence-based study designs, methods and outcome measures that have been used to determine the clinical effectiveness of CDS systems in the detection and prediction of three populations representing the variety and majority of morbid conditions in a critical care setting: shock (hemodynamic (in-)stability), respiratory distress/failure, and infection/sepsis. PubMed, ClinicalTrials.gov and the Cochrane Database of Systematic Reviews were systematically searched to identify primary research published in English between 2013 and 2018. Another method to ensure up-to-date results was to include conference abstracts from 2017 onwards, regardless of whether or not they were followed up with a detailed publication. Ongoing studies identified in the clinical trials register were also kept in the review. Study protocols identified from bibliographic databases were, however, excluded on the assumption that final study results would be available and identified elsewhere. The strategy employed in PubMed is provided as Extended data, Table 1-Table 3 25-27. Studies conducted in the US, Canada, UK, Germany or France with more than 10 subjects per arm were included; these countries were selected because they are known to be active in CDS development. The inclusion and exclusion criteria for selecting abstracts and subsequent full-text publications were based on the population, interventions, comparators, outcomes, and study design (PICOS). These criteria are listed in Table 1.

Amendments from Version 1
All comments from the Reviewers were addressed in the updated version. We could not address the layout issue raised by Reviewer 1, as how tables are rendered in the PDF is the Journal's decision. The question of Reviewer 2 regarding the rationale for including the studies predicting AKI within the Infection/sepsis results section is addressed here: severe infection is a major cause of AKI in ICU patients, while conversely, AKI patients are at increased risk for infection [1]. Sepsis is an important cause of AKI, and AKI is a common complication of sepsis [2]. We felt that, given this relationship, CDS for AKI fits well under this section. The reviewer is correct to propose the link between AKI and shock; however, not all AKI cases lead to shock, so we felt it matched this section more.
Study selection and data extraction
Study selection and data extraction were carried out by a single reviewer (MKK or SP). In cases of uncertainty, a second, or even a third, reviewer was consulted. Data extraction was performed using a standard data extraction form (DEF). Key data from each additional eligible study were extracted by recording data from the original reports into the DEF. The DEF included information on study design, inclusion/exclusion criteria, sample size, and outcome measures such as:
• Specificity (SD) (%)
• NPV (%)
• PPV (%)
• Likelihood ratio
The outcomes were to be reported per arm (study group vs. control group) individually, and as the difference between the two arms.
* Systematic literature reviews and (network) meta-analyses were excluded from data extraction since their pooled results cannot be used in our analysis. However, good-quality (network) meta-analyses and systematic literature reviews (i.e. Cochrane reviews) were used for cross-checking references to verify that the search had not omitted any articles.
** If a study was conducted in multiple countries and at least one of those countries was among the included countries, the study was included in the selection.
*** Mathematical and logistic regression models can be used to validate and evaluate interventions of interest (those listed as included interventions), but texts discussing these models without any "learning potential" or artificial intelligence potential were excluded. These models can therefore be the foundation of the included interventions, but were not included in the data extraction files unless they also had machine learning, artificial intelligence or some other form of "learning potential" on top of the statistical model. Researchers paid special attention and caution when screening these abstracts and/or full-text articles.
Studies identified from the ClinicalTrials.gov registry that did not report results were also included in the extraction to give some indication of the outcomes being collected.

Study quality appraisal
This research was not aimed at summarizing study results and assessing the relative effectiveness of CDS systems. Therefore, an appraisal of study quality was not deemed necessary.

Shock (hemodynamic (in-)stability)
The search yielded 1588 hits. Screening the titles and abstracts led to 1502 being excluded. The full texts of the remaining 86 titles were obtained and assessed against the PICOS criteria. Studies were excluded due to irrelevant study design (n=22), population (n=1), intervention (n=5), and outcomes (n=38).
A total of 20 studies were finally included in this systematic literature review. This included 5 trials identified from ClinicalTrials.gov. The study selection process is depicted in Figure 1. All studies, except one, trained a single algorithm. Ebrahimzadeh et al. 2018 30 trained and compared support vector machine (SVM), instance-based and neural network models to predict paroxysmal atrial fibrillation. SVMs were the most frequently used algorithms, followed by least absolute shrinkage and selection operator (LASSO) regularization. In one study, the SVM was trained using sequential minimal optimization 37 .
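As a hedged illustration of this class of models (not code from any reviewed study), the sketch below trains an SVM on synthetic data with scikit-learn, whose SVC implementation is backed by libsvm and solves the underlying optimization problem with an SMO-type decomposition method:

```python
# Illustrative sketch only: an SVM trained on synthetic data.
# scikit-learn's SVC is backed by libsvm, whose solver is a variant of
# sequential minimal optimization (SMO); none of this data comes from
# the reviewed studies.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

clf = SVC(kernel="rbf", C=1.0)  # RBF kernel; C controls regularization strength
clf.fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```

In practice, the reviewed studies trained such models on clinical features rather than synthetic ones, but the fit/evaluate pattern is the same.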
Machine learning models were trained and validated in 14 studies and subsequently tested in an independent dataset in 3 studies 19,35,37. In one study, an algorithm trained to classify arrhythmias was not validated but compared to physicians' manual classifications 34.
An overview of the investigated machine learning algorithms is presented in Table 3.
Outcome measures. Three of the 15 papers measured a single outcome of model performance. In two studies the preferred measure was accuracy 28,34, whereas in another study it was the ROC AUC; this was a large study that based its algorithm on EHRs 33. Across all studies, accuracy was reported in about half of the instances, and the ROC AUC was one of the most frequently reported outcomes.
Sensitivity and specificity were reported together in 10 studies. Blecker et al. 2016 38 reported sensitivity together with PPV. Sensitivity and specificity were not measured in the study by Sideris et al. 2016 37; instead, model accuracy and the ROC AUC were preferred. This study was concerned with developing an alternative 'comorbidity' framework based on disease and symptom diagnostic codes to cluster individuals at low to high risk of developing chronic heart failure.
PPVs were reported in six studies and accompanied with negative predictive values in two studies. These studies developed and validated machine-learning algorithms for the early detection of less investigated health conditions, these being hemodynamic instability in children 19 and acute decompensated heart failure 39 . The highest number of outcome measures, including likelihood ratios, was observed in Calvert et al. 2016 40 who investigated an underrepresented population of patients with Alcohol Use Disorder.
The outcomes measured are summarized in Table 4.
Ongoing studies. Five studies are currently ongoing, one in Germany 43 and the others in the USA 44-47. Two studies are prospective case series 44,47, two are prospective cohort studies 43,45 and one is an RCT 46. Two of the studies are concerned with developing prediction models, and the others with implementing machine learning algorithms into clinical practice as early warning systems.
The details of these trials are summarized in Table 5.

Respiratory distress/failure
The search yielded 1279 hits. Screening the titles and abstracts led to 1142 being excluded. The full texts of the remaining 137 titles were obtained and assessed against the PICOS criteria. Studies were excluded due to irrelevant study design (n=42), population (n=6), intervention (n=18), outcomes (n=47), and being conference proceedings from before 2017 (n=2). A total of 22 studies were finally included in this systematic literature review. None of the trials retrieved from ClinicalTrials.gov were included. The study selection process is depicted in Figure 2. The characteristics of all published studies are given in Table 6.

CDS systems.
About half of the studies developed machine-learning algorithms, whereas the other half focused on natural language processing (NLP) algorithms. One study differed from the rest by developing a computer-aided detection (CAD) system to measure the axial diameters of the right and left ventricles, aiding in the diagnosis of pulmonary embolisms 49. Many learning algorithms were concerned with detecting pulmonary embolisms and deep vein thrombosis 53,54,58,59,64-67 as well as pneumonia 33,48,57,60-63. Three studies developed machine-learning algorithms to detect COPD 50,56,69. One study developed a machine-learning algorithm to detect acute respiratory distress syndrome 52, while other studies developed machine-learning algorithms to detect respiratory distress or failure following a pressure support ventilation trial 67, cardiovascular surgery 55 and pediatric tonsillectomy 51. The classifiers used in the NLP-based studies varied. However, some commonalities emerged between the studies developing machine-learning algorithms: multiple studies applied SVM, logistic regression, random forest, k-nearest neighbor (kNN), gradient boosting and neural network models. Various classifiers were explored in 5 studies.
Machine learning and NLP-based algorithms were trained and validated in 20 studies and subsequently tested in an independent dataset in 6 studies 52,56,60-62,67 . The CAD system mentioned above and an electronic pulmonary embolism severity index were trained and compared to a reference dataset classified by physicians 49,53 .
An overview of the developed learning algorithms is provided in Table 7. The outcomes measured are summarized in Table 8.
Many of the studies that developed NLP-based algorithms reported negative and positive predictive values, as well as sensitivity and specificity. In contrast, the ROC AUC was the most frequently reported outcome measure of machine learning algorithm performance. It was also the single preferred outcome in three studies 33,50,55. About half of the studies additionally reported sensitivity, specificity, and accuracy. One study reported specificity with sensitivity set at 90% and 95% to ensure that few disease-positive cases were missed 52. The single study that developed a CAD system measured the ROC AUC and model accuracy 49.
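The fixed-sensitivity reporting strategy can be sketched as follows. This is an illustrative implementation on synthetic scores; the function name and data are hypothetical, not taken from the reviewed study:

```python
import math

# Hypothetical sketch: choose a decision threshold so that sensitivity reaches a
# fixed target (e.g. 90%), then report the specificity achieved at that threshold.
def specificity_at_sensitivity(y_true, scores, target_sensitivity):
    """Pick the highest threshold with sensitivity >= target; return specificity."""
    pos = sorted((s for s, t in zip(scores, y_true) if t == 1), reverse=True)
    k = math.ceil(target_sensitivity * len(pos))  # positives that must be caught
    threshold = pos[k - 1]        # classify as positive if score >= threshold
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    true_negatives = sum(1 for s in neg if s < threshold)
    return true_negatives / len(neg)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.5, 0.3, 0.2, 0.1, 0.05]
print(f"{specificity_at_sensitivity(y_true, scores, 0.90):.3f}")  # prints 0.667
```

Fixing sensitivity trades away specificity, which is why studies that adopt this scheme report specificity at each chosen operating point.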

Infection or sepsis
The search yielded 2659 hits. Screening the titles and abstracts led to 2562 being excluded. The full texts of the remaining 97 titles were obtained and assessed against the PICOS criteria. Studies were excluded due to irrelevant study design (n=41), population (n=4), intervention (n=6) and outcomes (n=14).
A total of 31 studies were finally included in this systematic literature review. Four of these were ongoing trials. The study selection process is depicted in Figure 3.

Study characteristics.
Of the included studies, 24 were conducted in the US. Three studies were conducted outside the US: one in France, one in the Netherlands and one in the UK. In total, 21 studies were retrospective 33,35,70-88 and six were prospective 89-94. The characteristics of all published studies are given in Table 9. Less frequently applied approaches included rule learning 70, among others 81,88. The most frequently applied model was random forest (15 studies) followed by logistic regression (10 studies), support vector machines (5 studies), naïve Bayes (5 studies) and gradient tree boosting (5 studies).
One study compared three different sampling methods for handling class imbalance: under-sampling the majority class (RANDu), over-sampling the minority class (RANDo), and synthetic minority over-sampling (SMOTE). This was a very large study, including more than 500,000 patients, to predict the onset of infections 75. The authors found that SMOTE outperformed the other techniques and improved model sensitivity. Two other very large studies used the RANDu method 80 and mini-batch stochastic gradient descent with backpropagation 85. No other studies were concerned with imbalance between disease-positive and disease-negative classes.
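For illustration, a minimal SMOTE-style oversampler can be written in a few lines of NumPy. This is a simplified sketch of the idea (synthetic data, hypothetical function name), not the implementation used in the reviewed study, which would typically rely on a library such as imbalanced-learn:

```python
# Minimal SMOTE-style oversampling sketch: each synthetic point is interpolated
# between a minority-class sample and one of its k nearest minority-class
# neighbours. Illustrative only.
import numpy as np

def smote_like(X_min, n_synthetic, k=3, rng=None):
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise distances within the minority class; exclude self-matches
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]  # k nearest minority neighbours
    out = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))           # random minority sample
        j = neighbours[i, rng.integers(k)]     # one of its nearest neighbours
        lam = rng.random()                     # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like(minority, n_synthetic=6, rng=0)
print(synthetic.shape)  # (6, 2)
```

Because each synthetic case lies between two real minority cases, SMOTE enlarges the minority region rather than duplicating points, which is why it tended to improve sensitivity in the cited comparison.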
Machine learning models were trained and validated in 26 studies and subsequently tested in an independent dataset in four studies 35,72,75,77 .
The machine learning algorithms used are illustrated in Table 10. Sensitivity and specificity were reported together in 14 studies 35,70-72,74,75,78,81-84,87,90,92. When specificity was not reported, sensitivity was reported together with PPV; and when sensitivity was not reported, this was because sensitivity had been set at a fixed value to report other diagnostic performance measures. In line with the prior observation, more studies reported PPV than NPV. Four studies reporting likelihood ratios reported both negative and positive likelihood ratios 70,74,81,84.
An overview of measured outcomes is illustrated in Table 11.
Ongoing studies. Four trials are currently ongoing, one in Germany and the others in the USA, all concerned with the prediction of sepsis. Three of them are prospective studies and one is retrospective. The retrospective study aims to develop a prediction algorithm based on claims data, EHRs, risk factors and survey data of an estimated 50,000 adult patients admitted to the ED. The German study NCT03661450 95 is a single-arm trial evaluating the utility of a CDS system to identify SIRS or sepsis from EHRs in a pediatric ICU population. Another single-arm trial NCT03655626 47 is concerned with implementing a sepsis prediction algorithm in clinical practice as an early warning system. NCT03644940 46 is comparing two versions of InSight introduced into clinical practice as an early warning system.

Discussion and conclusions
This systematic literature review shows that over the last two decades, there has been increasing interest in CDS as a means of supporting clinicians in acute care. CDS has been investigated for several applications, ranging from the detection of health conditions 60,61 to the prediction of deterioration or adverse events 40,55,76,81,83,84. Applications also include therapy guidance, as well as updating clinicians on new or changed recommendations 96. CDS can also provide guidance by predicting clinical trajectories for different patient profiles over time 97.
From rule-based algorithms and simple regression models, CDS has evolved to encompass a multitude of machine-learning techniques 98. The choice of technique can depend on the problem selected and the data types used. Across the three disease areas investigated, random forest classifiers (28.1%), support vector machines (21.9%), boosting techniques (20.3%), LASSO regression (18.8%) and unspecified logistic regression models (10.9%) were observed most frequently. The use of more complex modeling, such as maximum entropy, hidden Markov models (for temporal data analysis) and convolutional neural networks, has also emerged over the last few years. In the respiratory distress area, the use of NLP models is more common, as radiology reports and clinical notes are the main sources of input. Different image analysis techniques have been developed to aid in the prediction and diagnosis of respiratory events from radiology images.
Typical measures of NLP model performance include sensitivity, specificity and predictive values. In measuring ML algorithm performance, sensitivity, specificity and ROC AUC are more common. A wide range of outcome measures was reported in research on less-investigated health conditions 40,67, and also when uncommon, more complex algorithms were compared to basic algorithms 74,78,81,84. This is not surprising given the novelty of these applications.
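These measures are straightforward to compute from a confusion matrix and ranked scores. The sketch below (synthetic labels and scores, hypothetical function names) shows sensitivity, specificity, predictive values, and the ROC AUC via its rank-statistic interpretation:

```python
# Sketch of the most commonly reported performance measures (synthetic values).
def confusion_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),  # recall on disease-positive cases
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

def roc_auc(y_true, scores):
    """AUC = probability a random positive case outranks a random negative one."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0, 0, 1]
probs = [0.9, 0.8, 0.3, 0.7, 0.4, 0.2, 0.1, 0.6]
preds = [1 if s >= 0.5 else 0 for s in probs]  # threshold at 0.5
print(confusion_metrics(labels, preds), roc_auc(labels, probs))
```

Note that sensitivity, specificity and the predictive values depend on the chosen threshold, whereas the ROC AUC summarizes ranking performance across all thresholds, which partly explains its popularity as a single reported figure.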
Many of the ML algorithms and all of the NLP models covered in this work were based on medical data collected at specific clinical sites rather than on publicly available data. Datasets from national audits, completed studies or other online sources can additionally play a role, particularly in model validation and testing. This could aid in the adoption and wider use of CDS systems. In this SLR, publicly available datasets were mainly utilized for developing prediction models of heart arrhythmias 29-31, hypotension 32, septic shock 28,33,40,41, COPD 50, pneumonia 33 and a range of infections 33,76,78,81,84,86. In only three cases were they used for testing model performance in sepsis and septic shock prediction; this included the InSight algorithm 35,85,93.
Most of the studies identified in this SLR were retrospective and originated in the USA, where electronic health records (EHR) are in widespread use. Based on the reviewed studies, recommendations for the development and successful adoption of CDS systems include:
• Providing the right information, in the right intervention format, to the right person at the right point in their workflow, and through the right channel.
• Developing tools and concrete proof-points able to assess CDS efficacy in the clinic. This also highlights the importance of providing continuous feedback to clinicians.
• The importance of easy-to-use user interfaces and a focus on human-computer interaction during deployment.
• Efficient training that is available when needed.
• Being aware of alert or alarm fatigue and not overloading clinicians with alerts due to CDS. The intensive care unit is already plagued with alarms, and if anything, CDS should help in reducing alarms by bundling alerts according to underlying conditions.
• Displaying the rationale for decisions as well as the underlying data to clinical users would lead to improved adoption.
• Understanding ethical challenges for CDS, as well as a careful risk assessment in every site before deployment 106 .
• Being able to repeat/standardize implementation across organizations: most prospective studies reviewed in this work covered single centers, and only a few were multicenter studies.

Data availability
Underlying data
All data underlying the results are available as part of the article and no additional source data are required.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Milena Kovacevic
Department of Pharmacokinetics and Clinical Pharmacy, Faculty of Pharmacy, University of Belgrade, Belgrade, Serbia

The review summarizes the utilization of clinical decision support (CDS) systems in three selected disease states in critical care: shock/hemodynamic (in-)stability, respiratory distress/failure, and infection/sepsis. The background of the study has a strong rationale.
The study comprised the results from primary sources, describing models/algorithms used to detect and alert clinicians to the presence of these conditions, as well as models/algorithms developed to predict deterioration in an individual patient state, leading to these selected conditions. The systematic review was performed and the findings are presented in line with the PRISMA guidelines. Variables for which data were sought were clearly stated (PICOS) in Table 1.

Specific comments:
What I found especially beneficial for the readers and future research in this area, is Table 2 with the presented collected data used for training algorithms.
It would be beneficial to provide additional information whether an internal or external validation was performed -within Table 4 (measured outcomes in studies on shock), Table 8 (measured outcomes in studies on respiratory distress/failure) and Table 11 (measured outcomes in studies on infection/sepsis).
What was the rationale for including the studies predicting acute kidney injury within the Infection/sepsis results section? If it is about the decline in glomerular filtration rate due to hypotension seen in sepsis, it might have been presented within the Shock section.

Table 7: include the abbreviations for ARDS (acute respiratory distress syndrome), ARDE (acute respiratory disease events) and DVT (deep vein thrombosis) below the table.

Table 9: include the abbreviation for AKI (acute kidney injury) below the table.

Are the rationale for, and objectives of, the Systematic Review clearly stated?
Yes

Table 1 lists "Multivariable hierarchal logistic regression models*** (models which are based only on statistics, but there is no machine learning)" as an exclusion criterion. This is clearly not the suitable platform to resolve this issue, but the distinction between machine learning and statistics is not at all that clear. Specifically, any regression method (statistics) could be classified under the term "supervised learning". So, logistic regression IS a machine learning method, as are LASSO and several other methods reported. Again, this is not the appropriate place for going into further details, but there is certainly some confusion, especially when logistic regression keeps appearing in the results as a preferred method.
Again concerning terminology, the term "accuracy" appears often in the results section. Sometimes it is reported as an outcome distinct from, e.g., ROC AUC, sensitivity and specificity. All of the latter quantify "accuracy" in some way, and some clarification is needed.
Minor comments: Table 1: Treatment/Intervention, a parenthesis is missing.
Tables 7 & 10: Maybe reverse the orientation of the column titles, it is impossible to read on a screen.

Are sufficient details of the methods and analysis provided to allow replication by others?
Yes
Is the statistical analysis and its interpretation appropriate?

Not applicable
Are the conclusions drawn adequately supported by the results presented in the review?
Partly

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Statistics

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.