Updating the evidence on the effectiveness of the alcohol reduction app, Drink Less: using Bayes factors to analyse trial datasets supplemented with extended recruitment

Background: A factorial experiment evaluating the Drink Less app found no clear evidence for main effects of enhanced versus minimal versions of five components but some evidence for an interaction effect. Bayes factors (BFs) showed the data to be insensitive. This study examined the use of BFs to update the evidence with further recruitment. Methods: A between-subject factorial experiment evaluated the main and two-way interaction effects of enhanced versus minimal version of five components of Drink Less. Participants were excessive drinkers, aged 18+, and living in the UK. After the required sample size was reached (n=672), additional data were collected for five months. Outcome measures were change in past week alcohol consumption and Alcohol Use Disorders Identification Test (AUDIT) score at one-month follow-up, amongst responders only (those who completed the questionnaire). BFs (with a half-normal distribution) were calculated (BF<0.33 indicate evidence for null hypothesis; 0.33<BF<3 indicate data are insensitive). Results: Of the sample of 2586, 342 (13.2%) responded to follow-up. Data were mainly insensitive but tended to support there being no large main effects of the enhanced version of individual components on consumption (0.22<BF<0.83) or AUDIT score (0.14<BF<0.98). Data no longer supported there being two-way interaction effects (0.31<BF<1.99). In an additional exploratory analysis, participants receiving four of the components averaged a numerically greater reduction in consumption than those not receiving any (21.6 versus 12.1 units), but the data were insensitive (BF=1.42). Conclusions: Data from extended recruitment in a factorial experiment evaluating components of Drink Less remained insensitive but tended towards individual and pairs of components not having a large effect. In an exploratory analysis, there was weak, anecdotal evidence for a synergistic effect of four components. 
In the event of uncertain results, calculating BFs can be used to update the strength of evidence of a dataset supplemented with extended recruitment.


Introduction
A factorial experiment evaluating the effect of 'enhanced' versus 'minimal' versions of five components of the alcohol reduction app, Drink Less, found no clear evidence for simple effects but did find evidence that two-way combinations of certain 'enhanced' components together resulted in greater reductions than 'minimal' versions 1 . This was a planned analysis but should be interpreted with caution as the two-way interactive effects were not specifically hypothesised a priori and were part of multiple interactions tested. Findings of this sort are not uncommon in experimental studies. One approach is to start another randomised trial specifically to test this hypothesis. A potentially more efficient alternative is to extend the trial with further recruitment and test this and other hypotheses using Bayes factors 2,3 . We used this approach with the Drink Less app.
Bayes factors are a measure of strength of evidence and allow researchers to 'top-up' their results from one trial with additional data collected, regardless of the stopping rule, unlike frequentist statistics 2 . The use of Bayes factors supports efficient, incremental model building 3 , as evidence can be continuously accumulated until it is clear whether there is an association or not 2,4 . The rapid accumulation of large amounts of data about digital behaviour change interventions (DBCIs) offers the opportunity to apply emerging methods to their evaluation. DBCIs often have the capacity to continue automatic data collection beyond the end of a trial with little or no additional resources. This paper will illustrate how Bayes factors can be used to optimise a DBCI by updating evidence from an effectiveness trial using the example of Drink Less-an alcohol reduction app.
Bayes factors are the ratio of the average likelihood of a set of data under two competing hypotheses and can overcome some of the issues associated with traditional frequentist statistics 5 . They indicate the relative strength of evidence for two hypotheses; when evaluating interventions, the two hypotheses are typically the alternative hypothesis (the intervention had the desired effect) and the null hypothesis (the intervention had no effect). Bayes factors, unlike frequentist statistics, can distinguish between two interpretations of a non-significant result: i) support for the null hypothesis of 'no effect' and ii) data are insensitive to detect an effect i.e. 'unsure about the presence of an effect' 5,6 . Calculating Bayes factors to supplement frequentist statistics is a quick and simple procedure with several software packages freely available (e.g. an online calculator developed by Zoltan Dienes 7 ). Researchers are actively encouraged to supplement, or even replace, classical frequentist hypothesis testing with a Bayesian approach to provide greater interpretative value to any non-significant results 8 . This is important as non-significant results are often misinterpreted as evidence for no effect; a review of trials conducted in addictions research found that the reporting of 'no difference' was only appropriate in a small number of papers reporting this 9.
The use of Bayes factors also has another major advantage over the traditional frequentist approach that relates to the stopping rule. The traditional frequentist approach necessitates a strict stopping rule and a single analysis of data. Typically, this involves an a priori power calculation to specify the required sample size for data collection and the trial to end at that point. Subsequent 'topping-up' of existing data and re-analysing the new larger data set is 'prohibited' 10 . This is because any p-value between 0 and 1 is equally likely if the null hypothesis is true, regardless of how much data are collected 11 . Therefore, given enough time and data collection, a significant p-value will always be obtained even if the null hypothesis is true 10 . So if researchers find a non-significant result (which cannot distinguish between support for the null hypothesis and data being insensitive to detect an effect), a new study would be required to build on these findings. Restarting the process is a waste of research resources but necessary in the context of using a frequentist approach for analysis because additional data collected cannot be analysed. However, this is not the case when using Bayes factors, as they are driven towards zero when the null hypothesis is true and additional data are collected 10 . Therefore, researchers may use Bayes factors to analyse additional data to complement an employed stopping rule 2 .
In the evaluation of DBCIs, using Bayes factors is beginning to complement traditional frequentist statistics 4,12 , and analysing additional data would be of particular benefit. Data collection for a DBCI effectiveness trial is typically automated and therefore does not require additional resources to continue after a pre-specified sample size is reached. Rapid evaluations of DBCIs and efficient accumulation of evidence can be used to inform future versions, keeping pace with advances in technology. Using Bayes factors to update findings about the relative plausibility of the two hypotheses allows researchers to assess the DBCI's effectiveness in an ongoing manner 4 . This remains useful when deciding about whether there is sufficient evidence to demonstrate effectiveness and, therefore, continued development 13 . To the authors' knowledge, no DBCIs have used additional data collected to supplement original effectiveness trial findings and no trials have used Bayes factors to provide further insight based on additional data. However, Bayes factors have been used in trials for superiority, non-inferiority and equivalence designs to allow for explicit quantification of evidence in favour of the null hypothesis 14 . Bayesian analyses, more generally, are often used in clinical trials for dose finding, efficacy monitoring, toxicity monitoring, and for diagnosis/decision making 15 . For example, Bayesian analyses were used to simultaneously monitor toxicity and efficacy in a parallel phase I/II clinical trial design for combination therapies 16 .

Amendments from Version 1
We thank the reviewers for their thorough and thoughtful comments on the paper. We have addressed all these in the revised manuscript and followed the reviewers' suggestions for wording clarifications. In the Results section, we have added a supplementary table of the participant characteristics for those who responded to follow-up (available at OSF). In the Discussion, we have added a paragraph on the 'value proposition' of the Drink Less app in light of these findings and how the findings informed our decision on which app components to retain or remove. We have also discussed the limitation of the low follow-up rate and an explanation for the difference between the original and extended datasets.

DBCIs require novel methods of evaluation that are quick and timely to inform the optimisation of the intervention 17 . The multiphase optimisation strategy (MOST) is a method for building, optimising and evaluating multicomponent behavioural interventions. It involves a series of steps identifying the set of intervention components to be examined and evaluating the effects of these components 13,18 . Factorial trial designs allow the simultaneous evaluation of the intervention components, which enables both the independent and interactive effects to be estimated 13 . Using a factorial trial to evaluate a DBCI can overcome some of the challenges associated with using the traditional randomised controlled trial, such as prolonged duration from recruitment to publication and a high-cost trial implementation 19,20 . The results from a factorial trial can be used to make decisions about which components to retain when optimising the intervention 18 .
The Drink Less smartphone app is a DBCI aimed at supporting people who drink excessively to reduce their alcohol consumption. It was developed using evidence and theory, following MOST. The app was analysed in a full factorial trial to assess the effectiveness of its five intervention modules and their effects on app usage and subsequent usability ratings 21 . The stopping rule for data collection, in line with the frequentist approach to analysis, was pre-specified, although data collection continued under the same conditions as the original factorial trial. Analysis of the original trial data using Bayes factors indicated that the data were insensitive to detect main effects but that combinations of the modules appeared effective 1 .

Aims
The aims of this study are substantive and methodological:

Design
A between-subject full factorial (2^5) trial was conducted to evaluate the effectiveness of five intervention modules in the Drink Less app. The research questions were specified prior to the trial commencing, pre-registered on ISRCTN (registration number: ISRCTN40104069) and published in an open-access protocol paper 21 .

Participants
Participants were included in the study if they: were aged 18 or over; lived in the UK (only available on UK Apple app store and users had to select 'UK' for 'Country?'); had an AUDIT score of 8 or above (indicative of excessive drinking 22 ); were interested in reducing their drinking (indicated by the question 'why are you using this app?' with users choosing 'interested in drinking less' over 'just browsing'); provided an email address and had downloaded a 'trial version' of the app (described below).
The sample size for the original factorial trial was 672 providing 80% power (with alpha at 5%, 1:1 allocation and a two-tailed test) to detect a mean change in alcohol consumption of 5 units between the 'enhanced' and 'minimal' versions for each intervention module 23 , comparable with a face-to-face brief intervention 24 . This assumed a mean of 27 weekly units at follow-up in the control group, a mean of 22 units in the intervention group and a SD of 23 units for both (d=0.22).
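The power calculation above can be approximated with a short sketch. This is an illustration under stated assumptions, not the authors' calculation: it uses the standard normal-approximation formula for comparing two means, and the function name `n_per_group` is hypothetical. Only the inputs (27 versus 22 units, SD of 23, alpha of 5%, 80% power) come from the text.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-sided comparison of two means.

    Normal-approximation formula (an assumption; the source does not state
    which formula or software was used for the original calculation).
    """
    z = NormalDist()  # standard normal
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_power = z.inv_cdf(power)
    return 2 * ((z_alpha + z_power) * sd / delta) ** 2


# Inputs taken from the text: 27 vs 22 weekly units, SD of 23 in both groups.
cohens_d = (27 - 22) / 23
per_arm = n_per_group(delta=27 - 22, sd=23)

print(f"d = {cohens_d:.2f}")
print(f"approximate n per arm = {ceil(per_arm)}, total = {2 * ceil(per_arm)}")
```

Under this approximation the total comes out slightly below the reported 672; the source does not say how the final figure was arrived at, so the small difference should not be read as an error in either calculation.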
Recruitment was undertaken via promotion from organisations, such as Public Health England, Cancer Research UK, and listing the app in the iTunes Store according to best practices for app store optimisation.

Measures
Baseline measures included the AUDIT questionnaire and a socio-demographic assessment (age, gender, ethnic group, level of education, employment status and current smoking status). The primary outcome measure was self-reported change in past week alcohol consumption (the difference between one-month follow-up and baseline). Past week alcohol consumption was derived from the frequency (Q1) and quantity (Q2) questions of the AUDIT-Consumption (AUDIT-C) questionnaire. The secondary outcome measure was self-reported change in full AUDIT score; in addition to the three questions on consumption in the AUDIT-C, the full AUDIT includes questions assessing harmful alcohol use (e.g. alcohol-related injuries) and symptoms of dependence. Other secondary outcome measures included in the original, full factorial trial were usage data and usability ratings, though these were not considered in this paper. Details of these measures are described elsewhere 1 , and the data and Bayes factors calculated are reported on the Open Science Framework (https://osf.io/kqm8b/).

Interventions
The Drink Less app is a DBCI for people who drink excessively to help them reduce their alcohol consumption. It is freely available on the UK version of the Apple App Store for all smartphones and tablets running iOS 8 or above. The content of the app did not change during the trial except for minor bug fixes (to ensure compatibility with iOS 10).
The app is structured around goal setting: users can set their own goals based on units, cost, alcohol free days or calories with information on the UK drinking guidelines, units and alcohol-related harms. There are five intervention modules that aim to help them achieve their goal: Normative Feedback (providing normative feedback on the user's level of drinking relative to others); Cognitive Bias Re-training (a game to retrain approach-avoidance bias for alcoholic drinks); Self-monitoring and Feedback (providing a facility for self-monitoring of drinking and receipt of feedback); Action Planning (helping users to undertake action planning to avoid drinking), and Identity Change (promoting a change in identity in relation to alcohol). In the trial version of the app, the five intervention modules existed in two versions: i) an 'enhanced' version containing the predicted active ingredients and ii) a 'minimal' version that acted as a control.
A detailed description of the content, development and factorial trial evaluation of the app is reported in two separate papers 1,25 .

Procedures
Data collection for the factorial trial began on 18th May 2016 and the required sample of eligible users was reached on 10th July 2016; follow-up data were collected until 28th August 2016. Trial data were collected continuously for a further four months until 19th December 2016 under the same conditions as the original factorial trial (i.e. a 'trial version').
Informed consent to participate in the trial was obtained from all participants on first opening the app. Users who consented to participate completed the AUDIT and a socio-demographic questionnaire, indicated their reason for using the app and provided their email address for follow-up (a prize of £100 was offered in an attempt to decrease the proportion of users leaving this field blank). Users were then provided with their AUDIT score and, those who met the inclusion criteria, were randomised to one of 32 experimental conditions using an automated algorithm within the app for block randomisation.
Follow-up was conducted 28 days after participants downloaded the app and the questionnaire consisted of the full AUDIT and usability measures. Follow-up was conducted in two ways: i) via email with a link to the questionnaire in an online survey tool (Qualtrics), which also sent up to four reminders, and ii) within the app. Participants included according to the original trial and stopping rule were due to complete the follow-up questionnaire up until 29th August 2016 and were contacted via email (through Qualtrics) and the app. Participants due to complete the follow-up questionnaire from 30th August onwards were only contacted via the app.

Ethical approval
Ethical approval for Drink Less was granted by the UCL Ethics Committee under the 'optimisation and implementation of interventions to change health-related behaviours' project (CEHP/2013/508).

Analysis
All analyses were conducted using R version 3.4.0. The analysis plan for this paper was similar to that for the original factorial trial (which was pre-registered on 13th February 2016; ISRCTN40104069 21 ).
Participant characteristics were reported descriptively by intervention module. A factorial between-subjects design was used to assess the main and two-way interactive effects of the five intervention modules on the primary and secondary outcome measures. Analyses were conducted amongst responders only, that is, those who completed the follow-up questionnaire. Bayes factors were calculated for each analysis assessing the main and the two-way interaction effects of the five intervention modules on the outcome measures. The two-way interactions were defined as enhanced/enhanced versus minimal/minimal for each pair of intervention modules. The mean difference and standard error of the mean difference for each main and two-way interactive effect were calculated. A half-normal distribution was used to specify the predicted effect, with its peak at 0 (no effect) and an SD equal to the expected effect size. This is a conservative approach and represents a hypothesis that the intervention had at least some positive effect, with the effect being more likely to be smaller than larger. Bayes factors were calculated using an online calculator 7 .
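The calculation described above can be sketched numerically. This is a minimal illustration, not the code of the online calculator used in the study: it assumes the observed mean difference is normally distributed around the true effect with the reported standard error, places a half-normal prior (mode 0, SD equal to the expected effect size) on the effect under the alternative hypothesis, and integrates by the midpoint rule. The function name `bf_half_normal` and the input values are hypothetical, and positive differences are taken to be in the predicted direction.

```python
from statistics import NormalDist


def bf_half_normal(mean_diff, se, prior_sd, steps=20000):
    """Bayes factor (alternative vs null) with a half-normal prior on the effect.

    mean_diff : observed mean difference (positive = predicted direction)
    se        : standard error of the mean difference
    prior_sd  : SD of the half-normal prior, i.e. the expected effect size
    """
    noise = NormalDist(0, se)
    prior = NormalDist(0, prior_sd)

    # Likelihood of the observed difference if the true effect is exactly 0.
    like_h0 = noise.pdf(mean_diff)

    # Average likelihood under H1: integrate likelihood x prior over effects > 0.
    upper = max(8 * prior_sd, mean_diff + 8 * se)
    width = upper / steps
    like_h1 = 0.0
    for i in range(steps):
        theta = (i + 0.5) * width  # midpoint of the i-th slice
        half_normal_density = 2 * prior.pdf(theta)  # doubled: only theta > 0
        like_h1 += half_normal_density * noise.pdf(mean_diff - theta) * width

    return like_h1 / like_h0


# Hypothetical inputs: a 1-unit difference, SE of 2 units, expected effect of 5.
print(bf_half_normal(mean_diff=1.0, se=2.0, prior_sd=5.0))
```

Because the prior places most of its mass near zero, an observed difference close to zero with a small standard error pulls the Bayes factor below 1, while a clearly positive difference pushes it above 1.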
The expected effect size for the primary calculation of Bayes factors was a reduction of 5 units per week (d=0.22), reflecting a large effect and matching the effect size used in the power calculation for the original factorial trial. Bayes factors were also calculated for a medium effect (reduction of 3 units per week) and a small effect (reduction of 0.5 units per week) to permit a relative judgment for screening purposes. The expected effect size for the secondary outcome measure was calculated by translating the estimated effect size for the primary outcome measure (d=0.22) into the equivalent mean difference score of 1.45 (mean=19.1, SD=6.56 [based on original trial users, n=672]). Bayes factors were interpreted in terms of categories of evidential strength (see Table 1) 5,26 .
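The interpretation rule can be written as a small helper. This is a sketch using the cut-offs stated in the abstract (BF<0.33 as evidence for the null hypothesis; 0.33<BF<3 as insensitive data); treating BF>3 as evidence for the alternative hypothesis is the conventional counterpart and is an assumption here, as is the function name `interpret_bf`. Table 1 may define finer categories that are not reproduced.

```python
def interpret_bf(bf):
    """Map a Bayes factor onto the broad evidence categories used in this paper."""
    if bf < 0.33:
        return "evidence for the null hypothesis"
    if bf > 3:
        # Assumed conventional threshold; not spelled out in the text shown here.
        return "evidence for the alternative hypothesis"
    return "data insensitive"


# Bayes factors reported in this paper.
for label, bf in [
    ("Identity Change, 5-unit effect on consumption", 0.22),
    ("Four enhanced modules vs none, responders only", 1.42),
]:
    print(f"{label}: BF={bf} -> {interpret_bf(bf)}")
```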

Study sample
The total sample size was 2586; of these, 1914 (74.0%) were additional users to the original factorial trial (672, 26.0%). In total, 342 users (13.2%) completed the primary outcome measure in the follow-up questionnaire: the original users' response rate was 26.6% and the additional users' response rate was 8.5%. Figure 1 shows a flow chart of users throughout the study.
Socio-demographic and drinking characteristics of participants are reported in Table 2. Participants' mean age was 37.2 years, 53.4% were women, 95.8% were white, 74.3% had post-16 qualifications, 87.0% were employed, and 30.0% were current smokers. Mean weekly alcohol consumption was 39.0 units, mean AUDIT-C score was 9.3, and mean AUDIT score was 19.1, indicating harmful drinking. Participants' characteristics by intervention module are reported in Table 2. Generally, characteristics were similar for the enhanced and minimal version of each intervention module. The characteristics of participants who responded to the follow-up questionnaire (n=342) are reported in Supplementary Table 1.

Change in past week's alcohol consumption
The main effects of the intervention modules are reported in Table 3 for the change in past week's alcohol consumption. Bayes factors showed that the data were insensitive to detect an effect for Normative Feedback for effect sizes of 5-, 3- and 0.5-unit reductions (0.47<BF<0.97). Data were insensitive to detect an effect for Cognitive Bias Re-training for effect sizes of 5-, 3- and 0.5-unit reductions (0.74<BF<1.06). Bayes factors showed that the data were insensitive to detect an effect for Self-monitoring and Feedback for effect sizes of 5-, 3- and 0.5-unit reductions (0.43<BF<0.95). Bayes factors showed that the data were insensitive to detect an effect for Action Planning for effect sizes of 5-, 3- and 0.5-unit reductions (0.83<BF<1.08). Bayes factors for Identity Change showed support for the null hypothesis of no difference between the enhanced and minimal version of the module for a 5-unit reduction (BF=0.22), though data were insensitive to detect an effect for 3- and 0.5-unit reductions (0.34<BF<0.81). The data were insensitive to detect a two-way interactive effect between any pair of intervention modules for effect sizes of 5-, 3- or 0.5-unit reductions (0.35<BF<1.22), except for between Self-monitoring and Feedback and Identity Change for a 5-unit reduction, which supported the null hypothesis (BF=0.31) (see Extended data, Supplementary Table 2 27 ).

Change in AUDIT score
The main effects of the intervention modules are reported in Table 4 for the change in AUDIT score. The data were insensitive to detect an effect on change in AUDIT score for: Normative Feedback (BF=0.60); Cognitive Bias Re-training (BF=0.98); and Action Planning (BF=0.95). The data supported evidence for the null hypothesis of no difference in AUDIT score between enhanced and minimal versions of Self-monitoring and Feedback (BF=0.15) and Identity Change (BF=0.14). The two-way interactive effects of intervention modules on change in AUDIT score (see Extended data, Supplementary have some evidence in support of their role of reducing alcohol consumption. Therefore, an additional exploratory analysis was conducted to assess whether there is a larger cumulative effect of the combination of all four modules in the enhanced version compared with the minimal version. This was done for responders only (n=39; 12 "off" vs 27 "on") and for last observation carried forward (n=324; 164 "off" vs 160 "on") to provide potential evidence for what effect size we can expect when planning a definitive trial with longer-term follow-up. Last observation carried forward means that participants' past week alcohol consumption at follow-up was used for all of those who responded to follow-up and the baseline measure for past week alcohol consumption was used for those who did not respond to follow-up. Whilst last observation carried forward has its limitations, it maintains the variability within the data. Table 5 reports the Bayes factors for these analyses. There was a large numerical difference between all enhanced and all minimal for the four modules amongst responders only, although the Bayes factors found that the data were insensitive to detect an effect, which may be due in part to the small sample size.
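The last-observation-carried-forward rule described above can be illustrated with a short sketch. The records and the function name `change_with_locf` are hypothetical; the only element taken from the text is the rule itself, namely that non-responders are assigned their baseline consumption at follow-up and so contribute a change of zero.

```python
from statistics import mean


def change_with_locf(baseline_units, followup_units):
    """Change in past week consumption; non-responders (None) keep their baseline."""
    observed = followup_units if followup_units is not None else baseline_units
    return observed - baseline_units


# Hypothetical participants: (baseline units, follow-up units or None if no response).
records = [(40, 25), (30, None), (50, 45), (35, None)]

locf_changes = [change_with_locf(b, f) for b, f in records]
responder_changes = [f - b for b, f in records if f is not None]

print("LOCF mean change:", mean(locf_changes))
print("Responders-only mean change:", mean(responder_changes))
```

In this toy example the carried-forward zeros halve the mean change relative to the responders-only figure, which shows how strongly the assumption can dominate when most of the sample did not respond.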

Discussion
The calculation of Bayes factors for additional data collected beyond the original factorial trial of Drink Less has allowed us to accumulate and update existing evidence on the effectiveness of its intervention components in reducing alcohol consumption. The supplemented data remained insensitive to detect whether the Drink Less app components have large (5-unit) individual or two-way interactive effects on reducing alcohol consumption, though they tended towards anecdotal evidence for the null hypothesis of no effect. There was evidence of two-way interactive effects in the original factorial trial that is no longer supported by the supplemented data. The current data also remained insensitive to detect whether the four most promising components (Normative Feedback, Cognitive Bias Re-Training, Self-Monitoring and Feedback, and Action Planning) may each have effects smaller than 5 units. An unplanned analysis provided weak anecdotal evidence of a synergistic effect of the 'enhanced' versions of these four intervention modules together. On both past week alcohol consumption and AUDIT score, and across several alternative effect sizes, there was support for no effect of the fifth intervention module, Identity Change. These findings, alongside results from analysing user feedback and usage data on the most frequently visited screens, guided the decision to remove the Identity Change module from the next major app update whilst retaining Normative Feedback, Cognitive Bias Re-Training, Self-Monitoring and Feedback, and Action Planning.
Whilst this study did not find evidence of a large individual effect of any of the intervention modules, there remains some evidence to suggest that an optimised version of the app (with the removal of the Identity Change module) may yet prove effective. As with the original factorial trial, there are concerns that the minimal versions were too active in an attempt to promote engagement amongst all participants. Even participants who were randomised to receive the minimal versions of every intervention module were able to set goals and track their drinks, which is associated with reduced consumption 28 . Most alcohol reduction apps include few techniques to change behaviour 29 suggesting that even the minimal version of Drink Less was more active than most existing alcohol reduction apps. Therefore, effectiveness estimates derived from this approach are likely to be conservative. Furthermore, Drink Less users have excellent levels of engagement with the app 30 , which is necessary (but not sufficient) for an intervention to be effective. Additionally, a content analysis of user feedback (available as a short report here: https://osf.io/d3w8r/) found that of the 'Information giving' category, the majority provided positive feedback on the app as a whole. A sample of the user feedback is available to view on the Drink Less website 31 . Drink Less is also one of the leading alcohol reduction apps in the UK with over 50,000 unique users and an average 4.1-star rating (as of June 2019).
A major strength of this study is its illustration of how it is possible to evaluate data from trials of DBCIs in an on-going manner. No additional resources were required to continue data collection within the original trial of Drink Less as the app remained freely available on the UK Apple app store and the notification to complete the follow-up questionnaire had already been programmed. Analysing the supplemented dataset has allowed us to update our findings and provided more confidence in our original decisions on which components to retain or remove as part of the process of optimising the intervention 18 to improve its effectiveness and usability. We are also much clearer that any definitive trial must be powered to detect small effects and designed to inform a pragmatic decision about whether to invest resources in recommending the app. The optimisation of the Drink Less intervention was based on the findings from this study as well as on user feedback and findings from a meta-analysis of the intervention components in digital alcohol interventions associated with effectiveness 32 . The findings from this study informed the removal of the 'Identity Change' module and retention of the remaining four modules. The stopping rule in frequentist statistics means that additional trial data collected as part of an effectiveness trial for a DBCI would go to waste. The use of Bayes factors in this situation prevents unnecessary waste of resources and enables researchers to continually update their evidence on a DBCI rather than collect and analyse individual data sets as part of separate trials.
A limitation of this study and the use of Bayes factors was that we were not able to use the intention-to-treat (ITT) approach in the analysis (as was done for the original trial), whereby those lost to follow-up (non-responders) were assumed to be drinking at baseline levels. Whilst Bayes factors can overcome a lot of the issues with the frequentist approach, they are not meaningful when assumptions are made that limit the variability in the data. Due to low overall follow-up rates (13.2%) in this larger sample, the ITT assumption that there was no change in the large majority of the sample drives the variability down, which in turn drives support for the null hypothesis. This highlights that Bayes factors were not useful in this study when using the ITT assumption, which limits the variability in the data.
We acknowledge that the follow-up rate is very low; this is likely to be due to the lack of financial incentives for completing the follow-up survey, as such incentives are known to increase response rates in randomised trials 33 . Furthermore, the follow-up rate in the extended dataset was lower than for the original trial dataset; this is likely because participants were only contacted via the app for the extended dataset whilst the participants in the original dataset were also contacted via email.
The intervention modules of the Drink Less app do not have a large individual effect on reducing alcohol-related outcomes, though they may have a small effect that the current data were unable to detect. There is weak evidence for a synergistic effect of the 'enhanced' versions of four intervention modules together: Normative Feedback, Cognitive Bias Re-Training, Self-Monitoring and Feedback, and Action Planning. This study has updated the existing evidence on the effectiveness of intervention modules in the Drink Less app. In the event of uncertain results following a primary analysis, Bayes factors can be used to 'top-up' results from DBCI trials with any additional data collected, therefore supporting efficient, incremental model building to inform decision-making.
This paper reports the results of a factorial experiment evaluating the Drink Less smartphone app and its behaviour change components, with the benefit of an extended data-set. The aims of the study were both substantive and methodological. In relation to the latter, the paper provides a very useful example of the advantages of Bayesian hypothesis testing and in particular its legitimate provision for extending data collection until a firm conclusion has been reached. The paper also provides a good example, with earlier papers, of the MOST method for developing and evaluating behaviour change interventions.

Data availability
Regarding the substantive aim, I have two major and several minor comments.

MAJOR:
Analysis of the full data-set after extension confirmed the mainly inconclusive results of tests of the 5 individual components of the app in the original evaluation study and failed to confirm the previously reported evidence in favour of two interaction effects between components. In view of the authors' thorough and painstaking development of the app over the years, their rigorous evaluations of the components of the app and their stated intention to optimise the app for a definitive test in an RCT with extended follow-up, these results must be regarded as very disappointing. Yet this is not commented on and no possible explanation is offered why components whose inclusion was supported strongly by theory and previous research failed to show effects on drinking behaviour. Is it possible, as the authors have previously suggested, that the 'minimal', control interventions were too active to allow an effect to emerge? What other explanations are there for these disappointing results? Overall, where do the authors go from here in the attempt to bring this very promising intervention technology to practical use?
In the Abstract the authors state: 'There was weak evidence for a synergistic effect of four components'. I feel even this is too strong. In the text on p.9 we have: An unplanned analysis 'provided weak anecdotal evidence of a synergistic effect of the 'enhanced' versions of these four intervention modules together'. Something along these lines, with the inclusion of 'unplanned' and 'anecdotal', would be more appropriate for the Abstract.
Another concern here is that the putative effect in question derives from a post hoc hypothesis based on unplanned comparisons arrived at only after the extension of data collection. In a frequentist approach to hypothesis testing, this would not of course be permissible but, even in the Bayesian approach, there must surely be some constraints on the legitimacy of testing post hoc hypotheses derived from exploratory analyses (rather than using such analyses to generate hypotheses not tested in the present data). The authors should comment on this issue and, if necessary, seek expert advice.

MINOR:
Why was the follow-up rate so much lower in the extended data-set than in the original data (8.5% versus 26.6%)? Can the authors attempt to explain this difference?
p.8: '... to provide potential evidence for what effect size we can expect when planning the trial.' This is presumably the definitive trial with extended follow-up but this is not clear.
What about the data on secondary outcomes? These are not reported here but it is not stated that they will not be considered in this paper and the reader is not told where they will be found.
The primary outcome measure of self-reported change in past week alcohol consumption was presumably based on the AUDIT-C questionnaire, as suggested by Table 2, but this is not made clear in the text.

Are sufficient details provided to allow replication of the method development and its use by others? Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Partly

Competing Interests: No competing interests were disclosed.
Reviewer Expertise: I have long experience in the area of research on brief interventions for hazardous and harmful alcohol consumption. However, although I have authored a publication using the Bayesian approach to hypothesis testing, I am by no means an expert on the use of Bayesian statistics.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Author Response 04 Jul 2019
Claire Garnett, University College London, London, UK
Analysis of the full data-set after extension confirmed the mainly inconclusive results of tests of the 5 individual components of the app in the original evaluation study and failed to confirm the previously reported evidence in favour of two interaction effects between components. In view of the authors' thorough and painstaking development of the app over the years, their rigorous evaluations of the components of the app and their stated intention to optimise the app for a definitive test in an RCT with extended follow-up, these results must be regarded as very disappointing. Yet this is not commented on and no possible explanation is offered why components whose inclusion was supported strongly by theory and previous research failed to show effects on drinking behaviour. Is it possible, as the authors have previously suggested, that the 'minimal', control interventions were too active to allow an effect to emerge? What other explanations are there for these disappointing results? Overall, where do the authors go from here in the attempt to bring this very promising intervention technology to practical use?
This is a very good point made by both reviewers that we do not sufficiently discuss the value proposition of Drink Less in our discussion section. We have added a paragraph on this: "Whilst this study did not find evidence of a large individual effect of any of the intervention modules, there remains some evidence to suggest that an optimised version of the app (with the removal of the 'Identity Change' module) may yet prove effective. As with the original factorial trial, there are concerns that the minimal versions were too active in an attempt to promote engagement amongst all participants. Even participants who were randomised to receive the minimal versions of every intervention module were able to set goals and track their drinks, which is associated with reduced consumption [1]. Most alcohol reduction apps include few techniques to change behaviour [2], suggesting that even the minimal version of Drink Less was more active than most existing alcohol reduction apps. Therefore, effectiveness estimates derived from this approach are likely to be conservative. Furthermore, Drink Less users have excellent levels of engagement with the app [3], which is necessary (but not sufficient) for an intervention to be effective. Additionally, a content analysis of user feedback (available as a short report here: https://osf.io/d3w8r/) found that of the 'Information giving' category, the majority provided positive feedback on the app as a whole. A sample of the user feedback is available to view on the Drink Less website [4]. Drink Less is also one of the leading alcohol reduction apps in the UK with over 50,000 unique users to date with an average 4.1-star rating (as of June 2019)."
In the Abstract the authors state: 'There was weak evidence for a synergistic effect of four components'. I feel even this is too strong. In the text on p.9 we have: An unplanned analysis 'provided weak anecdotal evidence of a synergistic effect of the 'enhanced' versions of these four intervention modules together'. Something along these lines, with the inclusion of 'unplanned' and 'anecdotal', would be more appropriate for the Abstract.
We have re-worded the abstract to be cautious with our conclusions: "In an additional exploratory analysis, participants receiving four of the components averaged a numerically greater reduction in consumption than those not receiving any (21.6 versus 12.1 units), but the data were insensitive (BF=1.42). Data from extended recruitment in a factorial experiment evaluating components of the Drink Less app remained insensitive but tended towards individual and pairs of components not having a large effect. In an exploratory analysis, there was weak anecdotal evidence for a synergistic effect of four components. In the event of uncertain results, calculating BFs can be used to update the strength of evidence of a dataset supplemented with extended recruitment."
Another concern here is that the putative effect in question derives from a post hoc hypothesis based on unplanned comparisons arrived at only after the extension of data collection. In a frequentist approach to hypothesis testing, this would not of course be permissible but, even in the Bayesian approach, there must surely be some constraints on the legitimacy of testing post hoc hypotheses derived from exploratory analyses (rather than using such analyses to generate hypotheses not tested in the present data). The authors should comment on this issue and, if necessary, seek expert advice.
These analyses were exploratory rather than testing a post-hoc hypothesis and the Bayesian approach does not have the same limitations as the frequentist approach in conducting exploratory analyses. Also, the conclusions match the strength of the evidence for this exploratory analysis (weak and anecdotal). We plan to pre-register any hypotheses and analysis plans in future studies.
Furthermore, a key part of this study was for it to be informative in making decisions on which components to retain in an optimised version of the app and in terms of conducting a power analysis for a definitive trial.
Why was the follow-up rate so much lower in the extended data-set than in the original data (8.5% versus 26.6%)? Can the authors attempt to explain this difference?
We have included a discussion of this in the limitations: "We acknowledge that the follow-up rate is very low and this is likely to be due to the lack of financial incentive for completing the follow-up survey, which are known to increase response rates in randomised trials (Brueton et al., 2014). Furthermore, the follow-up rate in the extended dataset was lower than for the original trial dataset; this is likely because participants were only contacted via the app for the extended dataset whilst the participants in the original dataset were also contacted via email."
p.8: '... to provide potential evidence for what effect size we can expect when planning the trial.' This is presumably the definitive trial with extended follow-up but this is not clear.
We have clarified this: "…to provide potential evidence for what effect size we can expect when planning a definitive trial with longer-term follow-up."
What about the data on secondary outcomes? These are not reported here but it is not stated that they will not be considered in this paper and the reader is not told where they will be found.
The data on the other secondary outcomes (usage data and usability ratings) were not reported in this paper though we have added information on where they will be found: "Other secondary outcome measures included in the original, full factorial trial were usage data and usability ratings though were not considered in this paper. Details of these measures are described elsewhere [1], and the data and Bayes Factors calculated are reported on the Open Science Framework (https://osf.io/kqm8b/)."
The primary outcome measure of self-reported change in past week alcohol consumption was presumably based on the AUDIT-C questionnaire, as suggested by Table 2, but this is not made clear in the text.
We have clarified this in the measures section: "The primary outcome measure was self-reported change in past week alcohol consumption (the difference between one-month follow-up and baseline). Past week alcohol consumption was derived from the frequency (Q1) and quantity (Q2) questions of the AUDIT-Consumption (AUDIT-C) questionnaire."

Results: It would be of interest to discuss further the difference in AUDIT and AUDIT-C score and the role the final two questions (risk taking etc) play;
Discussion: "no additional resources were required" - is this the case, was the app provisioned for longer than anticipated?;
Discussion: "our decision on which components to retain or remove" - a bit more discussion round this aspect would be helpful to the reader;
Discussion: A 13.2% follow-up rate appears to be very low, do the authors have any reasons for this?

Is the rationale for developing the new method (or application) clearly explained? Partly

Are sufficient details provided to allow replication of the method development and its use by others? Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Partly

Competing Interests: Researcher on the InDEx app project - an app designed to help armed forces personnel monitor their alcohol consumption
Reviewer Expertise: Mobile health with a focus on alcohol misuse

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
there are concerns that the minimal versions were too active in an attempt to promote engagement amongst all participants. Even participants who were randomised to receive the minimal versions of every intervention module were able to set goals and track their drinks, which is associated with reduced consumption [1]. Most alcohol reduction apps include few techniques to change behaviour [2], suggesting that even the minimal version of Drink Less was more active than most existing alcohol reduction apps. Therefore, effectiveness estimates derived from this approach are likely to be conservative. Furthermore, Drink Less users have excellent levels of engagement with the app [3], which is necessary (but not sufficient) for an intervention to be effective. Additionally, a content analysis of user feedback (available as a short report here: https://osf.io/d3w8r/) found that of the 'Information giving' category, the majority provided positive feedback on the app as a whole. A sample of the user feedback is available to view on the Drink Less website [4]. Drink Less is also one of the leading alcohol reduction apps in the UK with over 50,000 unique users to date with an average 4.1-star rating (as of June 2019)."
Abstract: "Amongst responders only" - is that the sample who took part in the follow-up questionnaire?;
Yes, this has been clarified in the abstract with "…amongst responders only (those who completed the questionnaire)."
Abstract: "Unplanned comparison" appears to convey a negative connotation the authors could alter to "additional analyses";
We have reworded this to "additional exploratory analysis" throughout the manuscript.
Abstract: "four most promising" could be misleading as you only had five components but we also have to be mindful that the data was insensitive;
We have re-worded the abstract to be cautious with our conclusions: "In an additional exploratory analysis, participants receiving four of the components averaged a numerically greater reduction in consumption than those not receiving any (21.6 versus 12.1 units), but the data were insensitive (BF=1.42)."
Methods - Participants: "were interested in reducing their drinking" how was this measured? Were participants research aware, or were they targeted because they had previously stated an interest in reducing alcohol? Or could it be by downloading DrinkLess they were assumed to be interested in reducing their alcohol consumption?;
Methods - Participants: Was a geolocation restriction placed on participants? How can you be sure that users were from the UK?;
We have clarified details on the participants in the methods section: "Participants were included in the study if they: were aged 18 or over; lived in the UK (only available on UK Apple app store and users had to select 'UK' for 'Country?'); had an AUDIT score of 8 or above (indicative of excessive drinking); were interested in reducing their drinking (indicated by the question 'why are you using this app?' with users choosing 'interested in drinking less' over 'just browsing'); provided an email address and had downloaded a 'trial version' of the app (described below)."
Methods - Intervention: What were the minor bug fixes, is a summary able to be provided as a supplement?;
As the bug fixes are minor, we have included a brief explanation in the manuscript: "The content of the app did not change during the trial except for minor bug fixes (to ensure compatibility with iOS 10)."
Results: Results present AUDIT-C score, however this is not discussed previously.
We have clarified this in the measures section: "The primary outcome measure was self-reported change in past week alcohol consumption (the difference between one-month follow-up and baseline). Past week alcohol consumption was derived from the frequency (Q1) and quantity (Q2) questions of the AUDIT-Consumption (AUDIT-C) questionnaire."
Results: It would be helpful to have Table 2 represented as supplementary material for those who took part in follow-up;
We have added a supplementary table of the participant characteristics for those who responded to follow-up.
Results: It would be of interest to discuss further the difference in AUDIT and AUDIT-C score and the role the final two questions (risk taking etc) play;
We have discussed the differences between the AUDIT-C and AUDIT scores in the methods section: "The secondary outcome measure was self-reported change in full AUDIT score; in addition to the three questions on consumption in the AUDIT-C, the full AUDIT includes questions assessing harmful alcohol use (e.g. alcohol-related injuries) and symptoms of dependence."
Discussion: "no additional resources were required" - is this the case, was the app provisioned for longer than anticipated?;
The app was always intended to be available for the long-term (i.e. not removing it after completion of the trial). We have added more explanation to this section of the discussion: "No additional resources were required to continue data collection within the original trial of Drink Less as the app remained freely available on the UK Apple app store and the notification to complete the follow-up questionnaire had already been programmed."
Discussion: "our decision on which components to retain or remove" - a bit more discussion round this aspect would be helpful to the reader;
We have elaborated on this in the discussion: "Analysing the supplemented dataset has allowed us to update our findings and provided more confidence in our original decisions on which components to retain or remove as part of the process of optimising the intervention to improve its effectiveness and usability. We are also much clearer that any definitive trial must be powered to detect small effects and designed to inform a pragmatic decision about whether to invest resources in recommending the app.
The optimisation of the Drink Less intervention was based on the findings from this study as well as on user feedback and findings from a meta-analysis of the intervention components in digital alcohol interventions associated with effectiveness [9]. The findings from this study informed the removal of the 'Identity Change' module and retention of the remaining four modules."
Discussion: A 13.2% follow-up rate appears to be very low, do the authors have any reasons for this?;
We have included a discussion of this in the limitations: "We acknowledge that the follow-up rate is very low and this is likely to be due to the lack of financial incentive for completing the follow-up survey, which are known to increase response rates in randomised trials (Brueton et al., 2014). Furthermore, the follow-up rate in the extended dataset was lower than for the original trial dataset; this is likely because participants were only contacted via the app for the extended dataset whilst the participants in the original dataset were also contacted via email."

Competing Interests: No competing interests were disclosed.