Improving open and rigorous science: ten key future research opportunities related to rigor, reproducibility, and transparency in scientific research

Background: As part of a coordinated effort to expand research activity around rigor, reproducibility, and transparency (RRT) across scientific disciplines, a team of investigators at the Indiana University School of Public Health-Bloomington hosted a workshop in October 2019 with international leaders to discuss key opportunities for RRT research. Objective: The workshop aimed to identify research priorities and opportunities related to RRT. Design: Over two days, workshop attendees gave presentations and participated in three working groups: (1) Improving Education & Training in RRT, (2) Reducing Statistical Errors and Increasing Analytic Transparency, and (3) Looking Outward: Increasing Truthfulness and Accuracy of Research Communications. Following small-group discussions, the working groups presented their findings, and participants discussed the research opportunities identified. The investigators compiled a list of research priorities, which was circulated to all participants for feedback. Results: Participants identified the following priority research questions: (1) Can RRT-focused statistics and mathematical modeling courses improve statistics practice?; (2) Can specialized training in scientific writing improve transparency?; (3) Does modality (e.g., face-to-face, online) affect the efficacy of RRT-related education?; (4) How can automated programs help identify errors more efficiently?; (5) What is the prevalence and impact of errors in scientific publications (e.g., analytic inconsistencies, statistical errors, and other objective errors)?; (6) Do error prevention workflows reduce errors?; (7) How do we encourage post-publication error correction?; (8) How does 'spin' in research communication affect stakeholder understanding and use of research evidence?; (9) Do tools to aid the writing of research reports increase their comprehensiveness and clarity?; and (10) Is it possible to inculcate scientific values and norms related to truthful, rigorous, accurate, and comprehensive scientific reporting? Conclusion: Participants identified important and relatively unexplored questions related to improving RRT. This list may be useful to the scientific community and investigators seeking to advance meta-science (i.e., research on research).


Introduction
Rigor, reproducibility, and transparency (RRT) are scientific cornerstones that promote truthful, accurate, and objective science (McNutt, 2014). In the context of scientific research, rigor is defined as a thorough, careful approach that enhances the veracity of findings (Casadevall & Fang, 2012). There are several types of reproducibility, which include the ability to evaluate and follow the same procedures as previous studies, obtain comparable results, and draw similar inferences (Goodman et al., 2016; National Academies of Sciences, 2019). Transparency is a process by which methodology, experimental design, coding, and data analysis tools are reported clearly and openly shared (Nosek et al., 2015; Prager et al., 2019). Together, these scientific norms represent the best means of obtaining objective knowledge of the world (Anderson et al., 2010; Allison et al., 2016). The science concerning these norms is a specific branch of meta-science, or "research on research", led by scientists who promote these values by educating early-career scientists, identifying areas of concern for scientific validity, and postulating paths toward stronger, more credible science (Ioannidis et al., 2015).
Several factors compete with the pursuit of rigorous, reproducible, and transparent research. For example, the rate of scientific publication has risen dramatically in the last two decades. Although this is indicative of many important scientific breakthroughs (Van Noorden, 2014), the rate of manuscript retractions due to either researcher error or malfeasance has also increased (Steen et al., 2013). In one survey, between 40% and 70% of scientists agreed that factors including fraud, selective reporting, and pressure to publish contribute to the irreproducibility of scientific findings (Fanelli, 2018). These concerns also have the potential to decrease public trust in science, although research on this question is needed (National Academies of Sciences, 2017).
Basic and applied science are undermined when scientists fail to uphold high standards of conduct (Prager et al., 2019). Given that many authors have identified issues or concerns in science, the emerging challenge for scholars in this area is to find workable solutions to improve RRT, rather than simply continuing to illustrate problems related to RRT (Allen & Mehler, 2019). To this end, in October 2019, Indiana University School of Public Health-Bloomington hosted a multidisciplinary meeting of leading scholars to discuss ongoing RRT-related challenges. The purpose of the meeting, which was funded by the Alfred P. Sloan Foundation, was to identify new opportunities to advance sound scientific practice, from the early stages of planning a study through execution and the communication of findings. This paper presents findings from that meeting.

Methods
The meeting was structured around three areas: (1) improving education & training in RRT; (2) reducing statistical errors and increasing analytic transparency; and (3) looking outward: increasing truthfulness and accuracy of research communications.

Participants
We invited participants based on prior contributions to RRT research. Participants included representatives from several leading organizations, as well as Indiana University (IU) faculty, staff, and graduate students (Table 1). For their participation in the meeting, invited guests who were not federal employees or IU employees received a $1,000 honorarium.

Meeting format
The two-day meeting comprised nine prepared research talks, moderated panel discussions, and small-group, open-forum style sessions related to each of the three previously stated areas.
Day one. On the first day, participants presented 10-12 minute research talks, each of which was followed by a moderated question-and-answer period. Participants discussed questions pertaining to RRT and sought to identify emerging areas of research, including novel approaches, testable outcomes, and potential limitations. During the afternoon session, participants were divided into three small groups to discuss potential research opportunities, each moderated by an IU faculty representative charged with compiling notes for record keeping and dissemination.
Day two. On the second day, one representative from each group summarized major points through a brief presentation, which was followed by a question-and-answer session with all participants. This dialogue was intended to clarify ideas raised and to identify fundable research opportunities. The meeting concluded with a call to action by the Dean of the School of Public Health-Bloomington and Co-Principal Investigator of the project (DA) to continue promoting interdisciplinary RRT science.

Subgroup 1: improving education & training in RRT
We asked the first subgroup to discuss research opportunities related to implementing and testing RRT-guided academic curricula. The group identified elements of current undergraduate and graduate education that contribute to problematic data practices, including possible underlying causes and potential solutions (see Table 2). Three primary education-related questions guided the discussion: (1) Can RRT-focused statistics and mathematical modeling courses improve statistical practice? (2) Can specialized training in scientific writing improve transparency? (3) Does modality affect the efficacy of RRT-related education? With respect to each question, the existing and entrenched practices, feasibility of change, and proper audience for interventions were discussed.

Table 2. Key challenges¹ associated with the research questions discussed by each subgroup.

Subgroup 1: improving education & training in RRT

2. Can specialized training in scientific writing improve transparency?
   2. There are currently limited existing graduate-level curricula that pertain exclusively to writing.

3. Does modality affect the efficacy of RRT-related education?
   1. Feasibility concerns including cost, time, and other additional resources needed to facilitate an intervention.
   2. Examining heterogeneity requires large and diverse populations, and is practically difficult.

Subgroup 2: reducing statistical errors and increasing analytic transparency

4. Can automation help identify errors more efficiently?
   1. Automation may be technically possible for only certain types of errors.
   2. New programs intended to automate error correction require a certain level of computer programming expertise.

5. What is the prevalence and impact of errors?
   1. It would be difficult to generalize the prevalence of errors, because many common errors have field-specific names.
   2. Assessing the impact of errors is largely subjective, unless strict guidelines are agreed upon and adopted.

6. Do error prevention workflows reduce errors?
   1. It would be difficult to determine if workflows are entirely responsible for reduced error and improved research practice.
   2. It may be challenging to identify generalizable workflows that logically function across disciplines.

7. How do we encourage post-publication error correction?
   1. It would be difficult to implement standard post-publication error correction guidelines that function effectively across disciplines.
   2. There is a hesitancy to embrace error correction as a normal component of the editorial process.

¹ We present here only two of the most salient challenges.

Can RRT-focused statistics and mathematical modeling courses improve statistical practice?
Incorrect analyses are some of the most common, preventable errors in science (Resnik, 2012). Scholars attribute mistakes to gaps in statistics education (Thompson, 2006).

Can specialized training in scientific writing improve transparency?
Participants identified several RRT-specific writing principles and discussed how a deeper understanding of the extent to which writing and research are intertwined may increase transparency. Examples included learning about methodological reporting guidelines, writing compelling post-publication peer reviews, and other transparent writing practices. The group also discussed how courses could be developed or redesigned specifically to center on RRT principles. One theme of the discussion was the need for rigorous testing of student learning outcomes associated with novel writing content. However, a primary concern was the identification of appropriate outcome measures for writing-specific interventions (Barnes et al., 2015), given the subjective and nebulous nature of constructs like writing quality, individual improvement, and writing-related self-efficacy.

Does modality affect the efficacy of RRT-related education?
Another research opportunity discussed by the subgroup related to instructional modality, which refers to the manner in which a curriculum or intervention is experienced by the learner. Among the questions raised: which modality is most effective, and among which audiences?
In the context of previously discussed coursework in statistics and writing, participants explored the strengths and weaknesses of various modalities and how interventions could be conducted to test them empirically. There are logistical considerations, such as cost, space, and faculty time, that further complicate the feasibility of these interventions. For example, a face-to-face intervention may offer more tailored instruction to individual learners, while an online intervention may better deliver content to a wider audience. Thus, the subgroup identified several areas for future research, including comparisons of student learning across modalities, strategies for scaling educational content to institutional constraints, and the moderating effects of learner demographics on intervention efficacy (see the sketch below).
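To make the feasibility concern concrete, below is a minimal simulation sketch of the kind of study the subgroup envisioned: a two-arm trial comparing learning outcomes between face-to-face and online delivery. The design, effect sizes, and sample sizes are hypothetical assumptions for illustration, not parameters proposed at the workshop.

```python
# Illustrative only: estimating the power of a hypothetical two-arm trial
# comparing RRT instruction delivered face-to-face versus online.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

def simulated_power(n_per_arm, effect_size_d, n_sims=2000, alpha=0.05):
    """Estimate power to detect a standardized mean difference (Cohen's d)."""
    hits = 0
    for _ in range(n_sims):
        online = rng.normal(0.0, 1.0, n_per_arm)                  # reference arm
        face_to_face = rng.normal(effect_size_d, 1.0, n_per_arm)  # shifted by d
        _, p = stats.ttest_ind(face_to_face, online)
        hits += p < alpha
    return hits / n_sims

# A modest main effect (d = 0.2) versus a subgroup (heterogeneity) effect half
# that size (d = 0.1): detecting the latter demands roughly 4x the sample.
for d in (0.2, 0.1):
    for n in (100, 400, 1600):
        print(f"d = {d}, n/arm = {n}: power = {simulated_power(n, d):.2f}")
```

The steep sample-size requirements for the smaller, heterogeneity-sized effect illustrate why examining moderation by learner demographics is practically difficult.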
Subgroup 2: reducing statistical errors and increasing analytic transparency
This subgroup discussed research opportunities tied to the following questions: (4) Can automation help identify errors more efficiently?; (5) What is the prevalence and impact of errors within disciplines?; (6) Do standardized procedures (i.e., workflows) prevent errors?; and (7) How do we encourage post-publication error correction?
The costs and benefits associated with each question were also discussed (see Table 2).

Can automation help identify errors more efficiently?
An increase in automation (i.e., producing more user-friendly tools and algorithms) has the potential for surveilling the prevalence, prevention, and correction of errors. However, more work is needed to determine the most efficient use of such tools, including their collective abilities to detect field-specific issues that require subject matter expertise (Lakens & Debruine, 2020). For example, the automatic recomputation of some p-values is possible using the program 'Statcheck', but only for articles that follow the American Psychological Association (APA) reporting format. The subgroup discussed opportunities to define, and possibly automate, diagnostic checklists, advanced natural language processing, or other computational informatics approaches that would facilitate the detection of these errors. These novel automated measures could be tested empirically for effectiveness.

What is the prevalence and impact of errors?
To achieve the goal of error reduction, one must first know how pervasive errors are. Yet, it remains challenging to generalize the detection and correction of scientific errors across disciplines because of field specificity (i.e., the unique nuances and methodological specificities inherent to a specific field of study) (Lohse et al., 2020), the various terminologies used for describing the same models (e.g., 'hierarchical linear' models vs. 'multilevel' models), as well as the seeming need to repackage the same problem as new disciplines arise (e.g., ongoing multiple comparison issues raised anew with the advent of genome-wide association studies, microarray, microbiome, and functional magnetic resonance imaging methods). Thus, this subgroup discussed the value of longitudinal, discipline-specific error surveillance and error frequency estimation to collect empirical evidence about error rate differences among disciplines. Other issues discussed were the identification of better prevalence estimates across fields, and how simulation studies can modify our confidence in the understanding of the prevalence of errors and their generalizability across disciplines.
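To make the automated-detection idea discussed above concrete, here is a minimal Python sketch of the kind of consistency check a tool like 'Statcheck' performs: extract an APA-style t-test report from text and recompute the two-tailed p-value from the test statistic and degrees of freedom. This is not Statcheck's actual code (Statcheck is an R package); the regular expression, rounding tolerance, and example sentence are illustrative assumptions.

```python
# Illustrative sketch of automated p-value checking for APA-style reports
# such as "t(28) = 2.20, p = .036". Not the actual Statcheck implementation.
import re
from scipy import stats

APA_T = re.compile(r"t\((\d+)\)\s*=\s*(-?\d+\.?\d*),\s*p\s*=\s*(0?\.\d+)")

def check_t_reports(text, tolerance=0.001):
    """Flag reported two-tailed t-test p-values that disagree with the value
    recomputed from t and df by more than a rounding tolerance."""
    findings = []
    for df, t, p_reported in APA_T.findall(text):
        p_computed = 2 * stats.t.sf(abs(float(t)), int(df))
        consistent = abs(p_computed - float(p_reported)) <= tolerance
        findings.append((df, t, float(p_reported), round(p_computed, 4), consistent))
    return findings

sample = ("A clear effect emerged, t(28) = 2.20, p = .036, while a second "
          "test was misreported, t(45) = 1.10, p = .020.")
for df, t, p_rep, p_comp, ok in check_t_reports(sample):
    print(f"t({df}) = {t}: reported p = {p_rep}, recomputed p = {p_comp} "
          f"-> {'consistent' if ok else 'INCONSISTENT'}")
```

Extending such a check beyond a single reporting style is precisely where the field-specific expertise discussed above becomes necessary.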

Do error prevention workflows reduce errors?
Workflows are the various approaches for accomplishing scientific objectives, usually expressed as tasks and dependencies (Ludäscher et al., 2009). The implementation of clear, logical workflows can potentially prevent errors and improve research transparency. Workflows may be of value to catch errors at various stages of the research process, from planning, to data collection and handling procedures, and reporting/manuscript screening (Cohen-Boulakia et al., 2017). Error detection processes within scientific workflows may serve as mechanisms to prevent errors before publication, akin to how text duplication software (e.g. iThenticate) is used prophylactically to catch inadvertent plagiarism. Separately, some research groups implement workflows that require two independent scientists to verify data, analyses, and statistical reporting prior to manuscript publication, with at least one of those individuals being a professional statistician (George et al., 2016). A similar workflow is to establish "red teams", consisting of methodologists, statisticians, and subject-matter experts, to critique the study design and analysis for errors, offering incentives akin to "bug bounty" programs in computer software development (Lakens, 2020).
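As a minimal sketch of what one step of such a dual-verification workflow could look like, the script below compares summary statistics computed independently by two analysts and flags disagreements before a manuscript is drafted. The file names, JSON format, and tolerance are hypothetical choices for illustration, not a procedure prescribed by the sources cited above.

```python
# Illustrative only: compare results produced independently by two analysts.
# Each JSON file is assumed to map statistic names to numeric values, e.g.
# {"mean_outcome": 4.217, "effect_estimate": 0.312}.
import json

TOLERANCE = 1e-8  # numeric agreement threshold, agreed on in advance

def compare_results(path_a, path_b):
    """Return discrepancies between two independently produced results files."""
    with open(path_a) as fa, open(path_b) as fb:
        a, b = json.load(fa), json.load(fb)
    discrepancies = []
    for key in sorted(set(a) | set(b)):
        if key not in a or key not in b:
            discrepancies.append((key, a.get(key), b.get(key), "missing"))
        elif abs(a[key] - b[key]) > TOLERANCE:
            discrepancies.append((key, a[key], b[key], "mismatch"))
    return discrepancies

if __name__ == "__main__":
    for key, va, vb, kind in compare_results("analyst_a.json", "analyst_b.json"):
        print(f"{kind}: {key} -> analyst A: {va}, analyst B: {vb}")
```

Run automatically (e.g., as a continuous-integration step), such a comparison turns independent verification from a one-off courtesy into an enforced gate before submission.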
The development and dissemination of research workflows could be modeled after those outlined above, or in other ways such as the use of checklists to complete work systematically. Although many such workflows exist, few have been tested empirically.

How do we encourage post-publication error correction?
The subgroup debated how journals and their editors could be part of empirically tested trials on the best approaches to facilitate correction and minimize additional costs. For example, based on our experiences, journals have few procedures for handling errors separate from typical scholarly dialogue. We believe it is important to examine which procedures are more efficient and fair to authors, whether such procedures can be standardized to enable editors to handle different types of errors consistently and transparently, whether existing correction mechanisms (e.g., retraction and republication) are sufficient or require additional innovation (e.g., versioning), and how authors can be supported and encouraged in the process. Three such costs that require further study are the actual cost of post-publication error correction across all parties involved (e.g., page charges, salary), how those costs to the scientific enterprise compare to the costs of implementing prevention strategies, and the cost-benefit of salvaging a publication containing an error (depending on the quality of the collected data) versus simply retracting it.

Subgroup 3 - looking outward: increasing truthfulness and accuracy of research communications
The third working group discussed opportunities for research related to research reporting and dissemination, primarily highlighting the importance of accuracy and truthfulness when communicating research findings (see Table 2). Specifically, this group identified research opportunities tied to the following questions: (8) How does 'spin' in research communication affect stakeholder understanding and use of research evidence?; (9) Do tools to aid the writing of research reports increase their comprehensiveness and clarity?; and (10) Is it possible to inculcate scientific values and norms related to truthful, rigorous, accurate, and comprehensive scientific reporting?

Is it possible to inculcate scientific values and norms?
Participants agreed that ethics and responsibility are vital across scientific disciplines, yet graduate research training often neglects the philosophy of science and the formation of professional identity as a scientist. Instead, training tends to focus on the technical skills needed to conduct experiments and analyze data in specific disciplines (Bosch, 2018; Bosch & Casadevall, 2017). Technical skills are essential to produce good science; to apply them ethically and responsibly, however, it is paramount that scientists also endorse scientific values and norms. Participants identified a need for research to determine how these scientific values could be inculcated in scientists and how scientists should be taught to enact those values in their research.

Conclusion
Scientists slow the pursuit of truth when research is not rigorous, reproducible, or transparent (Collins & Tabak, 2014).
To improve the state of science, RRT leaders have long raised concerns about the challenges the scientific enterprise faces and have identified novel strategies intended to uphold and improve scientific validity. Discussions among RRT leaders at Indiana University Bloomington reinforce the value and importance of promoting accurate, objective, and truthful science. The proposal, execution, and evaluation of the ideas presented herein showcase how the collective and interdisciplinary efforts of those investing in the future of science can solve problems in unique and exciting ways.

Data availability
No data are associated with this article.
All participants have provided their permission to be named in this article.
Reviewer report 1

While the manuscript's treatment of errors centers wholly on statistics, the interpretation of data in papers published in the biological sciences does not always require sophisticated statistical analyses; rather, diligent data reporting and transparency are essential.

Conclusion:
The authors summarize with "proposal, execution, and evaluation of the ideas presented herein showcases how the collective and interdisciplinary efforts of those investing in the future of science can solve problems in unique and exciting ways". While I appreciate this forward-looking statement, the message is clear: the issue of reproducibility in science is complex and will continue to be debated and discussed in workshops such as the one this manuscript describes in the coming years. The article "Improving open and rigorous science...." is a report out on a workshop intended to make recommendations on improving rigor, reproducibility, and transparency (RRT) in interdisciplinary science. The idea of peer reviewing a workshop report is a bit of a curious assignment. What's a reviewer to say? No, those weren't the best topics to debate at your workshop, please reconvene and discuss something else? Raise questions about whether the article faithfully reports the workshop deliberations and consensus, when the reviewer wasn't there? As such, this review is rather limited. The article reads well and has clearly been well vetted by the authors. The workshop and paper are interdisciplinary, although the focus is strongly slanted toward biomedical research and the health sciences.

Not all errors are mistakes
My only criticism of substance is the use of the term "statistical errors." Consider replacing it with "statistical mistakes" throughout the manuscript. In many fields, including mine (environmental science), the word "error" could refer to variability in the data, such as "the standard error of the mean." In other contexts, the word error is often used to describe the limits of precision. DNA and cells replicate with small errors, which over time lead to aging and senescence. In analytical chemistry, deviations from instrument values for calibration or quality control samples may be termed measurement error. Measurement error might refer to the inherent limits of a sensor in the instrument or the combined errors of the method. For example, in a bathymetric survey, errors accrue from inherent limits in measuring distance as a function of sound through water; temperature changes in the water introduce error; a breeze adding motion to the boat introduces error; plants growing on the bottom muddy the signal, increasing error; and imprecision in the Earth's spheroid and canyon walls interfere with the GPS, and on and on. The hydrologist tries to reflect the accumulated error with a margin of error statement on overall accuracy. Those are examples of error - something the scientist always seeks to reduce and to report accurately through the uncertainties associated with measurements, modeling, etc. - but the presence of error is unavoidable. A mistake, on the other hand, is a blunder: attaching the bathymetric sensor backwards, entering the wrong units into the calculations, using a long-wave, deep-ocean sensor in shallow water, using the wrong datum, using a poorly suited method, neglecting calibrations, ... As with statistical mistakes, the subject of this argument, there are often several appropriate methods of measurement for just about any scientific setting, some controversial or debatable methods, and some that are just plain wrong. The focus of the authors is on the latter: helping scientists avoid statistical blunders that are just plain wrong. I strongly urge you to call these "statistical mistakes," which is less ambiguous than "errors." These are supposed to be interdisciplinary RRT recommendations.
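To put a number on the reviewer's distinction: independent, random error sources like those in the bathymetry example are conventionally combined in quadrature (root-sum-of-squares) to produce the overall margin of error the hydrologist reports. The component magnitudes below are invented purely for illustration.

```python
# Illustrative only: combining independent one-sigma error sources (meters)
# into an overall margin of error via root-sum-of-squares.
import math

error_sources = {
    "sound-speed profile": 0.05,
    "boat motion": 0.08,
    "bottom vegetation": 0.10,
    "GPS position/datum": 0.04,
}

combined = math.sqrt(sum(e ** 2 for e in error_sources.values()))
print(f"combined one-sigma error: {combined:.3f} m")
print(f"approximate 95% margin of error: +/- {1.96 * combined:.3f} m")
```

No amount of care drives this number to zero; a mistake, by contrast (a backwards sensor, wrong units), is not captured by any such error budget.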

Minor suggestions
p. 7, in the subsection titled "5. What is the prevalence and impact of errors," I thought the second paragraph was particularly dense and probably impenetrable to those not already in the know: "Thus, [Subgroup 2] discussed the value of longitudinal, discipline-specific error surveillance and error frequency estimation to collect empirical evidence about error rate differences among disciplines. Other issues discussed were the identification of better prevalence estimates across fields, and how simulation studies can modify our confidence in the understanding of the prevalence of errors and their generalizability across disciplines." That's all. This was a tightly written report out of the workshop. Thank you for considering my rant about mistakes versus errors, where depending on the field and context, the latter is often a neutral descriptor of uncertainty.
Is the topic of the opinion article discussed accurately in the context of the current literature? Yes

Are arguments sufficiently supported by evidence from the published literature? Yes
Are the conclusions drawn balanced and justified on the basis of the presented arguments? Yes

Reviewer report 2

The key questions are clearly formulated, though with the small size of the meeting and the limited number of invited participants outside of the university host, it is difficult to say whether the discussions, presented in a very succinct format of key challenges, are representative of all of the issues or viewpoints on the topic. Nevertheless, this appears to have been a good discussion that raised significant challenges. I would have preferred to see a bit more focus on solutions, as the challenges raised are all daunting.

Introduction:
Regarding the statement that 40-70% of scientists agreed on factors contributing to irreproducibility, the original citation should be used (Baker, 2016). Also, the reference to the funder for the meeting is very much appreciated - but it is "Alfred P. Sloan," not "Afred." In the last sentence, "through to execution" is unwieldy - either "through" or "to" works, but there is no need for both.

Methods:
I very much appreciate the list of participants and acknowledgement of honorariums - kudos on the transparency! I also appreciate knowing who participated in the small groups, but it would have been nice to see the agenda or titles of the Day One research presentations. Were those research or meta-research presentations? Also, "small-groups" should not be hyphenated; in fact, you could just say three groups and let the reader come to their own conclusion about size. "Breakout" is another useful term.

Results:
Subgroup 1, first paragraph: the following wording could be more precise by changing "three primary education-related questions" (where primary modifies education and not questions) to "three primary questions, education-related," or something similar. Precision of language is one of the articulated goals of training and communication in this article!

Q5, 2nd paragraph: I disagree with the first sentence, "To achieve the goal of error reduction, one must first know how pervasive errors are." I think any reduction in errors is a win, even without understanding the entire landscape, and needing to fully understand the landscape before attempting solutions is just kicking the can down the road. It's the "measurement" of error reduction, or assessing progress toward a particular goal (which is not articulated), that requires knowing the pervasiveness first, and I agree that is extremely difficult to measure.

Q7, 2nd paragraph, last sentence: I question whether understanding "salary" costs of error correction is a valid pursuit, or whether it's a case of pay now or pay later; page charges are a different matter.

Conclusion:
Since the Methods section stated that the meeting ended with a "call to action" to continue promoting interdisciplinary RRT science, I wonder if that call to action is accurately summarized? I found a great summary of the discussion but didn't walk away with a clearly articulated call to action in the very brief conclusion.

General Comments:
I tend to agree that the challenges are many and difficult, though the small group discussions are distilled down to two challenges per question. They are mainly framed in negative terms, which is hard to read as a "call to action" without more detail. Nonetheless, the challenges raised are important and should be addressed, I'm just left scratching my head on what the next step is for many of these, given how they are stated.
I note that many of the references are from participants at the meeting, which may reflect the meeting content (difficult to judge without seeing the agenda), but does not necessarily instill in others an unbiased approach; this is perhaps a limitation of a small-meeting-by-invitation and could be formally recognized in the paper. This is not a value judgement on the references, indeed there is some balance, but it is a selected view that focuses on the meeting participants.