Efforts to enhance reproducibility in a human performance research project

Background: Ensuring the validity of results from funded programs is a critical concern for agencies that sponsor biological research. In recent years, the open science movement has sought to promote reproducibility by encouraging sharing not only of finished manuscripts but also of data and code supporting their findings. While these innovations have lent support to third-party efforts to replicate calculations underlying key results in the scientific literature, fields of inquiry where privacy considerations or other sensitivities preclude the broad distribution of raw data or analysis may require a more targeted approach to promote the quality of research output. Methods: We describe efforts oriented toward this goal that were implemented in one human performance research program, Measuring Biological Aptitude, organized by the Defense Advanced Research Projects Agency's Biological Technologies Office. Our team implemented a four-pronged independent verification and validation (IV&V) strategy including 1) a centralized data storage and exchange platform, 2) quality assurance and quality control (QA/QC) of data collection, 3) test and evaluation of performer models, and 4) an archival software and data repository. Results: Our IV&V plan was carried out with assistance from both the funding agency and participating teams of researchers. QA/QC of data acquisition aided in process improvement and the flagging of experimental errors. Holdout validation set tests provided an independent gauge of model performance. Conclusions: In circumstances that do not support a fully open approach to scientific criticism, standing up independent teams to cross-check and validate the results generated by primary investigators can be an important tool to promote reproducibility of results.


Introduction
Reproducibility of findings is a fundamental requirement of any scientific research endeavor. Nevertheless, for a variety of reasons, reproducibility remains a challenge in many areas of the life sciences. 1 Batch effects, hidden variables, and low signal-to-noise ratios may interfere with researchers' ability to draw broad conclusions based on small quantities of data. 2 Similarly, data snooping may, even unintentionally, lead to the selection of models that have poor generalizability outside an original dataset. 3 These difficulties can be exacerbated in exploratory studies that are by design not limited to the consideration of only one or a small number of pre-specified hypotheses, but rather constructed for the purpose of testing a very large number of possible explanatory variables using a statistical approach.
Transparency in data collection and analysis has been suggested as a potential means of bringing to light methodological or other flaws that may impair the reproducibility of results in various areas of biomedical research. For example, authors who distribute notebooks integrating data, code, and text directly enable others to replicate some or all of the analysis supporting their stated conclusions. 4 While such measures do not exclude all possible errors that might call into question the reliability of published findings, scientists adhering to these practices considerably reduce the ambiguity associated with the steps in their workflow subsequent to data acquisition. 5,6
Organizations that fund research must take into account these considerations and others in planning new research and development (R&D) programs. The time and money available to obtain answers to the scientific questions of stakeholder interest are generally limited. Reachback tasking that would allow re-analysis of data or models following an original period of performance is not always possible, as studies frequently rely on teams assembled in an ad hoc manner to respond to the requirements of a specific project call. Moreover, publication of results is not a guarantee that the artifacts of a research program will be fully preserved. While many journals have adopted standards for sharing of data and code, compliance with these policies is imperfect. 7,8 Indeed, selective publication practices have themselves been implicated as potential sources of bias in the scientific literature. 9,10 Finally, the aspirations of open science may conflict with project constraints when supporting data are not suitable for release into the public domain.
Here we describe the efforts of one research program, Measuring Biological Aptitude (MBA), a four-year effort sponsored by the Biological Technologies Office of the Defense Advanced Research Projects Agency (DARPA), to improve the reproducibility of studies performed with the goal of optimizing human performance in a variety of athletic and cognitive military skills tests. As the MBA program involved data encumbered by restrictions related to personal privacy, medical confidentiality, and national defense, a fully open approach to promoting reproducibility was not practical. Instead, the program sponsored the authors of this manuscript to conduct a comprehensive independent verification and validation (IV&V) program to test and evaluate the results generated by the primary program contractors and modeling teams.

Independent verification and validation
DARPA defines IV&V as "the verification and validation of a system or software product by an organization that is technically, managerially, and financially independent from the organization responsible for developing the product" (DARPA Instruction 70). In recent years, IV&V has become a key component of various DARPA research programs both inside and outside the life sciences domain. 11 In contrast to the standard in open science, which generally relies on the free and voluntary participation of members of the scientific community to verify the results of third-party studies, DARPA's policy suggests that independent efforts to support the integrity of scientific results are of sufficient importance to merit direct funding, using teams selected for their expertise in the relevant technical areas.
According to the MBA Broad Agency Announcement (BAA), primary performers were charged to "identify, understand, and measure the expression circuits (e.g., genetic, epigenetic, metabolomic, etc.) that shape a warfighter's cognitive, behavioral, and physical traits, or phenotypes, related to performance across a set of career specializations." The IV&V team, by contrast, was directed to "verify and validate whether the expression circuits, as measured by the molecular targets identified, directly correlate to dynamic changes in performance traits in the individual and independently confirm…that those circuits correlate to selection success or failure." From this followed a corresponding but distinct schedule of tasks for each group (Figure 1).
In 2019, DARPA selected Lawrence Livermore National Laboratory (LLNL) and the University of Illinois Urbana-Champaign (UIUC) to lead the IV&V component of the MBA program. According to the IV&V plan developed by LLNL, the effort comprised four core focus areas: 1) a centralized secure data storage and exchange platform, 2) quality assurance and quality control checklists applied to data acquisition, 3) test and evaluation of performer modeling products, and 4) an archival software repository and data store.

Secure data storage and exchange platform
While the unit costs of bioinformatic data collection have declined in recent years, the acquisition and processing of large-scale omics data remain both financially and computationally expensive. 12 For reasons of reliability and cost savings, research sponsors may desire a centralized user facility for data storage and pre-processing, even in projects that involve multiple competing investigators and modeling teams. Moreover, if program managers wish to obtain a comparison of performance across several predictive models, centralized data services may help to maximize the time that data scientists and statisticians are able to devote to model selection while minimizing the risk that ambiguities in outcome labels or other metadata may lead different groups to substantially varying interpretations of the same modeling problem. 13,14
A separate consideration in research involving human subjects is data security and privacy. In the U.S., federal regulations require institutional review boards (IRBs) to evaluate each proposed project's provisions for protecting the privacy and confidentiality of human subjects information, regardless of whether a study explicitly plans to include data that is covered by other medical privacy laws. 15 In addition, considerations such as the possible re-identification of putatively de-identified health data may warrant additional data protection precautions even when not required by statute or regulation. 16
To ensure both data security and data consistency for all teams working on the project, LLNL built a computing enclave for storing and analyzing MBA program data (Figure 2). Following best practices employed by other centralized biomedical data repositories, the enclave implemented cyber security controls at the FISMA Moderate policy level with enhanced controls from the NIST 800-53 and 800-66 information security guidelines for privacy and HIPAA compliance. 17,18,19 All accredited users of the enclave were required to complete cyber security training and a human subjects protection course prior to receiving computing accounts. 20 Access to the enclave was established through multifactor authentication.
To minimize risks of intentional and/or accidental duplication of human subjects data, the enclave featured a Virtual Network Computing (VNC) portal through which external collaborators could interact with program data. As the enclave excluded other networking protocols for ordinary users, the visual interface allowed modelers to perform analyses in a standard Linux computing environment while imposing a soft barrier against the bulk download of sensitive data. Modelers were allowed to upload new data or software dependencies to the enclave via the data transfer node. However, while outbound transfers of finished analysis were supported via the same pathway, these required the additional step of administrator review and approval.
To permit utilization of high-performance computing (HPC) systems, the LLNL secure enclave was extended to include the Livermore Computing Collaboration Zone (CZ; https://hpc.llnl.gov/hardware/zones-k-enclave). This enabled analysis of multiple omics datasets using leadership-class compute platforms such as Mammoth, an 8,800-core cluster acquired via the National Nuclear Security Administration's Advanced Simulation and Computing (ASC) Program.

Quality assurance and quality control of data acquisition
In recent years, various scientific disciplines and consortia have developed minimum standards for the inclusion of data in both centralized repositories and published meta-analyses. 21 These guidelines have encompassed a range of data acquisition formats, including genetic, proteomic, and other biochemical data. 22,23 For example, the Human Proteome Organization's Proteomics Standards Initiative developed the Minimum Information About a Proteomics Experiment (MIAPE) standard for mass spectrometry experiments involving protein and peptide identification. 24 Similarly, in the human subjects field, the STrengthening the Reporting of OBservational studies in Epidemiology (STROBE) statement set minimum metadata standards for collection, archiving, and reporting of epidemiological research data. 25 Other standards-setting organizations go beyond simple metadata reporting requirements and seek to detail comprehensive processes and systems that can help ensure data quality (quality assurance, or QA), as well as specific tests and benchmarks that can flag errors during and after data collection (quality control, or QC). 26 These types of procedural checks have been adopted, for example, by the Metabolomics Quality Assurance and Quality Control Consortium (mQACC) and the Encyclopedia of DNA Elements (ENCODE) Consortium. 27,28 The MBA program used this type of standard as a model for its own QA/QC efforts.
LLNL experimentalists generated a scoring rubric for each of the molecular and omics data collection modalities employed in the various human trials throughout MBA. Example rubrics are shown in Table 1 and Extended Data Tables 1-3. A pass/fail checklist was used to determine whether each dataset met the minimum quality standards for use by the modeling teams. Some criteria involved best practices in sample handling and study design, while others were specific to the instrumentation used and, in general, followed manufacturer recommendations. For some types of data collection, including genome sequencing data, open source tools such as MultiQC were utilized as components of the scoring framework. 29 Following the IV&V team's evaluation of each dataset against the rubrics, a scorecard was transmitted to the performer team or subcontractor responsible for the data collection, and the program office was consulted for a final determination on the inclusion of the dataset in the modeling corpus.
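The pass/fail scorecard logic described above can be sketched in a few lines. This is a minimal illustration only: the criterion names, thresholds, and function names below are hypothetical, not the program's actual rubric entries or tooling.

```python
# Sketch of a pass/fail QA/QC scorecard. Criterion names and thresholds are
# hypothetical illustrations, not the MBA program's actual rubric.

def score_dataset(metrics, rubric):
    """Evaluate one dataset's QC metrics against a pass/fail rubric.

    metrics: criterion name -> measured value
    rubric:  criterion name -> predicate returning True on pass
    Returns (scorecard dict, overall pass/fail flag).
    """
    scorecard = {name: bool(check(metrics.get(name))) for name, check in rubric.items()}
    return scorecard, all(scorecard.values())

# Hypothetical sequencing rubric in the spirit of a MultiQC-style report
rubric = {
    "mean_base_quality": lambda q: q is not None and q >= 30,
    "duplication_rate": lambda d: d is not None and d <= 0.20,
    "sample_labeled": lambda s: s is True,
}

metrics = {"mean_base_quality": 34.2, "duplication_rate": 0.12, "sample_labeled": True}
scorecard, passed = score_dataset(metrics, rubric)  # passed -> True
```

A dataset missing any rubric criterion fails automatically, which mirrors the checklist behavior described above: every box must be checked before data enters the modeling corpus.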
Additionally, UIUC statisticians developed separate sets of metrics for the phenotypic and behavioral data collected during the course of the program. These rubrics were crafted to flag outliers and diagnose other potential data quality issues. The analyses encompassed five general domains: cognition, demographics, human performance, personality, and wearable sensors (see Table 2). Some metrics applied to only a single domain, whereas others (e.g., missing data) were relevant for multiple domains. When a quality assurance failure occurred, the Potential Issue column of Table 2 provided plausible mechanisms that may have underlain the faulty data collection process. The team also developed customized R software scripts that read in the data and automatically generated tables, figures, and reports.
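Two of the simplest checks in this spirit, missing-data rate and outlier flagging, can be sketched as follows. Note that the UIUC team's actual tooling was written in R; Python is used here purely for illustration, and the heart-rate column and z-score cutoff are hypothetical.

```python
# Sketch of two phenotypic-data QC metrics of the kind described above:
# missing-data rate and z-score outlier flagging. The data column and cutoff
# are hypothetical illustrations (the actual program scripts were in R).
from statistics import mean, stdev

def missing_rate(values):
    """Fraction of observations recorded as missing (None)."""
    return sum(v is None for v in values) / len(values)

def flag_outliers(values, z_cutoff=2.5):
    """Indices of observations more than z_cutoff standard deviations from the mean."""
    present = [v for v in values if v is not None]
    mu, sd = mean(present), stdev(present)
    return [i for i, v in enumerate(values)
            if v is not None and sd > 0 and abs(v - mu) / sd > z_cutoff]

# Hypothetical resting heart-rate column: one missing entry, one implausible spike
heart_rate = [61, 58, 64, None, 60, 59, 190, 62, 63, 61]
print(missing_rate(heart_rate))   # 0.1
print(flag_outliers(heart_rate))  # [6] -> the 190 bpm reading
```

In practice a flagged value would map to a Potential Issue entry (e.g., a sensor worn incorrectly) rather than being silently dropped.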

Test and evaluation of performer modeling products
The primary deliverable from the MBA IV&V effort was the test and evaluation of performer expression circuit models used to predict achievement on military skills tests. While performers trained statistical models to predict pass/fail outcomes for different candidates on a battery of human performance and cognitive tests, the IV&V team was responsible for certifying to the military cadre that the selected molecular observables were in fact predictive of the chosen outcomes.
Several factors complicated the evaluation of performer models according to these criteria, among them: 1) small sample sizes for program cohorts, 2) the potential for subjective evaluation criteria in certain skills tests, and 3) incomplete outcome data for some candidates. To mitigate these complications, we implemented a two-pronged model evaluation strategy consisting of both a qualitative component, based on pre-registration of the key mechanistic hypotheses each performer planned to investigate, and a quantitative component, based on an evaluation of the predictions of each performer model against a held-out validation set of true outcome labels.
Hypothesis pre-registration is a technique used in some disciplines to avoid using the same set of data for both hypothesis generation and hypothesis testing. 30 Hypotheses proposed by MBA performers at the outset of the modeling effort included a variety of potential biological mechanisms underlying task performance, such as sleep quality, metabolism, muscle tone recovery, and several proposed cognitive/psychological mechanisms. These pre-registration documents were retained by the IV&V team for later determination of whether the identified predictive biomarkers plausibly corresponded to the pre-specified categories.
For quantitative validation, the IV&V team held back 20-30% of the candidate outcome labels from the primary modeling teams during each year of the MBA program. The outcomes for this validation set were kept fully blinded from performer team members to prevent data snooping. 31 Modelers were given all other data from each annual cohort and then asked to submit predictions of the outcomes of the blinded candidates for scoring by the IV&V team. Results were announced at each program review meeting.
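The blinded-holdout loop described above can be sketched as follows. The function names, random seed, and toy labels are illustrative assumptions, not the program's actual tooling; the scoring metrics (accuracy and F-score) are those named later in the Results.

```python
# Sketch of a blinded-holdout evaluation loop: the IV&V side reserves a
# fraction of outcome labels, then scores submitted predictions against them.
# Names, seed, and toy labels are illustrative assumptions.
import random

def split_holdout(candidate_ids, holdout_frac=0.25, seed=0):
    """Reserve a random fraction of candidates whose outcome labels stay blinded."""
    rng = random.Random(seed)
    ids = list(candidate_ids)
    rng.shuffle(ids)
    k = int(len(ids) * holdout_frac)
    return ids[k:], ids[:k]  # (released to modelers, held out by IV&V)

def score_predictions(truth, predictions):
    """Accuracy and F-score for binary pass/fail predictions on the holdout."""
    tp = sum(truth[c] and predictions[c] for c in truth)
    fp = sum(not truth[c] and predictions[c] for c in truth)
    fn = sum(truth[c] and not predictions[c] for c in truth)
    accuracy = sum(truth[c] == predictions[c] for c in truth) / len(truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1

# Toy example: four blinded candidates, one incorrect prediction
truth = {"c1": True, "c2": False, "c3": True, "c4": False}
preds = {"c1": True, "c2": True, "c3": True, "c4": False}
accuracy, f1 = score_predictions(truth, preds)  # accuracy = 0.75, f1 ≈ 0.8
```

Because the held-out labels never leave the IV&V side, the score is an honest estimate of generalization rather than of a model's ability to fit the released data.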

Software repository and data archive
Preserving the ability to apply predictive models to new cohorts of individuals following the conclusion of MBA was a key goal of the program. Given the small sizes of individual cohorts, prospective model testing on future data collection was considered a significant component of the overall validation strategy. Furthermore, the IV&V team desired to ensure that, to the extent possible, the models would be independent of the choice of laboratory for omics data processing to avoid vendor lock-in.
To facilitate a single storage location for program data, LLNL data scientists generated a MariaDB database schema to contain all multi-omic, phenotypic, and outcome data collected over the course of the MBA program. As some omics data was too large to practically store within the database itself, the database contained links to the original and processed data files stored in a master data archive. It also contained metadata to track QA/QC results associated with different datasets, as well as to reconcile individual research subjects with their anonymized identifiers.
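A minimal sketch of a relational schema in this spirit is shown below, using Python's built-in sqlite3 as a stand-in for MariaDB. All table and column names and the archive path are hypothetical illustrations; the design point carried over from the text is that large omics files live in the archive and are referenced by path, while QC status and anonymized identifiers are tracked as metadata.

```python
# Sketch of a schema linking subjects, anonymized IDs, archived omics files,
# and QA/QC status. sqlite3 is used as a stand-in for MariaDB; all names and
# paths are hypothetical illustrations.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subject (
    subject_id INTEGER PRIMARY KEY,
    anon_code  TEXT UNIQUE NOT NULL          -- anonymized identifier seen by modelers
);
CREATE TABLE dataset (
    dataset_id   INTEGER PRIMARY KEY,
    subject_id   INTEGER NOT NULL REFERENCES subject(subject_id),
    modality     TEXT NOT NULL,              -- e.g. 'rnaseq', 'metabolomics'
    archive_path TEXT NOT NULL,              -- link to file in the master archive
    qc_passed    INTEGER                     -- QA/QC scorecard result
);
""")
conn.execute("INSERT INTO subject (anon_code) VALUES ('MBA-0001')")
conn.execute(
    "INSERT INTO dataset (subject_id, modality, archive_path, qc_passed) "
    "VALUES (1, 'rnaseq', '/archive/rnaseq/MBA-0001.fastq.gz', 1)")
```

Keeping `qc_passed` alongside the file link lets the modeling corpus be assembled by a single filtered query rather than by manual bookkeeping.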
To support continued usefulness of the predictive models, the IV&V team requested that performers package their analysis and models for future use as research compendia, according to the method of Marwick et al. 32 We chose this format as the R language was the preferred coding environment of the majority of modeling teams.To support long-term portability of the modeling pipelines, containerization of the computing environment using Docker or Singularity was also recommended for each team.

Results
The primary investigator-led teams funded to perform work for MBA offered a high level of cooperation with our IV&V efforts. We were aided by support for our IV&V plan from the research sponsor, particularly when elements of the plan necessitated extra effort by the performer teams, such as in the case of hypothesis pre-registration or periodic data holdbacks.
Data QA/QC added a modest amount of time between the return of results from experimenters and the availability of processed data for use by modelers. However, on several occasions involving both sequencing and mass spectrometry experiments, issues flagged during the QA/QC process spurred additional consultation with the data collection teams and led to process improvement that was incorporated into subsequent data re-analysis.
Regular holdout validation set tests of performer predictive models provided an unbiased, apples-to-apples comparison of model performance that assisted the sponsor in measuring progress against program goals. Unfortunately, program constraints made it difficult to test counterfactual predictions made by the modelers, i.e., predictions that certain individuals would have progressed further in the selection process than they actually did. As a result, measuring improvement over state-of-the-art in quantities such as recall, as envisioned at the outset of MBA, was not possible. Instead, the IV&V team defaulted to the use of prediction accuracy and F-score as the primary endpoints for model evaluation. 33

Discussion and lessons learned
In recent years, studies have demonstrated that diverse types of omics data are predictive of biological phenotypes supporting human performance characteristics. 34 Nevertheless, this field of research comes with a unique set of challenges that separate it from the much larger pool of clinical research seeking to drive progress in the medical domain. "Success" in the human performance context may be a more multifactorial entity than in the medical context, where it may simply entail the cessation of an identified disease process. Additionally, healthy and, in particular, athletically adept individuals may be more reluctant to participate in invasive specimen collection procedures than individuals already engaged with the medical system.
In working with cohorts that significantly depart from broader population baselines, reference data from publicly available databases may turn out to be of lesser value than modelers initially hope. For example, studies of the metabolic impact of various dietary regimens in aging or pre-diabetic populations may not have high transfer value in the warfighter population. To the extent possible, omics data collection for single individuals over long periods of time may mitigate this issue and limit the need for transfer learning from weakly representative populations.
Alternatively, research sponsors may wish to consider funding short-term but larger multiomic studies that are composed of participants more closely representative of the target population. Phenotypic outcomes could be collected passively and unobtrusively using wearables technology. Cadre members could be polled to determine surrogate endpoints, measurable in this more high-throughput context, that they believe are most closely related to their more holistic judgments in the selection process of interest. Additionally, if the surrogate endpoint markers are continuously valued, this type of outcome variable may allow for superior statistical power compared with dichotomized pass/fail outcome labels. 35
When asked to participate in blinded prediction contests, modelers may be reluctant to give up scarce training data samples as a validation holdout when the total number of observations in the dataset is small. Statistical techniques that require checking certain prerequisite assumptions, such as the normality of predictor distributions, may become tedious to implement when small amounts of data are released sequentially. The modeler experience might be subjectively improved if there is enough data to constitute multiple test sets, even if some of those are only partially blinded. For example, Kaggle, the competitive data science website, frequently splits datasets into training, "public leaderboard," and "private leaderboard" components, with the first category being fully accessible to modelers, the second providing a basis for competitors to obtain a preliminary score during the competition, and the final category remaining fully blinded until all models have been submitted. 36 Even though the "public leaderboard" data is not truly blinded, since competitors can iteratively query it throughout the model building process, modelers may nevertheless elect to use it judiciously to gauge their performance and to debug basic generalization errors.
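A Kaggle-style three-way split of this kind can be sketched as follows; the fractions, seed, and function name are illustrative assumptions rather than any platform's actual procedure.

```python
# Sketch of a three-way train / public-leaderboard / private-leaderboard split.
# Fractions, seed, and names are illustrative assumptions.
import random

def three_way_split(ids, public_frac=0.15, private_frac=0.15, seed=0):
    """Partition candidate IDs into fully released training data, a queryable
    'public leaderboard' slice, and a fully blinded 'private leaderboard' slice."""
    rng = random.Random(seed)
    ids = list(ids)
    rng.shuffle(ids)
    n_private = int(len(ids) * private_frac)
    n_public = int(len(ids) * public_frac)
    private = ids[:n_private]                     # blinded until final scoring
    public = ids[n_private:n_private + n_public]  # interim leaderboard scores
    train = ids[n_private + n_public:]            # released to modelers
    return train, public, private
```

Because the private slice is carved out before anything is released, iterative querying of the public slice cannot leak information about the final scoring set.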
Finally, program managers wanting to drive improvements over state-of-the-art outcome prediction, particularly for quantities such as recall, should engage early with program stakeholders to develop means of testing counterfactual predictions made by data scientists. For example, modelers may predict that certain candidates "would have passed" later rounds of a tournament selection process had they been given the opportunity to compete at the higher level. While this information may be of high value from the standpoint of program goals, these predictions are impervious to validation if those individuals are lost to follow-up.

Conclusions
Community-based efforts to promote reproducibility through open sharing of data and code have played an important role in advancing the methodological rigor of many scientific disciplines. We have demonstrated a paradigm for adapting several aspects of this approach to achieve independent verification and validation of results in the context of a research program where unlimited data exchange is not feasible.
Using holdout prediction tests and hypothesis pre-registration, our team was able to certify that predictive modeling benchmarks were achieved in the absence of data snooping. Additionally, a centralized data infrastructure and integrated QA/QC system promoted data integrity and helped to facilitate the preservation of data and algorithms generated in the course of the project for follow-on research efforts.
While our IV&V strategy was developed for projects at the intersection of human performance and defense, we anticipate that similar protocols may prove useful in other research contexts involving multiomic data analysis and sensitive human subjects data. Though the data and trained models from this project are encumbered by distribution restrictions, other artifacts from the study, such as QA/QC rubrics, have been made available to support future work in this area.

Open Peer Review

Elizabeth Dhummakupt
U.S. Army DEVCOM Chemical Biological Center, Aberdeen Proving Ground, Maryland, USA

This article discusses a method to assist in reproducibility of data collected and generated by studies in which broad questions are asked and large pools of participants are not always available. Additionally, the methods proposed here are useful when full and open data sharing is not always possible, as is the case with research involving national security implications.
The method discussed herein is easily implemented in that adherence to rubrics is required, which can be done without adding software packages or investing in high performance computing.

Is the rationale for developing the new method (or application) clearly explained? Yes
Is the description of the method technically sound? Yes

Are sufficient details provided to allow replication of the method development and its use by others? Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility? No source data required

Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Yes

Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Multi-omic biomarker discovery – proteins, lipids, metabolites

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Rahuman S. Malik-Sheriff
European Bioinformatics Institute (EMBL-EBI), Cambridge, UK

The authors present a methodology they used to ensure the reproducibility and validation of model predictions in the MBA program, where it was not possible to publicly share the raw data due to privacy concerns or the sensitivity of the information. The authors propose and perform independent verification and validation (IV&V) of markers identified using multi-omics data analysis to predict human performance, using holdout validation sets that are not shared with the project contractors to allow unbiased evaluation of model performance. This approach is quite reasonable, and the authors present the limitations in assessing counterfactual predictions of what would have happened in scenarios where the data was not acquired.
Overall, it is a very good piece of work, which fairly discusses the merits and limitations of these approaches. Holdout validation seems like an excellent strategy for the validation of models developed by diverse groups using the same set of data. Adding some quantitative table or figure showing how many of the hypotheses/findings were independently verified and validated, and how many were reproducible/not reproducible/non-verifiable, could strengthen the manuscript. If possible, some high-level categorization of the models/predictions would provide a bit more context.
The IV&V strategy includes implementation of four steps: 1) a centralized data storage and exchange platform, 2) quality assurance and quality control (QA/QC) of data collection, 3) test and evaluation of performer models, and 4) an archival software and data repository. These four steps are elaborately discussed in the manuscript; however, there are a few questions raised below. If the authors find it reasonable, it would be useful to discuss/clarify them in the manuscript.
It would be worthwhile discussing how the platform developed in step 1 differs from federated learning platforms and existing strategies, including their merits and demerits, if any.
Baker 2016 reported that over 70% of researchers failed to reproduce other scientists' experiments and 50% failed to reproduce their own. In step 2, concerning QA/QC, is it sufficient to review the experimental protocols and setups? Is it not required to independently reproduce a part of these experiments? Can the authors discuss this?
The QA/QC scoring rubric could act as a good guideline for data generation and analysis. However, some of the questions in the QA/QC scoring rubric are a little ambiguous; for example, in the question "Was the sample properly recorded?", what does 'properly' mean?
Are questions such as "Were samples thoroughly mixed before loading?" useful? Has anyone ever answered no to these questions?
In step 3, the authors suggest registration of the hypothesis; however, in most cases where multiomics data is used, hypothesis generation is often data-driven. Can the authors discuss this? How often are the registered hypotheses successfully verified by both parties?
Tiwari et al. 2020 showed that about half of computational models cannot be reproduced using the information provided in the manuscript. In step 3, for validation, did LLNL use the same code originally developed/used by the researchers? Was it verified and validated to ensure that any issues in the code were appropriately addressed? Any changes to the steps in processing data, including data normalization, will impact the prediction, so were any standards and best practice guidelines recommended to researchers to harmonize the analysis?

In step 4, though the data cannot be publicly shared, can some of the software/code/models be shared publicly in accordance with FAIR data-sharing principles? It would be critical to ensure these are reusable if the plan is to apply these models to new cohorts.
As a side note, with some rewriting, the abstract and overall article could be enhanced to appeal to a broader audience. Here are some of the points. These are optional recommendations; the authors can consider these and similar others in the article and address them as they feel appropriate.
○ In the abstract, if "one human performance" is the name of the research program, using punctuation/capitalization appropriately will be helpful; otherwise please rephrase the text accordingly.

○ In the abstract, the Methods could be improved to increase clarity. The authors could clarify that they present the method they developed and performed a third-party independent verification and validation of outcomes of the DARPA-funded projects.

○ In the abstract, the term "holdout validation set" could be clarified.

○ Clarifying in the abstract that the models are human performance prediction models based on multi-omics data would give a bit more context. If space allows, it would be useful to mention that warfighter performance is predicted.

○ In general, providing some definition for terms such as 'data snooping' and 'expression circuits' would be helpful.

Timothy M. Errington
Center for Open Science, Charlottesville, Virginia, USA

The article describes the approach the authors implemented as an independent verification and validation (IV&V) effort within the DARPA-funded Measuring Biological Aptitude (MBA) program. This includes implementing four areas to improve and test the reproducibility of the program performers, particularly within the context of a program where not all data can be openly shared.
Overall, this is a very nice example of how open science practices can be implemented within a program like this. My suggestions are directed at clarifications and expansions to help broaden the readership of this article.

1. Additional context about how IV&V is typically implemented in related programs and why these specific four features were developed.

2. The benefits (and costs) of implementing open science and reproducibility steps in a program like this. That is, IV&V is running 'alongside' the performers. However, outside large programs it is more typical that researchers (i.e., performers) disseminate their research and then other researchers (i.e., IV&V) look at the paper and outputs (e.g., data, code).

3. In addition to discussing the steps that were implemented (and expanding on the rationale), how did these specific steps improve the program? That is, in the authors' opinions, did the effort to implement these steps increase the overall model evaluations? At the end of the Results, the pivot the IV&V team took to evaluate performance is indicated, but the impact of these measures is absent. This doesn't have to be disclosing the specific F-scores, more how the four areas that the IV&V team implemented led to improvements (however that is defined) of the performers and program at large.

4. I think if possible, reworking Figure 1 to highlight not only the two tracks and primary responsibilities of the performers and IV&V, but how this maps onto a more general workflow of research (particularly at the intersection of human performance and defense) would be helpful. I think the abstraction beyond the specific program would help strengthen the conclusion the authors are making about this approach being useful for other projects.

Minor comments:

1. Was pre-registration implemented by performers and used by the IV&V team (which is how I read it from the text), or was it implemented by the IV&V team (which is what Figure 1 indicates)? Related, are any pre-registrations openly available, or is this also 'encumbered by distribution restrictions'?

2. Within the Independent verification and validation section of the Introduction there are extra periods in one of the sentences ("confirm…that those circuits").

Is the rationale for developing the new method (or application) clearly explained? Yes
Is the description of the method technically sound? Yes
Are sufficient details provided to allow replication of the method development and its use by others? Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility? No source data required
Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Partly
… section and the insights gained in the discussion section. This connection should explicitly define reproducibility in the given context and elaborate on how it is achieved, especially considering the constraint of not being able to share the data online. The paper could also discuss the limitations and problems that occurred in their context, along with proposing potential enhancements that the funding agency could implement to foster reproducibility in similar situations.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard; however, I have significant reservations, as outlined above.

Introduction
Data snooping - I don't think this term is understood by all. It might be helpful to add another sentence or phrase explaining data snooping.
Reachback tasking is not a standard term, as far as I know, and should be modified or explained.
Independent verification and validation (IV&V) - I would like to see much more discussion of what this typically looks like and how your implementation compares. A casual review of the literature shows many citations for "independent verification and validation (IV&V)" - how does your paper add to this literature? I cannot tell if your implementation is good, bad, or typical. Given that we cannot see any actual modeling results (I assume this is a restriction by the sponsor), I think it is important to include a broader comparison to the existing IV&V literature.

Methods
Standards discussion is good.
Could you really ascertain the answer to some of the QA/QC questions, such as "Were collection tubes labeled with unique, traceable codes?" and "Were samples thoroughly mixed before loading?" Can you include any more details about the actual models or their evaluation? This is for the quantitative evaluation. Or at least state explicitly the limitations on what you are allowed to provide.

Results
Are there any results, or at least the format for results, that you can share?

Other suggestions
Mention privacy-preserving federated learning as another option for incorporating data that cannot be shared directly.
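If the authors take up this suggestion, the core idea behind federated learning could be illustrated with a minimal sketch of federated averaging (FedAvg-style aggregation). All names, sites, and data below are hypothetical; this is a toy one-parameter linear model, not the program's actual models or infrastructure.

```python
# Federated averaging sketch: each site fits a model on its own private
# data and shares only the fitted weight, never the raw records.

def local_fit(xs, ys):
    """Least-squares slope for y = w * x, computed entirely on-site."""
    num = sum(x * y for x, y in zip(xs, ys))
    den = sum(x * x for x in xs)
    return num / den

def federated_average(site_weights, site_sizes):
    """Aggregate per-site weights, weighted by local sample counts."""
    total = sum(site_sizes)
    return sum(w * n for w, n in zip(site_weights, site_sizes)) / total

# Two sites hold private data that is never pooled centrally.
site_a = ([1.0, 2.0, 3.0], [2.1, 3.9, 6.2])
site_b = ([1.0, 4.0], [1.8, 8.1])

weights = [local_fit(*site_a), local_fit(*site_b)]
sizes = [len(site_a[0]), len(site_b[0])]
global_w = federated_average(weights, sizes)  # a single shared scalar
```

In a real deployment the shared quantities would be model weight vectors or gradients, typically with secure aggregation or differential privacy layered on top, but the data-stays-local property shown here is the essential point for contexts where raw data cannot be distributed.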

Is the rationale for developing the new method (or application) clearly explained? Yes
Is the description of the method technically sound? Yes
Reviewer Expertise: Genomics, data sharing solutions, software
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard; however, I have significant reservations, as outlined above.

Figure 1. Diagram of the MBA program workflow, denoting the activities of the primary performer teams (top) and corresponding responsibilities of the IV&V team (bottom).

Figure 2. Schematic of the MBA data infrastructure with both a physical enclave (left), including primary compute and data transfer nodes, and an extension to HPC infrastructure (right).

Reviewer Report 19 March 2024
https://doi.org/10.5256/f1000research.154122.r233981
© 2024 Malik-Sheriff R. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Is the rationale for developing the new method (or application) clearly explained? Yes
Is the description of the method technically sound? Partly
Are sufficient details provided to allow replication of the method development and its use by others? Partly
If any results are presented, are all the source data underlying the results available to ensure full reproducibility? No source data required
Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: My research focuses on reproducibility, FAIR principles, knowledge engineering, provenance, and research data management.

Are sufficient details provided to allow replication of the method development and its use by others? Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility? No source data required
Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Partly
Competing Interests: No competing interests were disclosed.

Table 1. Example QA/QC scoring rubric for immunophenotyping using the CytoFLEX S flow cytometer. Reference ranges are derived from 'CytoFLEX Platform Instructions for Use,' Beckman Coulter, rev. 12/11/2019. In addition to the pass/fail ranges, some rubrics included a 'warn' range for borderline data.
Other coverage in the acquisition of omics data relative to phenotypic data, for which several years of historical data collection were already available.