<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="methods-article" dtd-version="1.2" xml:lang="en">
    <front>
        <journal-meta>
            <journal-id journal-id-type="pmc">F1000Research</journal-id>
            <journal-title-group>
                <journal-title>F1000Research</journal-title>
            </journal-title-group>
            <issn pub-type="epub">2046-1402</issn>
            <publisher>
                <publisher-name>F1000 Research Limited</publisher-name>
                <publisher-loc>London, UK</publisher-loc>
            </publisher>
        </journal-meta>
        <article-meta>
            <article-id pub-id-type="doi">10.12688/f1000research.157325.2</article-id>
            <article-categories>
                <subj-group subj-group-type="heading">
                    <subject>Method Article</subject>
                </subj-group>
                <subj-group>
                    <subject>Articles</subject>
                </subj-group>
            </article-categories>
            <title-group>
                <article-title>Structuring data analysis projects in the Open Science era with Kerblam!</article-title>
                <fn-group content-type="pub-status">
                    <fn>
                        <p>[version 2; peer review: 2 approved]</p>
                    </fn>
                </fn-group>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author" corresp="yes">
                    <name>
                        <surname>Visentin</surname>
                        <given-names>Luca</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Software</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <uri content-type="orcid">https://orcid.org/0000-0003-2568-5694</uri>
                    <xref ref-type="corresp" rid="c1">a</xref>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Munaron</surname>
                        <given-names>Luca</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Funding Acquisition</role>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <contrib contrib-type="author" corresp="no">
                    <name>
                        <surname>Ruffinatti</surname>
                        <given-names>Federico Alessandro</given-names>
                    </name>
                    <role content-type="http://credit.niso.org/">Conceptualization</role>
                    <role content-type="http://credit.niso.org/">Methodology</role>
                    <role content-type="http://credit.niso.org/">Supervision</role>
                    <role content-type="http://credit.niso.org/">Writing &#x2013; Original Draft Preparation</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-3084-0380</uri>
                    <xref ref-type="aff" rid="a1">1</xref>
                </contrib>
                <aff id="a1">
                    <label>1</label>Department of Life Sciences and Systems Biology, University of Turin, Turin, 10136, Italy</aff>
            </contrib-group>
            <author-notes>
                <corresp id="c1">
                    <label>a</label>
                    <email xlink:href="mailto:luca.visentin@unito.it">luca.visentin@unito.it</email>
                </corresp>
                <fn fn-type="conflict">
                    <p>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>4</day>
                <month>4</month>
                <year>2025</year>
            </pub-date>
            <pub-date pub-type="collection">
                <year>2025</year>
            </pub-date>
            <volume>14</volume>
            <elocation-id>88</elocation-id>
            <history>
                <date date-type="accepted">
                    <day>25</day>
                    <month>3</month>
                    <year>2025</year>
                </date>
            </history>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2025 Visentin L et al.</copyright-statement>
                <copyright-year>2025</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <self-uri content-type="pdf" xlink:href="https://f1000research.com/articles/14-88/pdf"/>
            <abstract>
                <sec>
                    <title>Background</title>
                    <p>Structuring data analysis projects, that is, defining the layout of files and folders needed to analyze data using existing tools and novel code, largely follows personal preferences. Open Science calls for more accessible, transparent and understandable research. We believe that Open Science principles can be applied to the way data analysis projects are structured.</p>
                </sec>
                <sec>
                    <title>Methods</title>
                    <p>We examine the structure of several data analysis project templates by analyzing project template repositories hosted on GitHub. Through visualization of the resulting consensus structure, we draw observations on how the ecosystem of project structures is shaped and what its salient characteristics are.</p>
                </sec>
                <sec>
                    <title>Results</title>
                    <p>Project templates show little overlap, but many distinct practices can be highlighted. We take them into account with the wider Open Science philosophy to draw a few fundamental Design Principles to guide researchers when designing a project space. We present Kerblam!, a project management tool that can work with such a project structure to expedite data handling, execute workflow managers, and share the resulting workflow and analysis outputs with others.</p>
                </sec>
                <sec>
                    <title>Conclusions</title>
                    <p>We hope that, by following these principles and using Kerblam!, the landscape of data analysis projects can become more transparent, understandable, and ultimately useful to the wider community.</p>
                </sec>
            </abstract>
            <kwd-group kwd-group-type="author">
                <kwd>Project Management</kwd>
                <kwd>reproducibility</kwd>
                <kwd>open science</kwd>
                <kwd>workflows</kwd>
                <kwd>data management</kwd>
            </kwd-group>
            <funding-group>
                <award-group id="fund-1" xlink:href="https://doi.org/10.13039/501100021856">
                    <funding-source>Ministero dell'Universit&#x00e0; e della Ricerca</funding-source>
                    <award-id>20222RT5LC</award-id>
                </award-group>
                <funding-statement>This study was carried out within the project &#x201c;SAISEI - Multi-Scale Protocols Generation for Intelligent Biofabrication&#x201d; funded by the Ministero dell'Universit&#x00e0; e della Ricerca (Italian Ministry for Universities and Research) &#x2013; within the Progetti di Rilevante Interesse Nazionale (PRIN) 2022 program (D.D.104 -02/02/2022) [Prot. 20222RT5LC].</funding-statement>
                <funding-statement>
                    <italic>The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.</italic>
                </funding-statement>
            </funding-group>
        </article-meta>
        <notes>
            <sec sec-type="version-changes">
                <label>Revised</label>
                <title>Amendments from Version 1</title>
                <p>This revised version includes a comparison of Kerblam! with similar tools. The role of project structure in terms of reproducibility is also expanded in the introduction. Additionally, usage barriers and the limitations of Kerblam! are discussed. Several paragraphs have been reworded to increase legibility, clarity and flow.</p>
            </sec>
        </notes>
    </front>
    <body>
        <sec id="sec5" sec-type="intro">
            <title>Introduction</title>
            <p>Data analysis is a key step in all scientific experiments. In numerical data-centric fields, it is, in essence, a series of computational steps in which input data is processed by software to produce some output. Usually, the ultimate goal is to create secondary data for human interpretation to produce knowledge and insight into some phenomena. These manipulations can involve downloading input data on local storage, creating workflows and novel software&#x2014;also saved locally&#x2014;and running the analysis on local or remote (&#x201c;cloud&#x201d;) hardware.</p>
            <p>The reproducibility of scientific output has been a hot topic in recent years. A scoping report by the European Commission
                <sup>
                    <xref ref-type="bibr" rid="ref1">1</xref>
                </sup> covered this issue in 2020, highlighting results from a popular survey by Nature in 2016,
                <sup>
                    <xref ref-type="bibr" rid="ref2">2</xref>
                </sup> where researchers reported different success rates when trying to reproduce experiments, with the lowest rates in areas such as Chemistry, Biology and Medicine. Such failures may arise at multiple &#x201c;failure points&#x201d;: an experimental design too complex to replicate, missing protocols and other procedures, unavailable input data or analysis code, or poor computational reproducibility caused by, for example, changing package versions. We believe that one such failure point relates to the structure of the data analysis projects, and the way they are packaged and shown to the public.</p>
            <p>In this article, we will use the phrase &#x201c;data analysis project structure&#x201d; to refer to the way data analysis projects are organized on the actual file system, including the structure of folders on disk, the places where data, code, and workflows are stored, and the format in which the project is shared with the public. Unfortunately, as we will later demonstrate, such structures can vary considerably among researchers, making it difficult for the public to inspect and understand them.</p>
            <p>With the Open Science movement gaining traction in recent years,
                <sup>
                    <xref ref-type="bibr" rid="ref3">3</xref>
                </sup> there is a growing need to standardize how routine data analysis is structured and carried out. A significant milestone in putting the Open Science philosophy into practice was the formulation of the FAIR principles,
                <sup>
                    <xref ref-type="bibr" rid="ref4">4</xref>
                </sup> which propose characteristics that data should have to be more useful to the wider public. Notably, although originally conceived as guidelines for data management, the FAIR principles have recently been extended to other contexts, such as software.
                <sup>
                    <xref ref-type="bibr" rid="ref5">5</xref>
                </sup> By making data analyses more transparent and intelligible, the standardization of project structure complies with the FAIR principles&#x2019; call for more Findable, Accessible, Interoperable, and Reusable research objects.
                <sup>
                    <xref ref-type="bibr" rid="ref4">4</xref>
                </sup> Many efforts are under way to make reproducible pipelines easier for the wider public to create and execute, for example by leveraging methods such as containerization.
                <sup>
                    <xref ref-type="bibr" rid="ref6">6</xref>&#x2013;
                    <xref ref-type="bibr" rid="ref8">8</xref>
                </sup> However, while new tools and technologies offer unprecedented opportunities to make the whole process of data analysis increasingly transparent and reproducible, using them still requires time and effort from the researcher, as well as expertise and sensitivity to the issues of standardization and reproducibility.</p>
            <p>
In this work, we inspect the structure of many data science and data analysis project templates that are currently available online. Then, we outline best practices and considerations to take into account when thinking about structuring data analysis projects. Following these principles, we propose a simple, lightweight, and extensible project structure that fits many needs and is in line with projects already present in the ecosystem, thus providing a certain level of standardization. Finally, we introduce Kerblam!, a new tool that can be used to work in projects with this standard structure, taking care of common tasks, such as data retrieval and cleanup, workflow management, and containerization support. This could ultimately benefit the scientific community by making others&#x2019; work easier to understand and reproduce, for example, during the peer-review process. We hope that this article will be useful to both established data analysts, prompting them to streamline their data analysis projects, and researchers willing to increase the reproducibility of their data analysis efforts.</p>
        </sec>
        <sec id="sec6">
            <title>Data collection</title>
            <p>To fetch the structure of the most common data analysis projects, we ran two GitHub searches: one for the keywords 
                <italic toggle="yes">cookiecutter</italic> and 
                <italic toggle="yes">data</italic> (
                <monospace>cookiecutter</monospace> is a Python package that allows users to create, or &#x201c;cut,&#x201d; new projects from templates) and the other for the much more generic keywords 
                <italic toggle="yes">project</italic> and 
                <italic toggle="yes">template.</italic>
            </p>
            <p>We downloaded the top 50 repositories from each search, sorted by GitHub stars, which we used as a proxy for popularity and adoption. For each project, we either cut it with the 
                <monospace>cookiecutter</monospace> Python package or used it as is (for non-cookiecutter templates). Of these 100 repositories, 87 could ultimately be successfully cut and parsed and were therefore considered. All files and folders from the resulting projects were listed and compiled into a frequency graph.</p>
            <p>Some housekeeping files (like the
                <monospace>.git</monospace> directory and all its content) were stripped from the final search results, as they were deemed irrelevant to the project as a whole. For example, 
                <monospace>.gitkeep</monospace> files, which are commonly used to commit empty directories to version control, were excluded from the final figure. Finally, only files present in at least three templates were retained for plotting.</p>
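            <p>As an illustration of the counting and filtering steps described above, the following minimal Python sketch tallies in how many templates each relative path occurs, skips housekeeping entries, and keeps only the paths found in at least three templates. The function name <monospace>path_frequencies</monospace>, the <monospace>min_templates</monospace> parameter, the <monospace>HOUSEKEEPING</monospace> set, and the toy input are assumptions made for this example; the code actually used for the analysis is the one referenced in the software availability statement.</p>
            <code language="python">
```python
from collections import Counter
from pathlib import PurePosixPath

# Housekeeping names to ignore (assumed for this sketch; the real filter may differ).
HOUSEKEEPING = {".git", ".gitkeep"}


def path_frequencies(templates, min_templates=3):
    """Count in how many templates each relative path occurs.

    `templates` maps a template name to an iterable of relative POSIX paths.
    Paths with a housekeeping component are skipped, and only paths found
    in at least `min_templates` templates are returned.
    """
    counts = Counter()
    for paths in templates.values():
        kept = set()
        for raw in paths:
            parts = PurePosixPath(raw).parts
            if HOUSEKEEPING.intersection(parts):
                continue
            kept.add("/".join(parts))
        # Each template contributes at most once per path.
        counts.update(kept)
    return {path: n for path, n in counts.items() if n >= min_templates}


# Toy input: three hypothetical templates.
toy = {
    "a": ["README.md", "data/raw/.gitkeep", "src/main.py"],
    "b": ["README.md", ".git/config", "data/raw"],
    "c": ["README.md", "LICENSE", "data/raw"],
}
print(path_frequencies(toy))
```
            </code>
            <p>Collecting the paths of each template into a set before updating the counter ensures that a path is counted once per template, so the resulting value can be read directly as the number of templates that contain it.</p>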
            <p>The analysis was performed with the latest commits of all considered repositories as of the 12th of July 2024. The only exception was the &#x201c;drivendataorg/cookiecutter-data-science&#x201d; repository, for which we fetched version 
                <monospace>1.0</monospace> due to the non-standard parsing requirements of the latest commit.</p>
            <p>The code for this analysis is available online. See the &#x201c;Software availability statement&#x201d; section for more information.</p>
        </sec>
        <sec id="sec7">
            <title>Data interpretation</title>
            <p>The choice of how to structure projects is an issue universally shared by anyone who performs data analysis. This results in a plethora of different tools, folder hierarchies, accepted practices, and customs. To explore the most common practices, we inspected 87 different project templates available on GitHub and produced a frequency graph of shared files and folders, as shown in 
                <xref ref-type="fig" rid="f1">
Figure 1</xref>.</p>
            <fig fig-type="figure" id="f1" orientation="portrait" position="float">
                <label>
Figure 1. </label>
                <caption>
                    <title>Frequency graph of the structure of the 87 most starred data analysis project templates.</title>
                    <p>Only files present in at least three templates are shown, as retrieved from GitHub. The size and color intensity of the circle at the tip of each link are proportional to the frequency with which that file or folder is found in different project templates. Red text represents files, while blue text represents folders. The central dot of the root node was assigned an arbitrary size.</p>
                </caption>
                <graphic id="gr1" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/179903/c4bb6057-45fa-4810-b243-02efc562a1bb_figure1.gif"/>
            </fig>
            <p>By looking at this figure, we can point out common patterns in project structuring. However, it must be noted that templates influence each other. For example, many Python data science project templates seem to be modified versions of 
                <ext-link ext-link-type="uri" xlink:href="https://github.com/drivendataorg/cookiecutter-data-science">drivendataorg/cookiecutter-data-science</ext-link>, which has a very high number of stars and is, therefore, probably popular with the community.</p>
            <p>In any case, the most frequently found files are the 
                <monospace>README.md</monospace> file (with a frequency of 
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mfrac>
                            <mml:mn>77</mml:mn>
                            <mml:mn>87</mml:mn>
                        </mml:mfrac>
                        <mml:mo>&#x2243;</mml:mo>
                        <mml:mn>0.89</mml:mn>
                    </mml:math>
</inline-formula>) and the 
                <monospace>LICENSE</monospace> and 
                <monospace>LICENSE.md</monospace> files (
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mfrac>
                            <mml:mrow>
                                <mml:mn>46</mml:mn>
                                <mml:mo>+</mml:mo>
                                <mml:mn>3</mml:mn>
                            </mml:mrow>
                            <mml:mn>87</mml:mn>
                        </mml:mfrac>
                        <mml:mo>&#x2243;</mml:mo>
                        <mml:mn>0.56</mml:mn>
                    </mml:math>
</inline-formula>).</p>
            <p>The 
                <monospace>pyproject.toml</monospace> file at the top level of the repository, which marks the project as a Python package, is also prevalent (
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mfrac>
                            <mml:mn>16</mml:mn>
                            <mml:mn>87</mml:mn>
                        </mml:mfrac>
                        <mml:mo>&#x2243;</mml:mo>
                        <mml:mn>0.18</mml:mn>
                    </mml:math>
</inline-formula>). This is potentially due to the popular &#x201c;cookiecutter-data-science&#x201d; template mentioned before, and it highlights how projects following this template are tightly, perhaps exclusively, linked to Python. The predominance of Python-based projects is also evident from the presence of 
                <monospace>requirements.txt</monospace> (a file commonly used to list Python package dependencies), 
                <monospace>setup.py</monospace>, and 
                <monospace>setup.cfg</monospace> (older counterparts, now largely superseded by the 
                <monospace>pyproject.toml</monospace> file used to configure Python&#x2019;s build system). The 
                <monospace>project</monospace> folder at the top level of the templates is most likely the Python package (represented by the 
                <monospace>
__init__.py</monospace> file) that the 
                <monospace>pyproject.toml</monospace> file refers to (the name &#x201c;project&#x201d; is artificial, deriving from the default way that cookiecutter templates were cut).</p>
            <p>The presence of files related to the R programming language (the 
                <monospace>R</monospace> directory, 
                <monospace>.Rbuildignore</monospace>, 
                <monospace>README.Rmd</monospace>) reflects its usage in the data analysis field, although at a lower frequency than Python. The relatively low prevalence of the R programming language could be due to biases introduced by the search queries or to the overwhelming popularity of Python project templates, as well as the fact that the cookiecutter utility itself is written in Python.</p>
            <p>Community-relevant files such as 
                <monospace>CONTRIBUTING.md</monospace> (
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mfrac>
                            <mml:mn>8</mml:mn>
                            <mml:mn>87</mml:mn>
                        </mml:mfrac>
                        <mml:mo>&#x2243;</mml:mo>
                        <mml:mn>0.09</mml:mn>
                    </mml:math>
</inline-formula>) and 
                <monospace>
CODE_OF_CONDUCT.md</monospace> (
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mfrac>
                            <mml:mn>5</mml:mn>
                            <mml:mn>87</mml:mn>
                        </mml:mfrac>
                        <mml:mo>&#x2243;</mml:mo>
                        <mml:mn>0.06</mml:mn>
                    </mml:math>
</inline-formula>) show little prevalence in templates. This is also true for the 
                <monospace>CITATION.cff</monospace> (
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mfrac>
                            <mml:mn>4</mml:mn>
                            <mml:mn>87</mml:mn>
                        </mml:mfrac>
                        <mml:mo>&#x2243;</mml:mo>
                        <mml:mn>0.05</mml:mn>
                    </mml:math>
</inline-formula>) file, which is useful for machine-readable citation data.</p>
            <p>The 
                <monospace>src</monospace> (
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mfrac>
                            <mml:mn>31</mml:mn>
                            <mml:mn>87</mml:mn>
                        </mml:mfrac>
                        <mml:mo>&#x2243;</mml:mo>
                        <mml:mn>0.36</mml:mn>
                    </mml:math>
</inline-formula>), 
                <monospace>data</monospace> (
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mfrac>
                            <mml:mn>35</mml:mn>
                            <mml:mn>87</mml:mn>
                        </mml:mfrac>
                        <mml:mo>&#x2243;</mml:mo>
                        <mml:mn>0.40</mml:mn>
                    </mml:math>
</inline-formula>), and 
                <monospace>docs</monospace> (
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mfrac>
                            <mml:mn>28</mml:mn>
                            <mml:mn>87</mml:mn>
                        </mml:mfrac>
                        <mml:mo>&#x2243;</mml:mo>
                        <mml:mn>0.32</mml:mn>
                    </mml:math>
</inline-formula>) folders are highly represented, containing code, data, and project documentation, respectively. In particular, the 
                <monospace>data</monospace> directory frequently contains the 
                <monospace>raw</monospace>, 
                <monospace>processed</monospace>, 
                <monospace>interim</monospace>, and 
                <monospace>external</monospace> folders to host the different data types&#x2014;input, output, intermediate, and third party&#x2014;according to the structure promoted by the &#x201c;cookiecutter-data-science&#x201d; template. The prevalence of these sub-folders, however, is lower than the frequency of 
                <monospace>data</monospace> itself, which means that the presence of the 
                <monospace>data</monospace> folder is not solely due to that specific template. Interestingly, other templates include 
                <monospace>data</monospace> in the 
                <monospace>src</monospace> folder, mixing it with the analysis code. Other common folders present in the 
                <monospace>src</monospace> directory are also the ones promoted by &#x201c;cookiecutter-data-science,&#x201d; but again, as already noted for 
                <monospace>data</monospace>, their occurrence is lower than that of the parent folder, indicating that many different templates adopt 
                <monospace>src</monospace> as a folder name.</p>
            <p>Docker-related files are present, mostly at the top level of the project: 
                <monospace>Dockerfile</monospace> (
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mfrac>
                            <mml:mn>5</mml:mn>
                            <mml:mn>87</mml:mn>
                        </mml:mfrac>
                        <mml:mo>&#x2243;</mml:mo>
                        <mml:mn>0.06</mml:mn>
                    </mml:math>
</inline-formula>), 
                <monospace>.dockerignore</monospace> (
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mfrac>
                            <mml:mn>4</mml:mn>
                            <mml:mn>87</mml:mn>
                        </mml:mfrac>
                        <mml:mo>&#x2243;</mml:mo>
                        <mml:mn>0.05</mml:mn>
                    </mml:math>
</inline-formula>), and 
                <monospace>docker-compose.yml</monospace> or 
                <monospace>yaml</monospace> (
                <inline-formula>

                    <mml:math display="inline">
                        <mml:mfrac>
                            <mml:mrow>
                                <mml:mn>6</mml:mn>
                                <mml:mo>+</mml:mo>
                                <mml:mn>1</mml:mn>
                            </mml:mrow>
                            <mml:mn>87</mml:mn>
                        </mml:mfrac>
                        <mml:mo>&#x2243;</mml:mo>
                        <mml:mn>0.08</mml:mn>
                    </mml:math>
</inline-formula>). Docker-related files and folders are also present with sub-threshold frequencies in many other forms, often as directories with multiple Dockerfiles in different folders. The presence of the 
                <monospace>docker-compose.yml</monospace> file and docker subdirectories could be indicative of a common need to manage multiple execution environments&#x2014;that work together in the case of Docker Compose&#x2014;throughout the analysis.</p>
            <p>How sparsely each of the many tools is used can be appreciated from the number of unique files and folders across all templates. Of the 4195 different files and directories considered by this approach, the vast majority (3908, or 93.16%) were present in only one template. Looking at directories only, 783 of 864 (90.63%) were unique. These figures might be inflated by compiled libraries, files, and Git objects that were included in the analysis because our filtering did not remove them. However, we argue that manual filtering would not significantly affect this overwhelmingly high uniqueness.</p>
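            <p>The uniqueness figures above are ratios of the form &#x201c;paths found in exactly one template&#x201d; over &#x201c;all distinct paths&#x201d; (for example, 3908 over 4195). A minimal Python sketch of that calculation is given below; the function name <monospace>unique_share</monospace> and the toy counts are hypothetical and serve only to illustrate the computation, not to reproduce the analysis code.</p>
            <code language="python">
```python
def unique_share(counts):
    """Return the fraction of paths that occur in exactly one template.

    `counts` maps each distinct path to the number of templates containing it.
    """
    if not counts:
        raise ValueError("counts must not be empty")
    unique = sum(1 for n in counts.values() if n == 1)
    return unique / len(counts)


# Toy counts for four hypothetical paths.
toy_counts = {"README.md": 3, "LICENSE": 1, "src/main.py": 1, "data/raw": 2}
print(unique_share(toy_counts))  # 2 of the 4 paths are unique
```
            </code>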
            <p>The small overlap between templates reflects the fact that project structure is, by its nature, a matter of personal preference. Nevertheless, 
                <xref ref-type="fig" rid="f1">
Figure 1</xref> confirms that the core structure of the repositories tends to be similar. This is potentially due to both the epistemic need to share one&#x2019;s own work with others and the technical requirements of research tools, which cause the adoption of community standards either by choice (in the former case) or imposition (in the latter). For instance, the widespread presence of the 
                <monospace>README.md</monospace> file reflects a community standard shared by software developers, users, and researchers alike. This adoption is purely practical: it answers the need to share the description of the work with others in an obvious (&#x201c;please read me&#x201d;), logical (in the topmost layer of the project layout), and predictable (i.e., used by the wider community) manner.</p>
            <p>Borrowing a term from genetics, the 
                <monospace>README</monospace> file can be thought of as a &#x201c;housekeeping&#x201d; file: without it, the usefulness of a project is severely impaired. In this regard, another possible housekeeping candidate is the 
                <monospace>LICENSE</monospace> file. It is essential for collaborating with the community in the open-source paradigm and is thus commonly found in many software packages. Permission to reuse code is also essential in data analysis projects, both to allow reproducers to replay the initial work and to let other researchers build on previous knowledge. Incidentally, the common presence of the 
                <monospace>LICENSE</monospace> file in the project 
                <italic toggle="yes">template</italic> is interesting. This could be due to either apathy toward licensing issues, leading to picking a &#x201c;default license&#x201d; without many considerations, or a general feeling in individuals that one particular license fits their projects across the board.</p>
            <p>A potentially new housekeeping file that is not yet commonly found is the 
                <monospace>CITATION.cff</monospace> file. This file contains machine-readable citation metadata that can be used by both human and machine users to obtain such information, potentially automatically.</p>
        </sec>
        <sec id="sec8">
            <title>Intervention</title>
            <sec id="sec9">
                <title>Design principles</title>
                <p>The observations made above can all be considered when designing a more broadly applicable project template that may be used in a variety of contexts. To this end, it is helpful to conceptualize some core guiding principles that should be followed by all data analysis projects, particularly under the Open Science paradigm.</p>
                <p>Because data analysis projects often involve writing new software, a data analysis project structure requires support for both 
                    <italic toggle="yes">data analysis</italic> proper and 
                    <italic toggle="yes">software development.</italic> Software development methods fall outside the scope of this work, but some concepts are useful in the context of data analysis, particularly for 
                    <italic toggle="yes">ad hoc</italic> data analysis. For instance, many programming languages require specific folder layouts to create self-contained distributed software. To create a package with the Python programming language (
                    <ext-link ext-link-type="uri" xlink:href="https://python.org/">https://python.org/</ext-link>), a specific project layout must be followed.
                    <sup>
                        <xref ref-type="bibr" rid="ref9">9</xref>
                    </sup> This is also visible in 
                    <xref ref-type="fig" rid="f1">
Figure 1</xref>, with the presence of the 
                    <monospace>project</monospace> folder and many files specific to Python packages, crucially, in the locations required by Python build backends. Something similar occurs for many programming languages, such as R
                    <sup>
                        <xref ref-type="bibr" rid="ref10">10</xref>
                    </sup> and Rust,
                    <sup>
                        <xref ref-type="bibr" rid="ref11">11</xref>
                    </sup> among others.</p>
                <p>However, a researcher may not want to create self-contained, distributed software. Languages such as Python and R (
                    <ext-link ext-link-type="uri" xlink:href="https://www.r-project.org/">https://www.r-project.org/</ext-link>) can interpret and execute single-file scripts to achieve certain goals (i.e., &#x201c;scripting&#x201d;). As scripting is fast, convenient, and easy to perform, it is the most common method of data analysis. Scripting provides flexibility during the development process; however, this typically exacerbates the fragmentation of project structures. In particular, the execution environment becomes much more relevant: which packages are installed and at which versions, the order in which the scripts were read and executed, and, potentially, even 
                    <italic toggle="yes">which lines</italic> are (manually) run, and in what order, all become important to the success of the overall analysis.</p>
                <p>This increased flexibility is obviously useful for the research process, which requires the ability to change quickly to adapt to new findings, especially during hypothesis-generating &#x201c;exploratory&#x201d; research. The principles presented here aim to retain this essential requirement of adaptability but, at the same time, push for increased standardization of methods, avoiding the most common and dangerous pitfalls that can be encountered during data analysis.</p>
                <p>

                    <italic toggle="yes">1. Use a version control system</italic>
                </p>
                <p>At its core, software is a collection of text files, and this includes data analysis software. While producing code, it is important to record the differences between the different versions of these files. This is very useful, especially during the research process, to &#x201c;retrace our steps&#x201d; or to attempt new methodologies without the fear of losing any previous work. Such records are also useful as provenance information and potentially as proof of authorship, similar to what a laboratory notebook does for a &#x201c;wet-lab&#x201d; experimental researcher.</p>
                <p>Mature, well-established software exists for version control. An overwhelming majority of projects use 
                    <monospace>git</monospace> (
                    <ext-link ext-link-type="uri" xlink:href="https://git-scm.com/">https://git-scm.com/</ext-link>) for this purpose, although others exist. Platforms that integrate 
                    <monospace>git</monospace>, such as GitHub (
                    <ext-link ext-link-type="uri" xlink:href="https://github.com">github.com</ext-link>) and GitLab (
                    <ext-link ext-link-type="uri" xlink:href="https://gitlab.com">gitlab.com</ext-link>), are increasingly used for data analysis, both as a collaboration tool during the project and as a sharing platform afterwards.</p>
                <p>The first principle should therefore be this: 
                    <bold>use a version control system</bold>, such as 
                    <monospace>git</monospace>.</p>
                <p>A few practical observations stem from this principle:
                    <list list-type="bullet">
                        <list-item>
                            <label>&#x2022;</label>
                            <p>version control encourages good development practices, such as atomic commits, meaningful commit messages, and more, reducing the number of mistakes made while programming, and increasing efficiency by making debugging easier;</p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>version control discourages the upload of very large (binary) files; therefore, input and output data cannot be efficiently shared through such a system, incentivizing the deposit of data in online archives and, by extension, favoring the FAIRness of the manipulated data objects;</p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>code collaboration and collaboration techniques (such as &#x201c;GitHub Flow&#x201d; or &#x201c;trunk-based development&#x201d;
                                <sup>
                                    <xref ref-type="bibr" rid="ref12">12</xref>,
                                    <xref ref-type="bibr" rid="ref13">13</xref>
                                </sup>) can be useful to promote a more efficient development workflow in data analysis disciplines such as bioinformatics, especially in mid- to large-sized research groups;</p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>the core unit of a project should be a code repository, containing everything related to that project, from code to documentation, to configuration.</p>
                        </list-item>
                    </list>
                </p>
                <p>The use of a version control system also has implications for FAIRness. Leveraging remote platforms is fundamental to both Findability and Accessibility. Integrations of platforms such as GitHub with archives such as Zenodo (
                    <ext-link ext-link-type="uri" xlink:href="https://zenodo.org/">https://zenodo.org/</ext-link>) allow developers to easily archive their data analysis code for long-term preservation, promoting Accessibility, Findability, and Reusability.</p>
                <p>

                    <italic toggle="yes">2. Documentation is essential</italic>
                </p>
                <p>When working on a data analysis project, documentation is important for both the experimenter themselves and external users. Ideally, documentation presents the rationale, process, and potentially the result of the analysis to the user, together with practical steps on how to 
                    <italic toggle="yes">actually</italic> reproduce the work. As with all other aspects of data analysis, documentation takes many different forms, but is the most difficult thing to standardize for one simple reason: documentation is written by humans for human consumption. Documentation is therefore allowed high flexibility in structure, content, form, and delivery method.</p>
                <p>Even though rigid standardization is impossible, some guidelines on how to write effective documentation can still be drawn, often from best practices in the much wider world of open-source software. We have already highlighted the fundamental role of the 
                    <monospace>README</monospace> file and its widespread adoption. This file contains high-level information about the project and is usually the first, and perhaps only, documentation that all users encounter and read. It is therefore essential that core aspects of the project are delivered through the 
                    <monospace>README</monospace> file, such as the following:

                    <list list-type="bullet">
                        <list-item>
                            <label>&#x2022;</label>
                            <p>the aim of the project, in clear, accessible language;</p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>methods used to achieve such an aim (and/or a link to further reading material);</p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>a guide on how to run the analysis on the user&#x2019;s machine, potentially including information on hardware requirements, software requirements, container deployment methods, and every piece of information a human reproducer might need to execute the analysis;</p>
                        </list-item>
                        <list-item>
                            <label>&#x2022;</label>
                            <p>in an Open Science mindset, including information on how to collaborate on the project and the contact information of the authors is also desirable.</p>
                        </list-item>
                    </list>
                </p>
                <p>Other aspects of the project, such as a list of contributors, may also be included in the 
                    <monospace>README</monospace> file. The 
                    <monospace>README</monospace> file may also be called 
                    <monospace>DESCRIPTION</monospace>, although 
                    <monospace>README</monospace> is a much more widely accepted standard.</p>
                <p>Additional documentation can be added to the project in several ways (see 
                    <xref ref-type="fig" rid="f1">
Figure 1</xref>). A common documentation file is the 
                    <monospace>CONTRIBUTING</monospace> file, which contains information on how to contribute to the project, how authorship of any resulting publications will be assigned, and other community-level information. The 
                    <monospace>CODE_OF_CONDUCT</monospace> file contains guidelines and policies on how the project is managed, the expected conduct of project members, and potentially how issues arising between project members are resolved. Such a file can be important either for projects open to collaboration from the public or for large consortium-level projects. Another important documentation file in the Open Source community is the 
                    <monospace>CHANGELOG</monospace> file. It contains information on how the project changed over time and its salient milestones. For data analysis, it could be used to inform collaborators of important changes in the codebase, methodology, or any other news that might be important to announce and record. Additionally, together with the commit history, 
                    <monospace>CHANGELOG</monospace> files can be useful for tracking the provenance of the analysis, as we have already mentioned.</p>
                <p>A common place to store documentation is the top level of the project repository, but some templates use the 
                    <monospace>docs</monospace> folder, also following guidelines used in the Python community (for example, to support tools such as Sphinx
                    <sup>
                        <xref ref-type="bibr" rid="ref14">14</xref>
                    </sup>).</p>
                <p>We can conclude by reiterating that the second principle states that 
                    <bold>documentation is essential</bold>.</p>
                <p>

                    <italic toggle="yes">3. Be logical, obvious, and predictable</italic>
                </p>
                <p>When a project layout is logical, obvious, and predictable, human users can easily and quickly understand and interact with it.</p>
                <p>To be 
                    <italic toggle="yes">logical</italic>, a layout should categorize files based on their content and logically arrange them according to such categories. To be 
                    <italic toggle="yes">obvious</italic>, this categorization should make sense at a glance, even for non-experts. For instance, a folder named &#x201c;scripts&#x201d; should contain scripts (to be obvious) and only scripts (to be logical). To be 
                    <italic toggle="yes">predictable</italic>, a layout should adhere to community standards, so that it &#x201c;looks&#x201d; similar to other projects. This creates minimal friction when a user first encounters the project and desires to interact with it.</p>
                <p>This principle is also present in aspects of project structure other than layout. For instance, the structure of documentation can also benefit from the same principles but in a different context: logically arranged, obvious in structure, and similar to other projects.</p>
                <p>This might be the most difficult principle to follow because it largely depends on the community as a whole. For this reason, we hope that the analysis shown above, especially in 
                    <xref ref-type="fig" rid="f1">
Figure 1</xref>, and our proposed minimal structure (presented in the next sections) will be useful as guides to effectively implement this principle.</p>
                <p>An additional benefit of standardizing project structure is that it allows tools to leverage it, helping data analysts with repetitive tasks. For example, if all log files created during a workflow run were saved in the same folder, a tool could be configured to quickly delete them if the need arises. Were such files dispersed among input, output, and even source files, this task would be more difficult and error-prone.</p>
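                <p>The log-cleaning example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name <monospace>clean_logs</monospace> and the idea of a single dedicated log folder are hypothetical and do not refer to any existing tool.</p>

```python
from pathlib import Path


def clean_logs(log_dir: Path) -> int:
    """Delete every regular file below log_dir and return how many were removed.

    log_dir is a hypothetical folder that, by convention, holds only log files.
    """
    removed = 0
    # sorted() materializes the listing before anything is deleted
    for path in sorted(log_dir.rglob("*")):
        if path.is_file():
            path.unlink()
            removed += 1
    return removed
```

                <p>Because the layout guarantees that the folder contains only logs, the function needs no file-name filters and cannot touch input, output, or source files stored elsewhere.</p>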
                <p>We can summarize this third principle like this: 
                    <bold>be logical, obvious, and predictable</bold>.</p>
                <p>

                    <italic toggle="yes">4. Promote (easy) reproducibility</italic>
                </p>
                <p>Scientific reproducibility has been, and still is, a central issue, particularly in the field of biomedical research.
                    <sup>
                        <xref ref-type="bibr" rid="ref15">15</xref>,
                        <xref ref-type="bibr" rid="ref16">16</xref>
                    </sup> Scientific software developers bear a crucial responsibility toward the scientific community to create reproducible data analysis software.</p>
                <p>&#x201c;Reproducibility&#x201d; can be understood as the ability of a third-party user to understand the research issue investigated by the project and how it was addressed, and to execute the analysis again in practice, obtaining a result that is hopefully similar, and ideally identical, to that of the original author(s). This has two benefits: a reproducible analysis inspires more confidence in those who read and review it, and it is much easier to repurpose for similar data in the future.</p>
                <p>In the modern era, scientists are equipped with powerful tools to enable reproducibility, such as containerization and virtualization. While a discussion on how reproducibility can be achieved is beyond the scope of this article, the project layout can promote it, especially when all other principles presented here are respected. The adoption of reproducibility methods can be promoted by including obvious and easily implementable ones directly in the project layout.</p>
                <p>Workflow managers, such as Nextflow,
                    <sup>
                        <xref ref-type="bibr" rid="ref6">6</xref>
                    </sup> Snakemake,
                    <sup>
                        <xref ref-type="bibr" rid="ref7">7</xref>
                    </sup> and the Common Workflow Language (CWL),
                    <sup>
                        <xref ref-type="bibr" rid="ref8">8</xref>
                    </sup> are key tools to enable reproducibility. They allow a researcher to describe in detail the workflow used, from input files to the final output, offloading the burden of execution to the workflow manager. This allows greater transparency in the methodology used and even makes reproducibility a possibility in more complex data analysis scenarios. Additionally, some workflow managers are structured to promote the reusability of the analysis code, even in different architectures or high-performance computing environments.
                    <sup>
                        <xref ref-type="bibr" rid="ref8">8</xref>
                    </sup>
                </p>
                <p>Usage of these tools requires specific training. Tools like CWL are invaluable when performing complex, distributed data analysis tasks that require leveraging, for instance, distributed, high-performance computing infrastructures. In these cases, bioinformaticians and other data analysts would immediately see the inherent value of such technologies. However, untrained or low-skill analysts might not be aware of them or able to use them. As we already mentioned in the Introduction, simple tools with low entry barriers would render reproducibility more accessible.</p>
                <p>We conclude this section by stating the fourth and last principle: 
                    <bold>be (easily) reproducible</bold>.</p>
            </sec>
            <sec id="sec10">
                <title>Kerblam!</title>
                <p>We designed a very simple but powerful and flexible project layout together with a project management tool aimed at upholding the principles outlined in the previous section. We named this tool &#x201c;Kerblam!&#x201d;.</p>
                <p>In particular, Kerblam! encourages the use of version control with Git (principle 1), allows documentation to be written near the code that it describes (principle 2), invites the analyst to use a simple, logical, obvious, and predictable folder structure (principle 3), and makes using containerization tools such as Docker easier and less burdensome for the end user (principle 4). Additionally, it encourages the use of remote file storage (pushing for FAIR-er data) and allows researchers to create and publish readily executable container images so that others can re-run pipelines for reproducibility purposes (see 
                    <xref ref-type="fig" rid="f2">
Figure 2C</xref>).</p>
                <fig fig-type="figure" id="f2" orientation="portrait" position="float">
                    <label>
Figure 2. </label>
                    <caption>
                        <title>Salient concepts implemented by Kerblam!</title>
                        <p> (A): Basic skeleton of the proposed folder layout for a generic data analysis project associated with relevant Kerblam! commands. Folders are depicted in blue, while files are depicted in red. (B): Data is qualitatively divided into input, output, and temporary data. Input data can be further divided into input data remotely available (i.e., downloadable) and local-only data. The latter is &#x201c;precious&#x201d;, as it cannot be easily recreated. Other types of data are &#x201c;fragile&#x201d;, as they may be created again on the fly. (C): Overview of a generic Kerblam! workflow.</p>
                    </caption>
                    <graphic id="gr2" orientation="portrait" position="float" xlink:href="https://f1000research-files.f1000.com/manuscripts/179903/c4bb6057-45fa-4810-b243-02efc562a1bb_figure2.gif"/>
                </fig>
                <p>The most basic skeleton of the project layout implemented by Kerblam! is shown in 
                    <xref ref-type="fig" rid="f2">
Figure 2A</xref>. The 
                    <monospace>kerblam.toml</monospace> file contains configuration information for Kerblam! and marks the folder as a Kerblam-managed project. Kerblam! provides a number of utility features 
                    <italic toggle="yes">out of the box</italic> for projects that conform to the layout presented in 
                    <xref ref-type="fig" rid="f2">
Figure 2A</xref> or any other project structure after proper configuration.</p>
                <p>

                    <italic toggle="yes">Data management</italic>
                </p>
                <p>Kerblam! can be used to manage a project&#x2019;s data. It automatically distinguishes between input, output, and intermediate data, based on which folder the data files are saved in: the 
                    <monospace>data</monospace> folder contains intermediate data produced during the execution of the workflows, the 
                    <monospace>data/in</monospace> folder contains input data, and, similarly, 
                    <monospace>data/out</monospace> contains output data. Furthermore, the user can define in the 
                    <monospace>kerblam.toml</monospace> configuration which input data files can be fetched remotely and from which endpoint. This allows Kerblam! to both fetch these files upon request (
                    <monospace>kerblam fetch</monospace>) and distinguish between remotely available input files and local-only files. Local-only files are deemed &#x201c;precious&#x201d; because they cannot be recreated easily. All other data files are &#x201c;fragile,&#x201d; as they can be deleted without repercussion to save disk space (
                    <xref ref-type="fig" rid="f2">
Figure 2B</xref>).</p>
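                <p>The distinction between &#x201c;precious&#x201d; and &#x201c;fragile&#x201d; files can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used by Kerblam!: the function name <monospace>classify</monospace>, the <monospace>remote_names</monospace> argument, and the matching of remote files by file name alone are hypothetical simplifications.</p>

```python
from pathlib import Path


def classify(path: Path, data_dir: Path, remote_names: set[str]) -> str:
    """Return 'precious' or 'fragile' for a file stored under data_dir.

    remote_names is a hypothetical set of input file names that are declared
    as remotely available (for instance, in a configuration file).
    """
    relative = path.relative_to(data_dir)
    # Input files live in the "in" subfolder of the data folder.
    is_input = len(relative.parts) > 1 and relative.parts[0] == "in"
    # Only local-only input files cannot be recreated or downloaded again.
    if is_input and relative.name not in remote_names:
        return "precious"
    return "fragile"
```

                <p>Under these assumptions, <monospace>data/in/local.csv</monospace> is precious unless it is listed as remotely available, while anything in <monospace>data/out</monospace> or directly in <monospace>data</monospace> is fragile.</p>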
                <p>These distinctions between data types enable further functions of Kerblam!. 
                    <monospace>kerblam data</monospace> shows the number and size of files of all types to quickly check how much disk space is being used by the project. Fragile data can be deleted to save disk space with 
                    <monospace>kerblam data clean</monospace> and precious input data can be exported easily with 
                    <monospace>kerblam data pack</monospace>. 
                    <monospace>kerblam data pack</monospace> can also be used to export output data quickly to be shared with colleagues.</p>
                <p>Allowing Kerblam! to manage the project&#x2019;s data using these tools can offload several chores that are usually performed manually by the experimenter.</p>
                <p>

                    <italic toggle="yes">Workflow management</italic>
                </p>
                <p>Kerblam! can manage multiple workflows written for any workflow manager. At its core, it can spawn shell subprocesses that then execute a particular workflow manager, potentially one configured by the user. This allows Kerblam! to manage 
                    <italic toggle="yes">other</italic> workflow managers, making them transparent to the user and reachable through a single access point.</p>
                <p>Kerblam! can also act before and after the workflow manager proper to aid in several tasks. First, it can manage workflows in the 
                    <monospace>src/workflows</monospace> folder 
                    <italic toggle="yes">as if</italic> they were written in the root of the project. This is achieved by moving the workflow files from that folder to the root of the repository 
                    <italic toggle="yes">just before</italic> execution. This allows for slimmer workflows that do not crowd the root of the repository or conflict with each other, thus being more consistent.</p>
                <p>Second, it introduces the concept of 
                    <italic toggle="yes">input data profiles.</italic> Data profiles are best explained using an example. Imagine an input file, 
                    <monospace>input.csv</monospace>, containing some data to be analyzed. The experimenter may wish to test the workflows that they have written with a similar, but, say, smaller 
                    <monospace>test_input.csv</monospace>. Kerblam! allows hot-swapping of these files just before the execution of the workflow manager through profiles. By configuring them in the 
                    <monospace>kerblam.toml</monospace> file, the experimenter can execute a workflow manager (with 
                    <monospace>kerblam run</monospace>) specifying a profile: Kerblam! will then swap these two files just before and just after the execution of the workflow to seamlessly use exactly the same workflow but with different input data, in this case for testing purposes.</p>
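                <p>The swap-and-restore idea behind data profiles can be sketched with a Python context manager. This is a simplified illustration and not the mechanism implemented by Kerblam!: the function name <monospace>swapped_input</monospace> and the <monospace>.backup</monospace> suffix are assumptions made for the example.</p>

```python
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def swapped_input(original: Path, replacement: Path):
    """Temporarily put replacement in the place of original."""
    # Hypothetical suffix used to set the real input aside.
    backup = original.with_name(original.name + ".backup")
    original.rename(backup)       # move the real input out of the way
    replacement.rename(original)  # the test file takes over its name
    try:
        yield original
    finally:
        original.rename(replacement)  # give the test file its name back
        backup.rename(original)       # restore the real input
```

                <p>Code run inside <monospace>with swapped_input(Path("input.csv"), Path("test_input.csv")):</monospace> reads the smaller file under the usual name, and both files return to their original names afterwards, even if the workflow fails.</p>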
                <p>Kerblam! supports 
                    <italic toggle="yes">out of the box</italic> GNU 
                    <monospace>make</monospace> as its workflow manager of choice (
                    <ext-link ext-link-type="uri" xlink:href="https://www.gnu.org/software/make/">https://www.gnu.org/software/make/</ext-link>). Indeed, makefiles can be run directly through it, with no further configuration by the user. Any other workflow manager can be used by writing tiny shell wrappers with the proper invocation command. The range of workflow managers supported out of the box by Kerblam! could increase in the future.</p>
                <p>To run workflows, Kerblam! simply creates subshells, executes the proper commands to launch the workflow (or the user-configured shell script), and pipes the outputs of the subshell to the parent shell. This leaves complete control of the workflow to the workflow manager, making the Kerblam! integration seamless.</p>
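                <p>The general pattern of launching a workflow in a subshell and relaying its output can be sketched in Python with the standard <monospace>subprocess</monospace> module. This is a generic illustration, not the code of Kerblam!; the function name <monospace>run_workflow</monospace> and its arguments are hypothetical.</p>

```python
import subprocess
import sys


def run_workflow(command: str, workdir: str = ".") -> int:
    """Run command in a shell subprocess, relay its output, return its exit code."""
    process = subprocess.Popen(
        command,
        shell=True,                # let the shell interpret the command line
        cwd=workdir,               # run from the chosen working directory
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge error messages into the same stream
        text=True,
    )
    # Forward every line of the child process to the parent's standard output.
    for line in process.stdout:
        sys.stdout.write(line)
    return process.wait()
```

                <p>A call such as <monospace>run_workflow("make")</monospace> would hand control to GNU make and report its exit code; any other workflow manager could be launched in the same way by changing the command string.</p>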
                <p>

                    <italic toggle="yes">Containerization support</italic>
                </p>
                <p>Containers can be managed directly using Kerblam!. When container recipes are written in 
                    <monospace>src/dockerfiles</monospace>, Kerblam! can automatically execute workflow managers inside the containers, seamlessly mounting data paths and performing other housekeeping tasks before running the container. As previously stated, Kerblam! works &#x201c;above&#x201d; workflow managers. Therefore, the reader might question the usefulness of a containerization wrapper at the level of Kerblam! if the workflow manager of choice already supports it. This containerization feature is meant to be used when a workflow manager would be inappropriate. For instance, very small analyses might not warrant the increased development overhead of tools such as CWL. Kerblam! allows even shell scripts to be containerized, making even the smallest analyses reproducible.</p>
                <p>With these capabilities, Kerblam! promotes reproducibility and allows experienced and inexperienced users alike to perform even the simplest analyses in a reproducible manner.</p>
                <p>

                    <italic toggle="yes">Pipeline export</italic>
                </p>
                <p>Workflows managed by Kerblam! with an available container can be automatically exported in a reproducible package through 
                    <monospace>kerblam package</monospace>. This creates a preconfigured container image ready to be uploaded to a container registry of choice together with a compressed tarball containing information on how to (automatically) replay the input analysis: the &#x201c;replay package&#x201d;.</p>
                <p>The process automatically strips all unneeded project files, leading to small container images.</p>
                <p>The replay package can be inspected manually by a potential examiner and either re-run manually or through the convenience function 
                    <monospace>kerblam replay</monospace>, which recreates the same original project layout, fetches the input container, and runs the packaged workflow.</p>
                <p>

                    <italic toggle="yes">The Kerblam! analysis flow</italic>
                </p>
                <p>Kerblam! favors a very specific methodology when analyzing data, starting with an empty 
                    <monospace>git</monospace> repository. First, upload the input data to a remote archive (in theory, promoting FAIR-er data). Then, configure Kerblam! to download the input data and write code and workflows for its analysis, potentially in isolated containers or with specific workflow management tools. During development, periodically clean out intermediate and output files to check whether the correct execution of the analysis has become dependent on the local-only state. Finally, package the results and pipelines into the respective environments and share them with the wider public (e.g., as a GitHub release or in an archive such as Zenodo).</p>
                <p>We believe that this methodology is simple yet flexible and robust, allowing for high-quality analyses in a wide variety of scenarios. This workflow can also be thought of as &#x201c;stateless&#x201d;, as it attempts to steer the user away from storing and relying upon previous runs of the same workflows. This philosophy makes replication easier and less error-prone as, essentially, every workflow run is independent from any other.</p>
                <p>This &#x201c;analysis flow&#x201d; contrasts with other tools available online that attempt to solve or ameliorate the same issues that Kerblam! addresses. We have selected two of them to highlight these differences: &#x201c;data science operations&#x201d;, or DSO (
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/Boehringer-Ingelheim/dso">https://github.com/Boehringer-Ingelheim/dso</ext-link>) and &#x201c;Data Analysis Project management&#x201d;, or DAP (
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/molinerisLab/dap">https://github.com/molinerisLab/dap</ext-link>).</p>
                <p>DSO, similarly to Kerblam!, leverages other tools to provide its functions, such as Data Version Control (DVC, 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/iterative/dvc">https://github.com/iterative/dvc</ext-link>), Quarto (
                    <ext-link ext-link-type="uri" xlink:href="https://quarto.org/">https://quarto.org/</ext-link>), and Git (
                    <ext-link ext-link-type="uri" xlink:href="https://git-scm.com">https://git-scm.com</ext-link>), among others. It revolves around the concept of a &#x201c;stage&#x201d;, a single step of the analysis workflow, with predefined inputs and outputs. To isolate each step, it creates individual &#x201c;stage folders&#x201d; with input, output, source, and report folders as well as configuration files. By leveraging extensive configuration and DVC, the authors of DSO seem to prefer an analysis flow that preserves and version-controls both code and data, syncing them both with remote endpoints for collaboration. DVC, in particular, is aimed at the management of Machine Learning models. In such models, recording exactly which input data was used to train a specific version of a model is essential, so such an approach may be warranted. Indeed, DVC lets users &#x201c;commit&#x201d; and remotely store data similarly to code. However, we argue that for most data analysis tasks such versioning is not necessary, as the input data stays largely the same and the computational demands of the analysis are not so high as to require careful storage of all output artifacts.</p>
                <p>DAP focuses on the Python programming language, leveraging Anaconda (
                    <ext-link ext-link-type="uri" xlink:href="https://anaconda.org/">https://anaconda.org/</ext-link>) for environment management and Snakemake
                    <sup>
                        <xref ref-type="bibr" rid="ref7">7</xref>
                    </sup> for workflow management. DAP also uses 
                    <monospace>direnv</monospace> (
                    <ext-link ext-link-type="uri" xlink:href="https://direnv.net/">https://direnv.net/</ext-link>) to update the PATH variable with useful shortcuts. DAP creates two main folders to structure the project: a &#x201c;workflow&#x201d; folder, with code and configuration files, and a &#x201c;workspaces&#x201d; folder, with different project &#x201c;versions&#x201d;. DAP &#x201c;versions&#x201d; fulfill goals similar to those of DSO/DVC commits. Each version of the project is a different workspace folder, which contains symbolic links to a specific (sub)set of files in the &#x201c;workflow&#x201d; folder for that &#x201c;version&#x201d; of the project. A DAP project also uses Git (
                    <ext-link ext-link-type="uri" xlink:href="https://git-scm.com/">https://git-scm.com/</ext-link>) to provide overall version control. Like DSO, by laying out different versions of the same project next to each other, DAP promotes a &#x201c;state-based&#x201d; workflow of sorts. When changes to workflows are required, the whole project is copied over, edited, and preserved alongside older copies.</p>
                <p>While this is undoubtedly valuable if one often needs to refer to previous versions of the project, we believe it adds a layer of complexity to the file and folder structure. This is especially due to the extensive use of symbolic links, references and other opaque techniques that, while increasing development speed, might hinder the accessibility of the project to new collaborators or reproducers.</p>
            </sec>
            <sec id="sec1.2">
                <title>Issues and limitations</title>
                <p>Kerblam! is a command-line tool, so experience with command-line tools is required to use it. This is a considerable obstacle when approaching researchers not trained in computer science, as the real and perceived complexity of the command line can lead them to reject the practice altogether. The development of a graphical user interface, as well as interactive and accessible training materials, may help surmount this barrier.</p>
                <p>Similarly, using Git from the command line may pose the same difficulties. Fortunately, many graphical interfaces to Git exist. The Git website hosts a list of them, divided by operating system: 
                    <ext-link ext-link-type="uri" xlink:href="https://git-scm.com/downloads/guis">https://git-scm.com/downloads/guis</ext-link>.</p>
                <p>While external dependencies were kept to a minimum, the installation of Kerblam! still requires the user to independently install other tools, such as Git and Docker. This might add friction when beginning to use the tool. Methods to completely overcome these issues are unclear.</p>
                <p>

                    <italic toggle="yes">Availability</italic>
                </p>
                <p>Kerblam! is free and open-source software available on GitHub at 
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/MrHedmad/kerblam">https://github.com/MrHedmad/kerblam</ext-link>. It is written in Rust and may be compiled to support GNU/Linux-flavored operating systems, macOS, and Windows. Alternatively, GitHub releases provide precompiled artifacts for both operating systems. The full documentation of Kerblam! is available at 
                    <ext-link ext-link-type="uri" xlink:href="https://kerblam.dev/">https://kerblam.dev</ext-link>. Active support for Kerblam! and its development are guaranteed for the foreseeable future.</p>
            </sec>
        </sec>
        <sec id="sec11" sec-type="conclusions">
            <title>Conclusions</title>
            <p>Structuring data analysis projects is a personal matter that is heavily dependent on the preferences of the individuals who conduct the analysis. Nevertheless, best practices arise and can be identified in this fragmented landscape.</p>
            <p>In this study, we aimed to provide such guidelines and include a robust tool to leverage the regularity of such a standardized layout. As the proposed layout is, for all intents and purposes, largely arbitrary, Kerblam! can be configured to operate in any layout.</p>
            <p>Through these and potentially future standardization efforts, tools such as containerization and workflow managers can become more mainstream and even routine, leading to an overall more mature and scientifically rigorous way to analyze data of any kind.</p>
        </sec>
        <sec id="sec13">
            <title>Authors&#x2019; contributions</title>
            <p>Conceptualization: L.V., L.M., and F.A.R.; Software: L.V.; Methodology: L.V. and F.A.R.; Funding Acquisition: L.M.; Writing - Original Draft Preparation: L.V., L.M., and F.A.R.; Supervision: L.M. and F.A.R.</p>
        </sec>
    </body>
    <back>
        <sec id="sec16" sec-type="data-availability">
            <title>Code and Data availability statement</title>
            <p>The raw data fetched by the analysis of project templates (e.g., list of fetched repositories, detected frequencies) are available on Zenodo.</p>
            <p>Zenodo: Archival data for Kerblam Project structure. 
                <ext-link ext-link-type="uri" xlink:href="https://zenodo.org/doi/10.5281/zenodo.13627213">10.5281/zenodo.13627213</ext-link>.
                <sup>
                    <xref ref-type="bibr" rid="ref17">17</xref>
                </sup>
            </p>
            <p>This project contains the following underlying data:
                <list list-type="bullet">
                    <list-item>
                        <label>&#x2022;</label>
                        <p>

                            <monospace>data_cookies.json</monospace>. The list of repositories as fetched by the GitHub CLI utility (version 2.55.0) on 2024-07-12 with the command 
                            <monospace>gh search repos cookiecutter data --sort stars --json stargazersCount,url --visibility public -L 50</monospace>
                        </p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>

                            <monospace>data_generic.json</monospace>. The same as above, with the command 
                            <monospace>gh search repos research project template --sort stars --json stargazersCount,url --visibility public -L 50</monospace>
                        </p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>

                            <monospace>repos.tar.gz</monospace>. The resulting (fetched) repositories</p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>

                            <monospace>data.json</monospace>. The combination of 
                            <monospace>data_cookies.json</monospace> and 
                            <monospace>data_generic.json</monospace>
                        </p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>

                            <monospace>plot.png</monospace> and 
                            <monospace>plot.pdf</monospace>. The plots generated with the information in 
                            <monospace>results.csv</monospace>
                        </p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>

                            <monospace>results.csv</monospace>. The result of the folder and file enumeration of the repositories in the 
                            <monospace>repos.tar.gz</monospace> file, with the following columns:

                            <list list-type="bullet">
                                <list-item>
                                    <label>&#x25cb;</label>
                                    <p>

                                        <monospace>path</monospace>. The full path from the root of the 
                                        <monospace>repos</monospace> directory to the file</p>
                                </list-item>
                                <list-item>
                                    <label>&#x25cb;</label>
                                    <p>

                                        <monospace>count</monospace>. The frequency of this specific item in the various repositories</p>
                                </list-item>
                                <list-item>
                                    <label>&#x25cb;</label>
                                    <p>

                                        <monospace>types</monospace>. The type of the item: either &#x201c;directory&#x201d; for directories or &#x201c;file&#x201d; for files</p>
                                </list-item>
                            </list>
                        </p>
                    </list-item>
                </list>
            </p>
            <p>Data are available under the terms of the 
                <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">Creative Commons Zero &#x201c;No rights reserved&#x201d; data waiver</ext-link> (CC0 1.0 Public domain dedication).</p>
        </sec>
        <sec id="sec12">
            <title>Software availability statement</title>
            <p>The code for the analysis is available on GitHub and is archived on Zenodo.
                <list list-type="bullet">
                    <list-item>
                        <label>&#x2022;</label>
                        <p>Source code available from: 
                            <ext-link ext-link-type="uri" xlink:href="https://github.com/MrHedmad/ds_project_structure">https://github.com/MrHedmad/ds_project_structure</ext-link>
                        </p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>Archived source code available from: 
                            <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5281/zenodo.14611208">https://doi.org/10.5281/zenodo.14611208</ext-link> Luca &#x201c;Hedmad&#x201d; Visentin. (2024). MrHedmad/ds_project_structure: Project Structure version 2 (Version 2). Zenodo.</p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>License: MIT License.</p>
                    </list-item>
                </list>
            </p>
            <p>Kerblam! is available on GitHub and is archived on Zenodo at every release.
                <list list-type="bullet">
                    <list-item>
                        <label>&#x2022;</label>
                        <p>Source code available from: 
                            <ext-link ext-link-type="uri" xlink:href="https://github.com/MrHedmad/kerblam">https://github.com/MrHedmad/kerblam</ext-link> and 
                            <ext-link ext-link-type="uri" xlink:href="https://kerblam.dev/">https://kerblam.dev/</ext-link>
                        </p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>Archived source code available from: 
                            <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.5281/zenodo.14528820">https://doi.org/10.5281/zenodo.14528820</ext-link> Visentin, L. (2024). Kerblam! (v1.2.0). Zenodo.</p>
                    </list-item>
                    <list-item>
                        <label>&#x2022;</label>
                        <p>License: MIT License.</p>
                    </list-item>
                </list>
            </p>
        </sec>
        <ack>
            <title>Acknowledgements</title>
            <p>The authors acknowledge that a version of this manuscript was deposited as a preprint in the arXiv repository with DOI: 
                <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.48550/arXiv.2410.10513">https://doi.org/10.48550/arXiv.2410.10513</ext-link>.</p>
        </ack>
        <ref-list>
            <title>Bibliography</title>
            <ref id="ref1">
                <label>1</label>
                <mixed-citation publication-type="journal">
                    <collab>European Commission: Directorate-General for Research and Innovation,</collab>
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Baker</surname>
                            <given-names>L</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Cristea</surname>
                            <given-names>I</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Errington</surname>
                            <given-names>T</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <source>

                        <italic toggle="yes">Reproducibility of scientific results in the EU &#x2013; Scoping report.</italic>
</source>Lusoli, W. editor. Publications Office,<year>2020</year>.
                    <pub-id pub-id-type="doi">10.2777/341654</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref2">
                <label>2</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Baker</surname>
                            <given-names>M</given-names>
                        </name>
</person-group>:
                    <article-title>1,500 scientists lift the lid on reproducibility.</article-title>
                    <source>

                        <italic toggle="yes">Nature.</italic>
</source>
                    <year>2016</year>;<volume>533</volume>:<fpage>452</fpage>&#x2013;<lpage>454</lpage>.
                    <pub-id pub-id-type="doi">10.1038/533452a</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref3">
                <label>3</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Bertram</surname>
                            <given-names>MG</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Sundin</surname>
                            <given-names>J</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Roche</surname>
                            <given-names>DG</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Open Science.</article-title>
                    <source>

                        <italic toggle="yes">Curr. Biol.</italic>
</source>
                    <year>August 7, 2023</year>;<volume>33</volume>(<issue>15</issue>):<fpage>R792</fpage>&#x2013;<lpage>R797</lpage>.
                    <pub-id pub-id-type="doi">10.1016/j.cub.2023.05.036</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref4">
                <label>4</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Wilkinson</surname>
                            <given-names>MD</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Dumontier</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Aalbersberg</surname>
                            <given-names>IJJ</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>The FAIR Guiding Principles for Scientific Data Management and Stewardship.</article-title>
                    <source>

                        <italic toggle="yes">Sci. Data.</italic>
</source>
                    <year>March 15, 2016</year>;<volume>3</volume>(<issue>1</issue>):<fpage>160018</fpage>.
                    <pub-id pub-id-type="pmid">26978244</pub-id>
                    <pub-id pub-id-type="doi">10.1038/sdata.2016.18</pub-id>
                    <pub-id pub-id-type="pmcid">PMC4792175</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref5">
                <label>5</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Barker</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Chue Hong</surname>
                            <given-names>NP</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Katz</surname>
                            <given-names>DS</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Introducing the FAIR Principles for Research Software.</article-title>
                    <source>

                        <italic toggle="yes">Sci. Data.</italic>
</source>
                    <year>October 14, 2022</year>;<volume>9</volume>(<issue>1</issue>):<fpage>622</fpage>.
                    <pub-id pub-id-type="pmid">36241754</pub-id>
                    <pub-id pub-id-type="doi">10.1038/s41597-022-01710-x</pub-id>
                    <pub-id pub-id-type="pmcid">PMC9562067</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref6">
                <label>6</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Di Tommaso</surname>
                            <given-names>P</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Chatzou</surname>
                            <given-names>M</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Floden</surname>
                            <given-names>EW</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Nextflow Enables Reproducible Computational Workflows.</article-title>
                    <source>

                        <italic toggle="yes">Nat. Biotechnol.</italic>
</source>
                    <year>April 2017</year>;<volume>35</volume>(<issue>4</issue>):<fpage>316</fpage>&#x2013;<lpage>319</lpage>.
                    <pub-id pub-id-type="pmid">28398311</pub-id>
                    <pub-id pub-id-type="doi">10.1038/nbt.3820</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref7">
                <label>7</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>M&#x00f6;lder</surname>
                            <given-names>F</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Jablonski</surname>
                            <given-names>KP</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Letcher</surname>
                            <given-names>B</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Sustainable Data Analysis with Snakemake.</article-title>
                    <source>

                        <italic toggle="yes">F1000Res.</italic>
</source>
                    <year>April 19, 2021</year>;<volume>10</volume>.
                    <pub-id pub-id-type="doi">10.12688/f1000research.29032.2</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref8">
                <label>8</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Crusoe</surname>
                            <given-names>MR</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Abeln</surname>
                            <given-names>S</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Iosup</surname>
                            <given-names>A</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Methods Included: Standardizing Computational Reuse and Portability with the Common Workflow Language.</article-title>
                    <source>

                        <italic toggle="yes">Commun. ACM.</italic>
</source>
                    <year>May 20, 2022</year>;<volume>65</volume>(<issue>6</issue>):<fpage>54</fpage>&#x2013;<lpage>63</lpage>.
                    <pub-id pub-id-type="doi">10.1145/3486897</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref9">
                <label>9</label>
                <mixed-citation publication-type="other">
                    <article-title>Packaging Python Projects - Python Packaging User Guide.</article-title>Accessed August 2, 2024.
                    <ext-link ext-link-type="uri" xlink:href="https://packaging.python.org/en/latest/tutorials/packaging-projects/">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref10">
                <label>10</label>
                <mixed-citation publication-type="other">
                    <article-title>3 Package Structure and State &#x2013; R Packages (2e).</article-title>Accessed August 2, 2024.
                    <ext-link ext-link-type="uri" xlink:href="https://r-pkgs.org/structure.html">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref11">
                <label>11</label>
                <mixed-citation publication-type="other">
                    <article-title>Creating a New Package - The Cargo Book.</article-title>Accessed August 2, 2024.
                    <ext-link ext-link-type="uri" xlink:href="https://doc.rust-lang.org/cargo/guide/creating-a-new-project.html">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref12">
                <label>12</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Appleton</surname>
                            <given-names>B</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Berczuk</surname>
                            <given-names>S</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Cabrera</surname>
                            <given-names>R</given-names>
                        </name>
</person-group>:
                    <article-title>Streamed Lines: Branching Patterns for Parallel Software Development.</article-title>
                    <year>1998</year>.
                    <ext-link ext-link-type="uri" xlink:href="https://www.semanticscholar.org/paper/Streamed-Lines%3A-Branching-Patterns-for-Parallel-Appleton-Berczuk/1d92503a8080927bf91f122569afd69816df120b">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref13">
                <label>13</label>
                <mixed-citation publication-type="other">
                    <collab>GitHub Docs</collab>:
                    <article-title>GitHub Flow.</article-title>Accessed August 8, 2024.
                    <ext-link ext-link-type="uri" xlink:href="https://docs.github.com/en/get-started/using-github/github-flow">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref14">
                <label>14</label>
                <mixed-citation publication-type="other">
                    <article-title>Sphinx &#x2014; Sphinx Documentation.</article-title>Accessed August 8, 2024.
                    <ext-link ext-link-type="uri" xlink:href="https://www.sphinx-doc.org/en/master/">Reference Source</ext-link>
                </mixed-citation>
            </ref>
            <ref id="ref15">
                <label>15</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Errington</surname>
                            <given-names>TM</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Denis</surname>
                            <given-names>A</given-names>
                        </name>

                        <name name-style="western">
                            <surname>Perfito</surname>
                            <given-names>N</given-names>
                        </name>

                        <etal/>
</person-group>:
                    <article-title>Challenges for Assessing Replicability in Preclinical Cancer Biology.</article-title>Edited by Peter Rodgers and Eduardo Franco.
                    <source>

                        <italic toggle="yes">eLife.</italic>
</source>
                    <year>December 7, 2021</year>;<volume>10</volume>:<fpage>e67995</fpage>.
                    <pub-id pub-id-type="pmid">34874008</pub-id>
                    <pub-id pub-id-type="doi">10.7554/eLife.67995</pub-id>
                    <pub-id pub-id-type="pmcid">PMC8651289</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref16">
                <label>16</label>
                <mixed-citation publication-type="journal">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Ioannidis</surname>
                            <given-names>JPA</given-names>
                        </name>
</person-group>:
                    <article-title>Why Most Published Research Findings Are False.</article-title>
                    <source>

                        <italic toggle="yes">PLoS Med.</italic>
</source>
                    <year>August 30, 2005</year>;<volume>2</volume>(<issue>8</issue>):<fpage>e124</fpage>.
                    <pub-id pub-id-type="pmid">16060722</pub-id>
                    <pub-id pub-id-type="doi">10.1371/journal.pmed.0020124</pub-id>
                    <pub-id pub-id-type="pmcid">PMC1182327</pub-id>
                </mixed-citation>
            </ref>
            <ref id="ref17">
                <label>17</label>
                <mixed-citation publication-type="other">
                    <person-group person-group-type="author">

                        <name name-style="western">
                            <surname>Visentin</surname>
                            <given-names>L</given-names>
                        </name>
</person-group>:
                    <article-title>Archival Data for Kerblam Project Structure.</article-title>
                    <source>

                        <italic toggle="yes">Zenodo.</italic>
</source>
                    <year>September 2, 2024</year>.
                    <pub-id pub-id-type="doi">10.5281/zenodo.13627214</pub-id>
                </mixed-citation>
            </ref>
        </ref-list>
    </back>
    <sub-article article-type="reviewer-report" id="report365311">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.172753.r365311</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Silverstein</surname>
                        <given-names>Priya</given-names>
                    </name>
                    <xref ref-type="aff" rid="r365311a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0003-0095-339X</uri>
                </contrib>
                <aff id="r365311a1">
                    <label>1</label>Ashland University, Ashland, USA</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>4</day>
                <month>3</month>
                <year>2025</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2025 Silverstein P</copyright-statement>
                <copyright-year>2025</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport365311" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.157325.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>Thank you for the opportunity to review this article submitted to F1000 on how data analysis projects are structured. I enjoyed reading the article -- it was well written, concise, and mostly understandable to me despite being well outside of my area of expertise.</p>
            <p> </p>
            <p> To contextualise my review, I am a psychologist and metascientist who has experience sharing analysis code and data for a variety of projects, but not much confidence that I have been doing this well, which is one of the reasons I was excited to review this paper! I have used GitHub, but infrequently, and I'll admit I do not use it in my own usual data analysis management workflow. For these reasons, I think I am unable to give a very detailed or helpful review, due to some of the article going "over my head". However, I hope that a different perspective might be in some way helpful, as hearing from those outside of our fields sometimes is. I only have small comments for potential improvement or extension, which the authors can take or leave.&#x00a0;</p>
            <p> </p>
            <p> 1. In the introduction, it would be nice to set up a little more why structuring data analysis projects well is so important, and that one of the reasons is low reproducibility rates. You do of course talk about this in your recommendations, but I think it might also be helpful to lead with the low reproducibility rates in different fields. I'm not sure if any papers have looked at&#x00a0;
                <italic>where</italic>&#x00a0;results fail to reproduce, but from trying to reproduce many results myself I know that bad data analysis file structure can definitely contribute to this (for example, if it is not clear which are the final files to be used and where and in which order code should be run).&#x00a0;</p>
            <p> </p>
            <p> 2. I expected a little more discussion of&#x00a0;
                <italic>broadly</italic>&#x00a0;the different ways to structure data analysis projects for sharing and their relative pros and cons, for example the difference between sharing your data and code in the same folder (i.e. all the data files that one script calls on together), vs. having data and code stored separately. I think it's also interesting to think about how this intersects with skill level -- because I am a relative novice with R, I feel more comfortable sharing all data and code that will be used together in the same folder, as then I can just have a note in the README that tells people to download the whole folder together and then set the working directory to the source file location when running.&#x00a0;</p>
            <p> </p>
            <p> 3. In general I think this article is pitched to a higher skill level than I possess, so the authors may think about whether this is intentional or whether they are hoping to help and change the behaviours of a wide variety of data analysts. It is OK either way, but will inform the direction the article takes.&#x00a0;</p>
            <p> </p>
            <p> 4. Related to this, I wonder whether the authors wish to keep the article very git-focused (which again, is OK if intentional) or whether there could be a brief mention of other types/solutions for version control and their relative barriers of entry.</p>
            <p> </p>
            <p> 5. Lastly, I think the introduction of Kerblam currently feels a little strange -- it is neither the focus of the article that everything else leads up to, nor is it an afterthought, it is kind of in-between. Perhaps the authors could think about how they would like Kerblam to fit into the article and make some adjustments accordingly.</p>
            <p>Is the rationale for developing the new method (or application) clearly explained?</p>
            <p>Yes</p>
            <p>Is the description of the method technically sound?</p>
            <p>Yes</p>
            <p>Are the conclusions about the method and its performance adequately supported by the findings presented in the article?</p>
            <p>Yes</p>
            <p>If any results are presented, are all the source data underlying the results available to ensure full reproducibility?</p>
            <p>Yes</p>
            <p>Are sufficient details provided to allow replication of the method development and its use by others?</p>
            <p>Yes</p>
            <p>Reviewer Expertise:</p>
            <p>Psychology, Metascience</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
        <sub-article article-type="response" id="comment13628-365311">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Visentin</surname>
                            <given-names>Luca</given-names>
                        </name>
                        <aff>Department of Life Sciences, University of Turin, Turin, Italy</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>25</day>
                    <month>3</month>
                    <year>2025</year>
                </pub-date>
            </front-stub>
            <body>
                <p>We sincerely thank you for your review. We are glad that you found our article relatively accessible, as that was one of our goals when writing it. We are also very happy to receive comments from a researcher such as yourself, as perspectives from non-bioinformaticians on bioinformatics-oriented tools are rare to find.</p>
                <p> </p>
                <p> We have taken your suggestions into consideration as we drafted a second version of the article. To be more specific, we have expanded the introduction with data from slightly old but popular articles regarding low reproducibility rates. Although it makes intuitive sense that project structure would play a role in reproducibility rates (or at least that's what we believe), we were unable to find specific numerical data on the contribution of each reproducibility risk factor to the overall reproducibility rate. We therefore cannot reliably conclude how much of an impact project structure standardization would have on reproducibility rates. It would indeed be an interesting prospect for future research.</p>
                <p> </p>
                <p> Related to your second point, our analysis combines different project structures, but does not try to cluster them into "kinds". We believe that this would be a requirement to comment on the pros and cons of different types of project structures. Originally, we considered submitting questionnaires to data analysts and researchers in general regarding the structure they use when dealing with data. This would have been very interesting and would have led to a more detailed and informed discussion about what kinds of project structures exist and how they are used, as well as why researchers use them. However, we determined we did not have the means to reach a large enough sample of researchers from many different backgrounds to do this effectively, so we relied on GitHub-available data instead, although we believe it is an inferior approach.</p>
                <p> </p>
                <p> Regarding skill level, we have included an explicit reference to it in the fourth principle. We hoped that the article would be interesting to both novice and expert data analysts, and it was written with this in mind. We believe that your insightful comments are a sign that we managed to do that, but we have nonetheless added an explicit reference to our target audience in the introduction. We have also added a section, "Issues and limitations", discussing these barriers of entry.</p>
                <p> </p>
                <p> The reliance of Kerblam! on Git is indeed by choice: other version control systems exist, but none are as widespread as Git, so we believe that consideration of other VCSs here would be out of scope. In any case, Kerblam! does not <italic>enforce</italic> the usage of Git, it only encourages it. If a user wishes to use, for instance, Mercurial as their VCS of choice, they may do so and still use Kerblam! to its full capabilities. In the "Issues and limitations" section, we mention graphical user interfaces to Git that a user may use instead of the terminal.</p>
                <p> </p>
                <p> We agree that the introduction to Kerblam! was a bit rough and detached from the rest of the article. We have reworded it significantly in Version 2. We hope it makes the transition a little more fluid.</p>
            </body>
        </sub-article>
    </sub-article>
    <sub-article article-type="reviewer-report" id="report363657">
        <front-stub>
            <article-id pub-id-type="doi">10.5256/f1000research.172753.r363657</article-id>
            <title-group>
                <article-title>Reviewer response for version 1</article-title>
            </title-group>
            <contrib-group>
                <contrib contrib-type="author">
                    <name>
                        <surname>Cujba</surname>
                        <given-names>Rodica</given-names>
                    </name>
                    <xref ref-type="aff" rid="r363657a1">1</xref>
                    <role>Referee</role>
                    <uri content-type="orcid">https://orcid.org/0000-0002-7982-6184</uri>
                </contrib>
                <aff id="r363657a1">
                    <label>1</label>Technical University of Moldova, Chi&#x0219;in&#x0103;u, Moldova</aff>
            </contrib-group>
            <author-notes>
                <fn fn-type="conflict">
                    <p>
                        <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                </fn>
            </author-notes>
            <pub-date pub-type="epub">
                <day>20</day>
                <month>2</month>
                <year>2025</year>
            </pub-date>
            <permissions>
                <copyright-statement>Copyright: &#x00a9; 2025 Cujba R</copyright-statement>
                <copyright-year>2025</copyright-year>
                <license xlink:href="https://creativecommons.org/licenses/by/4.0/">
                    <license-p>This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</license-p>
                </license>
            </permissions>
            <related-article ext-link-type="doi" id="relatedArticleReport363657" related-article-type="peer-reviewed-article" xlink:href="10.12688/f1000research.157325.1"/>
            <custom-meta-group>
                <custom-meta>
                    <meta-name>recommendation</meta-name>
                    <meta-value>approve</meta-value>
                </custom-meta>
            </custom-meta-group>
        </front-stub>
        <body>
            <p>The paper focuses on standardizing the structure of data analysis projects to enhance reproducibility and transparency within the framework of Open Science. The study reveals a significant lack of consistency in how researchers organize their data analysis projects. This inconsistency impedes reproducibility and complicates the ability of others to understand and utilize the work. The authors emphasize the importance of aligning data analysis project structures with the FAIR principles (Findable, Accessible, Interoperable, and Reusable). They argue that standardization fosters these principles. The researchers analyzed a large number of data analysis project templates from GitHub to identify common practices and patterns. A frequency analysis of files and folders was performed.</p>
            <p> The paper emphasizes the importance of reproducibility and transparency in Open Science and introduces a practical tool (Kerblam!) to support these goals.</p>
            <p> The study proposes four design principles for creating well-structured data analysis projects: Use a version control system; Documentation is essential; Be logical, obvious, and predictable; Promote (easy) reproducibility. These principles offer a framework for creating more accessible and reusable data analysis projects. Kerblam! supports these design principles and thus simplifies data management, workflow execution, and containerization.</p>
            <p> The rationale for developing Kerblam! is clearly explained. The paper effectively establishes the need for a standardized approach to data analysis project structuring by highlighting the significant inconsistencies in current practices. The authors explicitly link this lack of standardization to challenges in reproducibility, transparency, and collaboration, which are all fundamental principles of Open Science. They then introduce Kerblam! as a direct response to these issues&#x2014;a tool designed to promote the four design principles they outlined. The connection between the identified problem and the proposed solution is therefore well-established and convincing.</p>
            <p> The technical description of the Kerblam! method is largely sound, but could benefit from some clarifications and expansions.</p>
            <p> While the paper describes the 
                <italic>functionality</italic> of Kerblam!, more details regarding its 
                <italic>implementation</italic> would strengthen the technical soundness. This could include: 1) Specifics about the Rust implementation (e.g., use of libraries, design patterns); 2) Explanation of how shell subprocesses are managed to interact with different workflow managers.</p>
            <p> Discussion of security aspects (e.g., access control, data encryption) is missing. This would be beneficial for a tool intended for scientific collaborations.</p>
            <p> The paper provides enough information to understand the core concepts and rationale behind Kerblam! and to reproduce the 
                <italic>analysis</italic> of existing project structures.</p>
            <p> The authors state that the raw data underlying their results are available on Zenodo. This is a significant strength, as it allows for full reproducibility of the analysis of existing project templates. The Zenodo link allows a reader to verify the analysis steps and the resulting frequency graph presented in the paper. This commitment to data availability significantly enhances the reproducibility of the research.</p>
            <p> The conclusions are generally supported by the findings, but a comparison with other existing tools or methods for data analysis project management would further solidify the conclusions about Kerblam!'s unique contributions.</p>
            <p>Is the rationale for developing the new method (or application) clearly explained?</p>
            <p>Yes</p>
            <p>Is the description of the method technically sound?</p>
            <p>Partly</p>
            <p>Are the conclusions about the method and its performance adequately supported by the findings presented in the article?</p>
            <p>Partly</p>
            <p>If any results are presented, are all the source data underlying the results available to ensure full reproducibility?</p>
            <p>Yes</p>
            <p>Are sufficient details provided to allow replication of the method development and its use by others?</p>
            <p>Yes</p>
            <p>Reviewer Expertise:</p>
            <p>Open Science, Information Technologies</p>
            <p>I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.</p>
        </body>
        <sub-article article-type="response" id="comment13627-363657">
            <front-stub>
                <contrib-group>
                    <contrib contrib-type="author">
                        <name>
                            <surname>Visentin</surname>
                            <given-names>Luca</given-names>
                        </name>
                        <aff>Department of Life Sciences, University of Turin, Turin, Italy</aff>
                    </contrib>
                </contrib-group>
                <author-notes>
                    <fn fn-type="conflict">
                        <p>
                            <bold>Competing interests: </bold>No competing interests were disclosed.</p>
                    </fn>
                </author-notes>
                <pub-date pub-type="epub">
                    <day>25</day>
                    <month>3</month>
                    <year>2025</year>
                </pub-date>
            </front-stub>
            <body>
                <p>We thank you for your detailed review. We also appreciate that you highlighted the reproducibility of our analysis, and we hope Kerblam! was useful to you in checking it. We have now submitted a second version of the article, hoping to address at least some of your concerns.</p>
                <p> </p>
                <p> While the article is aimed at researchers somewhat familiar with technical terminology, we wanted to keep it as accessible as possible to a wide range of skill levels. For this reason, we have kept discussions of the implementation details to a minimum in the article. In any case, Kerblam! is open source, and the contributing guide (
                    <ext-link ext-link-type="uri" xlink:href="https://github.com/MrHedmad/kerblam/blob/main/CONTRIBUTING.md">https://github.com/MrHedmad/kerblam/blob/main/CONTRIBUTING.md</ext-link>) can be used as an entry point to dive into the implementation. The code is as modular as possible, and we hope that docstrings and various comments make it understandable.</p>
                <p> In the second version of the paper, we have added a short paragraph specifying how Kerblam! handles starting other workflow managers. We are still, however, not including too much information in this regard due to the aforementioned accessibility concerns.</p>
                <p> </p>
                <p> Regarding data security and encryption, Kerblam! is an exclusively locally-executed tool that <bold><italic>is</italic></bold> specifically used to manage workflows, so encryption is outside of its scope. In any case, cryptographic signing of Kerblam! replay packages is a feature we are currently exploring, so that users can be sure that they are replaying the original replay package.</p>
                <p> It is indeed true that our original paper did not include any comparison with existing tools. In the second version, we have added a new section comparing Kerblam! with similar tools, highlighting its relative strengths and weaknesses.</p>
                <p> </p>
                <p> We hope you will enjoy the second version of the paper, and we are open to any technical suggestions either here or directly in the Kerblam! repository.</p>
            </body>
        </sub-article>
    </sub-article>
</article>
