Brief Report

biogitflow: development workflow protocols for bioinformatics pipelines with git and GitLab

[version 1; peer review: 1 approved with reservations]

Choumouss Kamoun^1-4^*, Julien Roméjon^1-4^*, Henri de Soyres^1-4^, Apolline Gallois^1-4^, Elodie Girard^1-4^, Philippe Hupé^1-5^

^* Equal contributors

PUBLISHED 22 Jun 2020

¹ Institut Curie, Paris, F-75005, France
² U900, Inserm, Paris, F-75005, France
³ PSL Research University, Paris, France
⁴ Mines Paris Tech, Fontainebleau, F-77305, France
⁵ UMR144, CNRS, Paris, F-75005, France

Choumouss Kamoun
Roles: Conceptualization, Validation, Writing – Review & Editing

Julien Roméjon
Roles: Conceptualization, Validation, Writing – Review & Editing

Henri de Soyres
Roles: Conceptualization, Writing – Review & Editing

Apolline Gallois
Roles: Validation, Writing – Review & Editing

Elodie Girard
Roles: Validation, Writing – Review & Editing

Philippe Hupé
Roles: Conceptualization, Supervision, Validation, Writing – Original Draft Preparation, Writing – Review & Editing

This article is included in the Bioinformatics gateway.

This article is included in the Software and Hardware Engineering gateway.

Abstract

The use of a bioinformatics pipeline as a tool to support diagnostic and theranostic decisions in the healthcare process requires the definition of detailed development workflow guidelines. Therefore, we implemented protocols that describe step-by-step all the command lines and actions that the developers have to follow. Our protocols capitalize on two powerful and widely used tools: git and GitLab. They address two use cases: a nominal mode to develop a new feature in the bioinformatics pipeline and a hotfix mode to correct a bug that occurred in the production environment. The protocols are available as comprehensive documentation at https://biogitflow.readthedocs.io and the main concepts, steps and principles are presented in this report.

Keywords

development workflow, bioinformatics pipeline, quality management, healthcare

Corresponding authors: Choumouss Kamoun, Philippe Hupé

Competing interests: No competing interests were disclosed.

Grant information: This work was supported by the project French Bioinformatic Network for NGS Cancer Diagnosis, funded by the Institut National du Cancer (INCA).
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2020 Kamoun C et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Kamoun C, Roméjon J, de Soyres H et al. biogitflow: development workflow protocols for bioinformatics pipelines with git and GitLab [version 1; peer review: 1 approved with reservations]. F1000Research 2020, 9:632 (https://doi.org/10.12688/f1000research.24714.1) First published: 22 Jun 2020, 9:632 (https://doi.org/10.12688/f1000research.24714.1) Latest published: 19 Feb 2021, 9:632 (https://doi.org/10.12688/f1000research.24714.3)

Introduction

The importance of best practices for bioinformatics analysis and software has been highlighted by many authors, who proposed valuable guidelines for better reproducibility and traceability (Georgeson et al., 2019; Hamburg & Mogyorodi, 2019; Noble, 2009; Sandve et al., 2013). However, reproducing an analysis is often a challenge (Kim et al., 2018). Therefore, guidelines have to be promoted by computational labs and bioinformatics core facilities to unite bioinformaticians around common practices for software development. This is especially essential when the bioinformatics pipeline is used in precision medicine to support diagnostic and theranostic activities. Indeed, many hospitals worldwide use High-Throughput Sequencing in routine clinical practice to guide therapeutic decisions. This new era of genomic medicine is even promoted in healthcare systems at a large scale through national initiatives in several countries, such as France, the USA, the UK and Australia (Stark et al., 2019).

This evolution has brought bioinformatics to the forefront of the healthcare process, with the bioinformatics pipeline being a fully integrated component of the clinical decision. Compliance with healthcare laboratory accreditation standards and regulations is thus required for the development and operation of the bioinformatics pipeline. In this context, several authors (Hume et al., 2019; INCa, 2018; Matthijs et al., 2016; Roy et al., 2018) recommended in their guidelines that an appropriate code repository tool should be used to enforce version control. Version control makes it possible to track the different releases of the bioinformatics pipeline, their validation and the developers involved in their implementation. This must be integrated in a quality management process with standardized protocols approved for diagnostic use.

There is a wide ecosystem of tools (see Perez-Riverol et al., 2016; Riesch et al., 2020; Zolkifli et al., 2018) that can support the implementation of such protocols. Among them, we capitalized on git for the version-control system, as it has become a very popular tool in the software engineering community, and on GitLab for the repository manager, as it can be self-hosted. Both tools offer a large set of powerful functionalities that can be used and combined in multiple ways, which makes them more complex to use. It is thus essential to formalize through detailed guidelines how to use them on a daily basis for the development and deployment of the bioinformatics pipeline. Therefore, we implemented the set of protocols biogitflow (Kamoun et al., 2020), which describes step-by-step all the command lines and actions to be performed by the developers. Comprehensive documentation available at https://biogitflow.readthedocs.io provides all the technical details. We introduce here the main concepts, steps and principles.

Methods

Development workflow

The development workflow consists of four main steps. The first step is software development, which includes code writing and testing. The second step is acceptance testing by the end-users, who validate that the expected functionalities have been correctly implemented. The third step is to check the installation process and run new tests to ensure that the bioinformatics pipeline can be installed in an environment similar to the one used in production. This additional round of testing allows bugs to be corrected before the bioinformatics pipeline is installed in production. Finally, the fourth step is production deployment. During this last step, the new release of the bioinformatics pipeline with the new functionalities is installed in the production environment.

Multiple deployment environments

The four steps have to be performed in separate environments in order to i) ensure that a stable version can be used in production, ii) allow the end-users to validate a new release without any impact on either the version used in production or the version under development and iii) allow the software developers to add new functionalities and modify the code without any impact on the end-users who are validating a new version and/or using the current version in production. Therefore, three deployment environments are used: a development environment (dev), a validation environment for pre-production (valid) and a production environment (prod). Besides these environments, each developer can deploy the bioinformatics pipeline in a local workspace to test the new functionalities that have been developed.

Version control and branching model

Our biogitflow protocols capitalized on the gitflow model proposed by Driessen (2010). The management of the different bioinformatics pipeline versions is based on four different git branches. Depending on the context and the step of the development workflow the following branches on the remote repository are used:

devel contains the code of the current version under development.

release contains the code with both candidate and official releases. The release branch comes from the devel branch.

hotfix is a mirror of the release branch and is used to patch the code that is in production. If a critical bug occurs in production, this branch is used to fix the issue.

master is only used to archive the code from the release and hotfix branches. This branch is not used for development.

Among these four branches, the release, hotfix and master are protected branches such that only the developers with the Maintainer role in the GitLab repository can directly push code on these branches (the other users have to use the GitLab merge request functionality to push on protected branches).

In addition to these four branches, the developer can create local branches to i) implement a new feature (the branch is named with a prefix feature plus any meaningful suffix) and ii) fix a bug in production or resolve a problem during the third step (these branches are either named with the prefix release or hotfix depending on the use case plus some relevant contextual information).
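The naming conventions described above lend themselves to a mechanical check. The following Python sketch is purely illustrative: the hyphen used as separator between prefix and suffix, the function name classify_branch and the example branch names are assumptions and are not taken from the biogitflow documentation.

```python
# Illustrative sketch only. The "-" separator and every identifier below
# are assumptions, not part of the biogitflow protocols.
REMOTE_BRANCHES = ("devel", "release", "hotfix", "master")
LOCAL_PREFIXES = ("feature", "release", "hotfix")


def classify_branch(name: str) -> str:
    """Return 'remote', 'local-<prefix>' or 'unknown' for a branch name."""
    if name in REMOTE_BRANCHES:
        return "remote"
    for prefix in LOCAL_PREFIXES:
        # A local branch is a known prefix followed by a non-empty suffix.
        if name.startswith(prefix + "-") and len(name) > len(prefix) + 1:
            return "local-" + prefix
    return "unknown"


if __name__ == "__main__":
    # Hypothetical branch names used only for demonstration.
    for branch in ("devel", "feature-vcf-export", "hotfix-issue-12", "my-branch"):
        print(branch, "->", classify_branch(branch))
```

A check of this kind could, for instance, be called from a local script before a branch is created, so that badly named branches are caught early.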

User roles and permissions

Two levels of roles and permissions are considered. Firstly, in the GitLab remote repository, a developer is assigned either the role Developer (D), to push developments only on the non-protected branches, or the role Maintainer (M), to push developments on any branch. Secondly, in the deployment environments, a user is granted either the permission UD, to deploy in the dev environment only, or the permission UVP, to deploy in the valid and prod environments (any user with the UVP permission is also granted the UD permission).
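The two permission levels can be summarized as two small lookup rules. The Python sketch below is a simplified model under stated assumptions: the function and constant names are hypothetical, and in practice the push rules are enforced by GitLab itself and the deployment rights by the deployment environments, not by application code.

```python
# Simplified model of the two permission levels described in the text.
# All identifiers are hypothetical and serve illustration only.
PROTECTED_BRANCHES = frozenset({"release", "hotfix", "master"})

# UVP includes the UD permission, so it also covers the dev environment.
DEPLOY_RIGHTS = {
    "UD": frozenset({"dev"}),
    "UVP": frozenset({"dev", "valid", "prod"}),
}


def can_push_directly(role: str, branch: str) -> bool:
    """Tell whether a GitLab role may push to a branch without a merge request."""
    if role == "Maintainer":
        return True
    if role == "Developer":
        return branch not in PROTECTED_BRANCHES
    raise ValueError(f"unknown GitLab role: {role!r}")


def can_deploy(permission: str, environment: str) -> bool:
    """Tell whether a deployment permission covers the given environment."""
    if permission not in DEPLOY_RIGHTS:
        raise ValueError(f"unknown deployment permission: {permission!r}")
    return environment in DEPLOY_RIGHTS[permission]
```

Keeping the two levels in separate functions mirrors the text: the role in the repository and the permission in the deployment environments are assigned independently.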

Bioinformatics pipeline testing

Testing the bioinformatics pipeline occurs during all the steps of the development workflow. It involves all stakeholders, including the developers, the users in charge of the deployments and the end-users. Testing should be done as often as possible to identify and resolve any unexpected behavior early in the development process. The International Software Testing Qualifications Board provides comprehensive guidelines for software testing (Hamburg & Mogyorodi, 2019). Among them, we strongly recommend including the following types of testing. Unit testing confirms that a piece of code provides the expected output for given input parameters. Integration testing checks that the interfaces of the different bioinformatics pipeline components are consistent with each other and that their integration allows the expected functionalities to be performed. System (or functional) testing validates that the full bioinformatics pipeline works and meets the end-users’ needs. Regression testing checks that the correction of bugs or the development of new functionalities did not introduce defects in unchanged areas of the bioinformatics pipeline. In addition, we highlight the importance of operational testing to check that the bioinformatics pipeline provides the expected results on a reference dataset (golden dataset) in the production environment. This testing is systematically launched prior to any new analysis by the bioinformatics pipeline in the production environment. This ensures that the results are reproducible as long as the exact same version of the bioinformatics pipeline is used.
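As a minimal illustration of unit testing, the Python sketch below tests a small helper with the standard unittest module. The helper gc_content is a hypothetical stand-in for a piece of pipeline code; it does not come from the biogitflow documentation or from any specific pipeline.

```python
import unittest


def gc_content(sequence: str) -> float:
    """Hypothetical pipeline helper: fraction of G and C bases in a sequence."""
    if not sequence:
        raise ValueError("empty sequence")
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


class GcContentTest(unittest.TestCase):
    """Unit tests: expected output for given input parameters."""

    def test_expected_output(self):
        self.assertAlmostEqual(gc_content("ATGC"), 0.5)

    def test_lowercase_input(self):
        self.assertAlmostEqual(gc_content("ggcc"), 1.0)

    def test_empty_input_is_rejected(self):
        with self.assertRaises(ValueError):
            gc_content("")
```

Saved in a file, these tests can be run with python -m unittest followed by the file name. Kept alongside the code and rerun after each change, the same tests also serve as a simple form of regression testing.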

Results

Two use cases have been addressed with dedicated protocols that detail the different actions step-by-step. The first one is the nominal mode (Figure 1), in which a new feature is implemented to improve the bioinformatics pipeline based on requirements and expectations from the end-users who operate it in their daily clinical practice. The second one is the hotfix mode (Figure 2), in which a bug in the bioinformatics pipeline in the production environment hampers the delivery of the results for the patients. The biogitflow documentation provides all the technical details on how to configure the remote repository in GitLab to develop a new bioinformatics pipeline and on how to use git and GitLab, depending on the roles and permissions, throughout the development workflow for both use cases.

Figure 1. biogitflow protocol for the nominal mode.

This graphical synopsis provides an overview of the different steps of the development workflow when a new feature is implemented in the bioinformatics pipeline according to the role and permissions of the developer. It describes the different git actions that are performed on the different branches and the deployments in the different environments.

Figure 2. biogitflow protocol for the hotfix mode.

This graphical synopsis provides an overview of the different steps of the development workflow when a bug occurred in the production environment in the bioinformatics pipeline according to the role and permissions of the developer. It describes the different git actions that are performed on the different branches and the deployments in the different environments.

Whatever the use case, the bioinformatics pipeline is deployed in production and operated by the end-users once it has successfully passed all the testing. Whenever new patient data have to be analyzed to deliver results for diagnostic or theranostic purposes, the bioinformatics pipeline can be launched by the end-users only if the operational testing reproduces the results from the golden dataset; otherwise it is blocked by an internal control mechanism. In this case, the developers investigate the reason for the failure and have to fix it using the hotfix mode.
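The blocking behavior can be pictured with the following Python sketch. It is not the internal control mechanism used by the authors, which is not described in this report: it assumes, for illustration, that the golden dataset results are a set of files compared byte for byte through SHA-256 digests, and all function names are hypothetical. Pipelines whose outputs contain timestamps or other run-specific content would need a more tolerant comparison.

```python
import hashlib
from pathlib import Path
from typing import Callable, List, Mapping, TypeVar

T = TypeVar("T")


class OperationalTestError(RuntimeError):
    """Raised when the golden dataset results are not reproduced."""


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def operational_test(output_dir: Path, expected: Mapping[str, str]) -> List[str]:
    """Return the output files that are missing or differ from the reference.

    `expected` maps each file name to its reference digest (an assumption
    about how the golden results are stored).
    """
    failures = []
    for name, digest in expected.items():
        produced = output_dir / name
        if not produced.is_file() or sha256_of(produced) != digest:
            failures.append(name)
    return failures


def launch_if_reproducible(
    output_dir: Path, expected: Mapping[str, str], run_analysis: Callable[[], T]
) -> T:
    """Run the analysis only if the operational test passes, else block it."""
    failures = operational_test(output_dir, expected)
    if failures:
        raise OperationalTestError(
            "golden dataset mismatch: " + ", ".join(sorted(failures))
        )
    return run_analysis()
```

Raising an exception rather than returning a status makes it hard to launch an analysis by accident when the check fails, which matches the intent that a failure must be investigated and fixed through the hotfix mode.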

As a real example, we provide a biogitflow template repository that is available to any registered user at GitLab.com. The user can fork the repository into a personal workspace and apply the nominal mode (Figure 1) and the hotfix mode (Figure 2) following the proposed protocols.

Discussion

Writing these protocols required feedback from the developers in order to decide on the best way to capitalize on git and GitLab, taking into account our internal constraints and organization. The formalization required by the laboratory accreditation agency before a bioinformatics pipeline can be used in healthcare was a great catalyst to implement these protocols. It was a major step in improving our daily practice of software engineering towards better quality management and was a source of motivation to change our work habits.

A corollary of these protocols was the necessity to train the users involved in the development workflow, in order to demonstrate to the laboratory accreditation agency that the technical protocols are known, understood and mastered by the developers. Therefore, an internal accreditation process for our developers was implemented. The training consists of a series of exercises that cover all the different use cases, roles and permissions of the protocols. The exercises are first carried out by the trainee together with the tutor, then the trainee performs the exercises alone twice, and finally the trainee performs the exercises alone but in the presence of the tutor, who asks additional questions to ensure that the tricky parts of the protocols are understood. To be endorsed, the trainee has to perform the exercises fluently and fully autonomously. All the actions of the training process are tracked in a dedicated GitLab remote repository. Endorsement is valid for one year and can be extended if the developer still masters the protocols. This is assessed during a dedicated yearly interview with the tutor.

Conclusions

We described two protocols that are used in daily practice to develop bioinformatics pipelines compliant with the accreditation standards for healthcare. While some choices were made to match our internal constraints and organization, the protocols can be easily transposed to other institutes, as the main concepts, steps and principles hold in most contexts. In line with the Deming wheel principle of continuous improvement in quality management, the protocols we described are intended to evolve in order to address future requirements, handle new risks and integrate new tools or technical frameworks that improve the efficiency of the overall process. We strongly encourage the promotion of such protocols not only for healthcare activities but also for research activities.

Data availability

Underlying data

All data underlying the results are available as part of the article and no additional source data are required.

Extended data

biogitflow documentation is available at: https://biogitflow.readthedocs.io/.

Archived source of the biogitflow documentation at time of publication: https://doi.org/10.5281/zenodo.3885463

License: CeCILL Version 2.1.

Author contributions

CK, HDS, JR and PH conceived the protocols. PH wrote the protocols and the manuscript that were reviewed by AG, CK, EG, HDS and JR. PH supervised the study.

Acknowledgments

We thank the French Bioinformatic Network for NGS Cancer Diagnosis and the Institut National du Cancer (INCA) for the fruitful discussions.

References

Driessen V: A successful git branching model. 2010.
Georgeson P, Syme A, Sloggett C, et al.: Bionitio: demonstrating and facilitating best practices for bioinformatics command-line software. GigaScience. 2019; 8(9): giz109.
Hamburg M, Mogyorodi G: Standard glossary of terms used in software testing. Technical report, International Software Testing Qualifications Board. 2019.
Hume S, Nelson TN, Speevak M, et al.: CCMG practice guideline: laboratory guidelines for next-generation sequencing. J Med Genet. 2019; 56(12): 792–800.
INCa: Conception de logiciels pour le diagnostic clinique par séquençage haut-débit. Technical report, collection Outils pour la pratique. 2018.
Kamoun C, Roméjon J, De Soyres H, et al.: bioinfo-pf-curie/biogitflow: version-1.0.1. Zenodo. 2020.
Kim YM, Poline JB, Dumas G: Experimenting with reproducibility: a case study of robustness in bioinformatics. GigaScience. 2018; 7(7): giy077.
Matthijs G, Souche E, Alders M, et al.: Guidelines for diagnostic next-generation sequencing. Eur J Hum Genet. 2016; 24(10): 1515.
Noble WS: A quick guide to organizing computational biology projects. PLoS Comput Biol. 2009; 5(7): e1000424.
Perez-Riverol Y, Gatto L, Wang R, et al.: Ten simple rules for taking advantage of git and github. PLoS Comput Biol. 2016; 12(7): e1004947.
Riesch M, Nguyen TD, Jirauschek C: bertha: Project skeleton for scientific software. PLoS One. 2020; 15(3): e0230557.
Roy S, Coldren C, Karunamurthy A, et al.: Standards and guidelines for validating next-generation sequencing bioinformatics pipelines: A joint recommendation of the association for molecular pathology and the college of american pathologists. J Mol Diagn. 2018; 20(1): 4–27.
Sandve GK, Nekrutenko A, Taylor J, et al.: Ten simple rules for reproducible computational research. PLoS Comput Biol. 2013; 9(10): e1003285.
Stark Z, Dolman L, Manolio TA, et al.: Integrating genomics into healthcare: A global responsibility. Am J Hum Genet. 2019; 104(1): 13–20.
Zolkifli NN, Ngah A, Deraman A: Version control system: A review. Procedia Comput Sci. 2018; 135: 408–415.

Article Versions (3)

version 3

Revised

Published: 19 Feb 2021, 9:632

https://doi.org/10.12688/f1000research.24714.3

version 2

Revised

Published: 08 Dec 2020, 9:632

https://doi.org/10.12688/f1000research.24714.2

version 1

Published: 22 Jun 2020, 9:632

https://doi.org/10.12688/f1000research.24714.1

Open Peer Review

Reviewer Report 22 Oct 2020

Bernard Pope, Melbourne Bioinformatics, University of Melbourne, Parkville, Vic, Australia

Approved with Reservations

https://doi.org/10.5256/f1000research.27261.r72478

Motivated by laboratory accreditation requirements for the development of bioinformatics workflows related to high throughput sequencing in a clinical setting, the authors have developed a software development process formalisation based on gitflow, a popular model in mainstream software engineering practice.

The formalisation prescribes a branching, testing, merging and release process, with set roles and permissions for developers, and methods for managing the development of new features (nominal mode) and bug fixing (hotfix mode).

In addition to describing the biogitflow workflow, the paper also contains a short section on testing bioinformatics software pipelines, however, this is largely a recapitulation of standard testing terminology, and does not appear to be particularly bioinformatics-specific.

The paper has a brief discussion about the motivation for adopting the workflow and some unsubstantiated claims about improving their software engineering practices. It also has a brief discussion about the need to train developers in the workflow and formal accreditation.

Two large figures are included in the paper that attempt to illustrate the proposed workflow in nominal and hotfix modes.

While the topic of this paper is interesting and highly relevant to bioinformatics software developers working in areas of clinical application, where process accreditation is a key concern, there are a number of areas where I think the paper could be improved:

1. The current version of the paper is lacking in critical analysis of the proposed workflow. One does not have to look far to find criticism (positive and negative) of gitflow. However, this paper does not provide any significant discussion of the pros and cons of the approach. One of the main criticisms of gitflow is that it is overly complex (for some software development projects) and thus prone to error and misapplication. This complexity is illustrated in Figures 1 and 2 and the discussion in the paper about the need to train developers in its use. Given the apparent regimented nature of the process, it seems plausible that some of the complexity could be avoided by automation. Are there any useful tools that can help developers apply the model successfully and avoid the common criticisms of gitflow? If developers are finding it difficult to follow, especially the "tricky" parts then that may be a sign of limitations in the process or limitations in the tools supporting its application.
2. It is not clear how significantly biogitflow differs from gitflow. In what sense does the proposed workflow differ from established software engineering practices?
3. What alternative competing development models exist?
4. How can the authors be sure that the workflow is in fact making a positive difference to their software development?
5. It would be useful to clarify the intended audience of the paper. How can they apply and benefit from this work?
6. The discussion of the provided biogitflow templates sounds useful, but the discussion in the paper is unclear. Perhaps this could be expanded upon?
7. The paper ends with "We strongly encourage the promotion of such protocols not only for the healthcare activities but also for research activities." However, due to the aforementioned lack of critical analysis, it is not clear from the paper why this is true.

Is the work clearly and accurately presented and does it cite the current literature?
Yes

Is the study design appropriate and is the work technically sound?
Yes

Are sufficient details of methods and analysis provided to allow replication by others?
Yes

If applicable, is the statistical analysis and its interpretation appropriate?
Not applicable

Are all the source data underlying the results available to ensure full reproducibility?
Yes

Are the conclusions drawn adequately supported by the results?
Partly

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Bioinformatics. Genomics. Cancer. Software Development. Pipelines.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Author Response 04 Dec 2020

Philippe Hupé, Institut Curie, Paris, F-75005, France

First, we would like to thank the reviewer. We are very grateful for his time, contribution and very valuable comments that significantly helped to improve the article.

You will find below a detailed answer to the different issues that we have addressed in the revised manuscript.

Best regards,

Choumouss Kamoun, Julien Roméjon and Philippe Hupé

Detailed response to the comments of reviewer 1

> Motivated by laboratory accreditation requirements for the development of bioinformatics workflows related to high throughput sequencing in a clinical setting, the authors have developed a software development process formalisation based on gitflow, a popular model in mainstream software engineering practice.

> The formalisation prescribes a branching, testing, merging and release process, with set roles and permissions for developers, and methods for managing the development of new features (nominal mode) and bug fixing (hotfix mode).

> In addition to describing the biogitflow workflow, the paper also contains a short section on testing bioinformatics software pipelines, however, this is largely a recapitulation of standard testing terminology, and does not appear to be particularly bioinformatics-specific.

> The paper has a brief discussion about the motivation for adopting the workflow and some unsubstantiated claims about improving their software engineering practices. It also has a brief discussion about the need to train developers in the workflow and formal accreditation.

> Two large figures are included in the paper that attempt to illustrate the proposed workflow in nominal and hotfix modes.

> While the topic of this paper is interesting and highly relevant to bioinformatics software developers working in areas of clinical application, where process accreditation is a key concern, there are a number of areas where I think the paper could be improved:

> 1. The current version of the paper is lacking in critical analysis of the proposed workflow. One does not have to look far to find criticism (positive and negative) of gitflow. However, this paper does not provide any significant discussion of the pros and cons of the approach. One of the main criticisms of gitflow is that it is overly complex (for some software development projects) and thus prone to error and misapplication. This complexity is illustrated in Figures 1 and 2 and the discussion in the paper about the need to train developers in its use. Given the apparent regimented nature of the process, it seems plausible that some of the complexity could be avoided by automation. Are there any useful tools that can help developers apply the model successfully and avoid the common criticisms of gitflow? If developers are finding it difficult to follow, especially the "tricky" parts then that may be a sign of limitations in the process or limitations in the tools supporting its application.

We fully agree that the protocols are complex and that it is not straightforward to master them. Of note, we pointed out in our reply to your comment number 5 that the protocols are mainly relevant in the context of production, where data have to be processed on the fly, which is obviously a challenge, especially when the bioinformatics pipelines have to deal with a very high throughput of data. The complexity of the workflow simply reflects the reality of the development and deployment cycle in the context of production, as we intend to promote industrialized processes for software implementation in order to offer better service, better reactivity and better traceability to end-users. Formalism was thus required, and this is the reason why we wrote the biogitflow documentation.

You are also fully right when you say that it is "prone to error and misapplication" and that some "complexity could be avoided by automation". This is clearly an improvement in line with the Deming wheel, as the next step would be to capitalize on additional functionalities of git and GitLab, for example git hooks, GitLab CI/CD (continuous integration, continuous deployment) and GitLab Operations features.

Anyway, as you have mentioned, this critical analysis and limitations were absolutely not highlighted. We, therefore, added a new paragraph at the end of the discussion to explain limitations and foreseen improvements with automation.

> 2. It is not clear how significantly biogitflow differs from gitflow. In what sense does the proposed workflow differ from established software engineering practices?

This comparison was clearly missing. We added a new paragraph at the beginning of the discussion that explains the main differences.

> 3. What alternative competing development models exist?

Alternatives exist, such as GitHub flow and GitLab flow. We added them to the section "Version control and branching model".

> 4. How can the authors be sure that the workflow is in fact making a positive difference to their software development?

This is a good question that is difficult to answer. Answering it would require an experimental design that compares two or more workflows on objective performance metrics and then tests whether the metrics differ significantly. We did not perform such a study.

However, we can say that the benefit of the biogitflow protocols was two-fold: 1/ "harmonization of the development practices" and 2/ the "constructive and positive pressure" for "better quality". These points have been added at the end of the second paragraph of the discussion.
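
The kind of comparison described in this reply, which was not performed, could be sketched as a permutation test on one performance metric collected under two workflows. The function name, the choice of test and the metric are assumptions for illustration, and the numbers below are made-up placeholders, not measurements.

```python
import random
from statistics import mean


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of means.

    Returns the observed difference mean(a) - mean(b) and an estimated p-value.
    """
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign the pooled values to the two groups.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up placeholder values (for example, days from merge request to release).
# They are NOT measurements from any real project.
workflow_a = [5.0, 7.0, 6.0, 8.0, 5.5]
workflow_b = [4.0, 6.5, 5.0, 7.0, 6.0]

observed, p_value = permutation_test(workflow_a, workflow_b)
print(f"difference in means: {observed:.2f}, estimated p-value: {p_value:.3f}")
```

A permutation test is used here only because it makes few distributional assumptions; a real study would also need a defined metric, enough projects or releases per workflow, and control of confounding factors such as team and pipeline size.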

> 5. It would be useful to clarify the intended audience of the paper. How can they apply and benefit from this work?

Indeed, the target audience was not clearly mentioned. Therefore, we added in the introduction that the protocols are appropriate for software developers in the "context of production" to "support core facilities" that generate "high-throughput of data". The protocols aim at promoting better "industrialized processes". The "context of production" is recalled in the conclusion as well.
Obviously, a bioinformatician involved in custom development to analyze data in a research project would not straightforwardly benefit from our work: for this use case, it might be sufficient to track the developments with git using a single branch. However, we encourage anyone to read the proposed protocols for general knowledge, as they can also give ideas for more elaborate needs.

> 6. The discussion of the provided biogitflow templates sounds useful, but the discussion in the paper is unclear. Perhaps this could be expanded upon?

We realized that both the link to the template and a description of its content were indeed missing. We therefore added the URL of the template and a more detailed description that also refers to the dedicated section of the biogitflow documentation website.

> 7. The paper ends with "We strongly encourage the promotion of such protocols not only for the healthcare activities but also for research activities." However, due to the aforementioned lack of critical analysis, it is not clear from the paper why this is true.

As mentioned in point number 5, we have detailed in the introduction who the target audience is and highlighted that these protocols are useful in the "context of production". We have modified the sentence quoted above to highlight again, in the conclusion, the usefulness of the protocols in the "context of production".
Competing Interests: No competing interests were disclosed.

COMMENTS ON THIS REPORT

Author Response 04 Dec 2020

Philippe Hupé, Institut Curie, Paris, F-75005, France


Version 3

VERSION 3 PUBLISHED 22 Jun 2020

Open Peer Review

Reviewer Status

Reviewer Reports

Invited Reviewers
1. Bernard Pope, University of Melbourne, Parkville, Australia
2. Julia Ponomarenko, The Barcelona Institute of Science and Technology, Barcelona, Spain

Reviewer reports by version
Version 3 (revision), 19 Feb 21: report from reviewer 1
Version 2 (revision), 08 Dec 20: reports from reviewers 1 and 2
Version 1, 22 Jun 20: report from reviewer 1


Reviewer Report

01 Mar 2021 | for Version 3

Bernard Pope, Melbourne Bioinformatics, University of Melbourne, Parkville, Vic, Australia

Approved

I am happy that the authors have addressed all of my comments and suggestions from the previous version.

A few small typos remain and the article would benefit from a thorough proofread, but these are only minor issues.

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Bioinformatics. Genomics. Cancer. Software Development. Pipelines.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.


Reviewer Report

21 Jan 2021 | for Version 2

Julia Ponomarenko, Center for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain

Approved

This paper presents the protocol for the development and support of bioinformatics pipelines using git and GitLab. The intended users are bioinformaticians working in a clinical setting, while bioinformaticians working in research/academia can also benefit.

This reviewer questions whether the scope of the intended audience can be expanded. For example, in a clinical setting, software developed in the field of medical informatics is often used along with software developed in biology. Would medical informatics benefit from the proposed approach as well? Also, the first paragraph mentions only genomics sequencing as a field of application. Could the same approach to workflow development be applied to proteomics, metabolomics and other fields relevant to personalized medicine and diagnostics?

Minor comment:

It shouldn’t belie the value of the presented protocol, but the reader and potential user might benefit and become more enthusiastic about using the biogitflow if the presented resource provided and described an example of a specific pipeline with a small test dataset.
The paper requires proofreading for typos; e.g., “As a real example, we provided a of the repository that is available …”

Is the work clearly and accurately presented and does it cite the current literature?
Yes

Is the study design appropriate and is the work technically sound?
Yes

Are sufficient details of methods and analysis provided to allow replication by others?
Partly

If applicable, is the statistical analysis and its interpretation appropriate?
Not applicable

Are all the source data underlying the results available to ensure full reproducibility?
Partly

Are the conclusions drawn adequately supported by the results?
Partly

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

bioinformatics

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.


Author Response

19 Feb 2021

Philippe Hupé, Institut Curie, Paris, F-75005, France

First, we would like to thank the reviewer. We are very grateful for her time, contribution and very valuable comments that significantly helped to improve the article.

You will find below a detailed answer to the different issues that we have addressed in the revised manuscript.

Best regards,

Choumouss Kamoun, Julien Roméjon and Philippe Hupé

=== Detailed response to the reviewer2's comments ===

> This reviewer questions whether the scope of the intended audience can be expanded. For example, in a clinical setting, software developed in the field of medical informatics is often used along with software developed in biology. Would medical informatics benefit from the proposed approach as well? Also, the first paragraph mentions only genomics sequencing as a field of application. Could the same approach to workflow development be applied to proteomics, metabolomics and other fields relevant to personalized medicine and diagnostics?

It is indeed true that the proposed protocols are not restricted to genomics.
We have added in the conclusion that they can be applied straightforwardly to
any omics bioinformatics pipeline and even to any software development.

> It shouldn’t belie the value of the presented protocol, but the reader and potential user might benefit and become more enthusiastic about using the biogitflow if the presented resource provided and described an example of a specific pipeline with a small test dataset.

The biogitflow protocols come with a biogitflow-template that is publicly
available on GitLab.com. In this new version, we have added to the biogitflow
documentation an 'Exercises' section that first explains how the user can fork
the GitLab repository. The user can then work through the different exercises.
As biogitflow is mainly dedicated to the use of git and GitLab, and not to the
coding standard or the code itself, it is difficult to provide a specific
pipeline with a test dataset. Therefore, we strongly encourage the user to
start from the template and go through the different exercises by following
the technical protocols available in the documentation.

> The paper requires proofreading for typos; e.g., “As a real example, we provided a of the repository that is available …”

Typos have been fixed.

Competing Interests

No competing interests were disclosed.


Reviewer Report

19 Jan 2021 | for Version 2

Bernard Pope, Melbourne Bioinformatics, University of Melbourne, Parkville, Vic, Australia

Approved With Reservations

I thank the authors for their careful consideration of the issues raised in the previous review. I believe the changes made to the paper in response to those issues have improved the manuscript. I believe the following, relatively minor, issues should be considered in the next revision of the paper:

The abstract should mention that biogitflow is based on gitflow, a well-established workflow in the software engineering community.
It is not clear what "for any registered user" means, please clarify.
Typo: "remoterepository", should be "remote repository".
Some of the key differences between gitflow and biogitflow are enumerated, but additional explanation as to why they are different would be useful. For example, "In contrast, the release and hotfix branches have infinite lifetime in biogitflow," would benefit from a short explanation as to why this is the case.
"that makes easy" -> "that simplifies"?
"make the developer feels" (rephrase).
"As a corollary of these protocols was the necessity" (rephrase).
Are the training materials used in the accreditation process publicly available? It was not clear if they are included in the biogitflow documentation. If they are available then perhaps a link could be provided. If they are not available then perhaps the authors could consider including them in the documentation.
If possible, it would be useful to provide a citation for the "Deming wheel of continuous improvement in quality management".

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Bioinformatics. Genomics. Cancer. Software Development. Pipelines.

Responses (1)

Author Response

19 Feb 2021

Philippe Hupé, Institut Curie, Paris, F-75005, France

Competing Interests

No competing interests were disclosed.

Reviewer Report

22 Oct 2020 | for Version 1

Bernard Pope, Melbourne Bioinformatics, University of Melbourne, Parkville, Vic, Australia

Approved With Reservations

The current version of the paper is lacking in critical analysis of the proposed workflow. One does not have to look far to find criticism (positive and negative) of gitflow. However, this paper does not provide any significant discussion of the pros and cons of the approach. One of the main criticisms of gitflow is that it is overly complex (for some software development projects) and thus prone to error and misapplication. This complexity is illustrated in Figures 1 and 2 and the discussion in the paper about the need to train developers in its use. Given the apparent regimented nature of the process, it seems plausible that some of the complexity could be avoided by automation. Are there any useful tools that can help developers apply the model successfully and avoid the common criticisms of gitflow? If developers are finding it difficult to follow, especially the "tricky" parts then that may be a sign of limitations in the process or limitations in the tools supporting its application.
It is not clear how significantly biogitflow differs from gitflow. In what sense does the proposed workflow differ from established software engineering practices?
What alternative competing development models exist?
How can the authors be sure that the workflow is in fact making a positive difference to their software development?
It would be useful to clarify the intended audience of the paper. How can they apply and benefit from this work?
The discussion of the provided biogitflow templates sounds useful, but the discussion in the paper is unclear. Perhaps this could be expanded upon?
The paper ends with "We strongly encourage the promotion of such protocols not only for the healthcare activities but also for research activities." However, due to the aforementioned lack of critical analysis, it is not clear from the paper why this is true.

Is the work clearly and accurately presented and does it cite the current literature?
Yes

Is the study design appropriate and is the work technically sound?
Yes

Are sufficient details of methods and analysis provided to allow replication by others?
Yes

If applicable, is the statistical analysis and its interpretation appropriate?
Not applicable

Are all the source data underlying the results available to ensure full reproducibility?
Yes

Are the conclusions drawn adequately supported by the results?
Partly

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Bioinformatics. Genomics. Cancer. Software Development. Pipelines.

Responses (1)

Author Response

04 Dec 2020

Philippe Hupé, Institut Curie, Paris, F-75005, France

First, we would like to thank the reviewer. We are very grateful for his time, contribution and very valuable comments that significantly helped to improve the article.

You will find below a detailed answer to the different issues that we have addressed in the revised manuscript.

Best regards,

Choumouss Kamoun, Julien Roméjon and Philippe Hupé

Detailed response to Reviewer 1's comments

> Motivated by laboratory accreditation requirements for the development of bioinformatics workflows related to high throughput sequencing in a clinical setting, the authors have developed a software development process formalisation based on gitflow, a popular model in mainstream software engineering practice.

> The formalisation prescribes a branching, testing, merging and release process, with set roles and permissions for developers, and methods for managing the development of new features (nominal mode) and bug fixing (hotfix mode).

> In addition to describing the biogitflow workflow, the paper also contains a short section on testing bioinformatics software pipelines, however, this is largely a recapitulation of standard testing terminology, and does not appear to be particularly bioinformatics-specific.

> The paper has a brief discussion about the motivation for adopting the workflow and some unsubstantiated claims about improving their software engineering practices. It also has a brief discussion about the need to train developers in the workflow and formal accreditation.

> Two large figures are included in the paper that attempt to illustrate the proposed workflow in nominal and hotfix modes.

> While the topic of this paper is interesting and highly relevant to bioinformatics software developers working in areas of clinical application, where process accreditation is a key concern, there are a number of areas where I think the paper could be improved:

> 1. The current version of the paper is lacking in critical analysis of the proposed workflow. One does not have to look far to find criticism (positive and negative) of gitflow. However, this paper does not provide any significant discussion of the pros and cons of the approach. One of the main criticisms of gitflow is that it is overly complex (for some software development projects) and thus prone to error and misapplication. This complexity is illustrated in Figures 1 and 2 and the discussion in the paper about the need to train developers in its use. Given the apparent regimented nature of the process, it seems plausible that some of the complexity could be avoided by automation. Are there any useful tools that can help developers apply the model successfully and avoid the common criticisms of gitflow? If developers are finding it difficult to follow, especially the "tricky" parts then that may be a sign of limitations in the process or limitations in the tools supporting its application.

We fully agree that the protocols are complex and that mastering them is not straightforward. Of note, we pointed out in our reply to your comment number 5 that the protocols are mainly relevant in a production context, where data must be processed on the fly, which is obviously a challenge, especially when the bioinformatics pipelines have to deal with a very high throughput of data. The complexity of the workflow simply reflects the reality of the development and deployment cycle in a production context, as we intend to promote industrialized processes for software implementation in order to offer better service, better reactivity and better traceability to end-users. Formalism was thus required, and this is the reason why we wrote the biogitflow documentation.

You are also right when you say that it is "prone to error and misapplication" and that some "complexity could be avoided by automation". This is clearly an improvement in the sense of the Deming wheel, as the next step would be to capitalize on additional functionalities of git and GitLab, for example git hooks, GitLab CI/CD (continuous integration, continuous deployment) and GitLab Operations features.
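As an illustration only of the kind of automation a git hook could provide, the sketch below checks branch names before a push. It is not part of the biogitflow documentation: the branch-name pattern (`master`, `devel`, `release`, `hotfix`, `feature/...`) and the function names are assumptions chosen for the example, and only the stdin line format of git's `pre-push` hook is taken from git itself.

```python
#!/usr/bin/env python3
"""Sketch of a git pre-push hook that rejects unexpected branch names."""
import re
import sys

# Hypothetical naming convention (an assumption for this example);
# adapt it to the branches your own protocol defines.
ALLOWED_BRANCHES = re.compile(
    r"^(master|devel|release|hotfix|feature/[A-Za-z0-9._-]+)$"
)
HEADS_PREFIX = "refs/heads/"


def rejected_refs(lines):
    """Return the remote refs whose branch name breaks the convention.

    Each line has the form git passes to a pre-push hook on stdin:
    "<local ref> <local sha> <remote ref> <remote sha>".
    """
    bad = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # ignore blank or malformed lines
        remote_ref = fields[2]
        if not remote_ref.startswith(HEADS_PREFIX):
            continue  # tags and other refs are not checked here
        branch = remote_ref[len(HEADS_PREFIX):]
        if not ALLOWED_BRANCHES.match(branch):
            bad.append(remote_ref)
    return bad


def main(stream=sys.stdin):
    """Print one message per offending ref; return 1 to block the push."""
    bad = rejected_refs(stream)
    for ref in bad:
        print(f"push rejected: unexpected branch name in {ref}", file=sys.stderr)
    return 1 if bad else 0
```

To try it, the file could be saved as an executable `.git/hooks/pre-push` with a final line `sys.exit(main())`; git aborts the push whenever the hook exits with a non-zero status. A client-side hook like this only guards one developer's clone, so server-side rules would still be needed for enforcement across a team.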

As you mentioned, this critical analysis and these limitations were not highlighted at all. We therefore added a new paragraph at the end of the discussion to explain the limitations and the improvements foreseen with automation.

> 2. It is not clear how significantly biogitflow differs from gitflow. In what sense does the proposed workflow differ from established software engineering practices?

This comparison was clearly missing. We added a new paragraph at the beginning of the discussion that explains the main differences.

> 3. What alternative competing development models exist?

Other alternatives exist such as GitHub flow and GitLab flow. We added them in the section "Version control and branching model".

> 4. How can the authors be sure that the workflow is in fact making a positive difference to their software development?

This is a good question that is difficult to answer. It would require something like an experimental design comparing two or more workflows on objective performance metrics, and then testing whether the metrics are statistically different. We did not carry out such a study.
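For readers wondering what such a comparison could look like, here is a minimal sketch of a two-sided permutation test on the difference in means of one metric measured under two workflows. It is purely illustrative: no such measurements were collected for biogitflow, and the function name, the choice of metric and the choice of test are all assumptions made for the example.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the absolute difference in means.

    group_a and group_b hold one metric value per observation (for
    example, one value per release) under workflow A and workflow B.
    """
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    extreme = 0
    for _ in range(n_resamples):
        # Randomly reassign the observations to the two workflows.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # The +1 terms count the observed labelling itself, so p is never 0.
    return (extreme + 1) / (n_resamples + 1)
```

A small p-value would suggest that the difference between the two workflows is unlikely to come from chance alone. The harder part in practice is the study design itself: choosing an objective metric and collecting comparable observations for each workflow.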

However, we can say that the benefit of the biogitflow protocols was two-fold: 1/ "harmonization of the development practices" and 2/ the "constructive and positive pressure" for "better quality". These points have been added at the end of the second paragraph of the discussion.

> 5. It would be useful to clarify the intended audience of the paper. How can they apply and benefit from this work?

Indeed, the target audience was not clearly mentioned. Therefore, we added in the introduction that the protocols are appropriate for software developers in the "context of production" to "support core facilities" that generate "high-throughput of data". The protocols aim at promoting better "industrialized processes". The "context of production" is restated in the conclusion as well.
Obviously, a bioinformatician involved in custom development to analyze data in a research project would not straightforwardly benefit from our work: for this use case, it might be sufficient to track the developments with git using a single branch. However, we encourage anyone to read the proposed protocols for general knowledge, as they can also give ideas for more elaborate needs.

> 6. The discussion of the provided biogitflow templates sounds useful, but the discussion in the paper is unclear. Perhaps this could be expanded upon?

We realized that both the link to the template and a description of its content were indeed missing. We therefore added the URL of the template and a more detailed description that also refers to the dedicated section of the biogitflow documentation website.

> 7. The paper ends with "We strongly encourage the promotion of such protocols not only for the healthcare activities but also for research activities." However, due to the aforementioned lack of critical analysis, it is not clear from the paper why this is true.

As mentioned in point number 5, we have detailed in the introduction who the target audience is and highlighted that these protocols are useful in the "context of production". We have modified the sentence quoted above to highlight again, in the conclusion, the usefulness of the protocols in the "context of production".

Competing Interests

No competing interests were disclosed.

Alongside their report, reviewers assign a status to the article:

Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested

Approved with reservations - a number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.

Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions


Figure 1. biogitflow protocol for the nominal mode.

Figure 2. biogitflow protocol for the hotfix mode.
