A fully featured COMBINE archive of a simulation study on syncytial mitotic cycles in Drosophila embryos

COMBINE archives are standardised containers for data files related to a simulation study in computational biology. This manuscript describes a fully featured archive of a previously published simulation study, including (i) the original publication, (ii) the model, (iii) the analyses, and (iv) metadata describing the files and their origin. With the archived data at hand, it is possible to reproduce the results of the original work. The archive can be used for both educational and research purposes. Anyone may reuse, extend and update the archive to make it a valuable resource for the scientific community.


Introduction
In systems biology and systems medicine, the steadily increasing size and complexity of simulation studies pose additional challenges to sharing reproducible results 1 . Repeated mentions of problems with replication and reproducibility 2-4 led to new standards, tools, and methods for the transfer of reproducible simulation studies 5-9 . Several projects and initiatives already deal with reproducibility issues, such as COMBINE (co.mbine.org), FAIRDOM (fair-dom.org), and the Reproducibility Initiative (reproducibilityinitiative.org).
The Computational Modeling in Biology Network (COMBINE) coordinates the development of standard formats for various aspects of a simulation study, for example the Systems Biology Markup Language (SBML) 10 . Today's studies consist of multiple, heterogeneous, and sometimes distributed data files, which makes it challenging to exchange complete and thus reproducible results. To close this gap, the COMBINE community developed the COMBINE archive 8 . A COMBINE archive is a single file that aggregates all data files and information necessary to reproduce a simulation study in computational biology. The skeleton of a COMBINE archive consists of a manifest and a metadata file, specified by the Open Modeling EXchange format (OMEX).
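To make the role of the manifest concrete, the following minimal sketch lists the entries that a manifest declares. It assumes that the archive is a ZIP container with a manifest.xml at its root, as defined by OMEX. The demo archive, its file names, and the helper names are hypothetical and are not the contents of the archive described in this note.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by OMEX manifest files.
OMEX_NS = "http://identifiers.org/combine.specifications/omex-manifest"


def list_manifest_entries(omex_bytes):
    """Return (location, format) pairs declared in the archive's manifest.xml."""
    with zipfile.ZipFile(io.BytesIO(omex_bytes)) as archive:
        root = ET.fromstring(archive.read("manifest.xml"))
    return [
        (content.get("location"), content.get("format"))
        for content in root.iter(f"{{{OMEX_NS}}}content")
    ]


def build_demo_archive():
    """Build a tiny, hypothetical archive in memory (illustration only)."""
    spec = "http://identifiers.org/combine.specifications"
    manifest = (
        f'<omexManifest xmlns="{OMEX_NS}">\n'
        f'  <content location="." format="{spec}/omex"/>\n'
        f'  <content location="./manifest.xml" format="{spec}/omex-manifest"/>\n'
        f'  <content location="./metadata.rdf" format="{spec}/omex-metadata"/>\n'
        f'  <content location="./model/demo.xml" format="{spec}/sbml"/>\n'
        "</omexManifest>\n"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("manifest.xml", manifest)
        archive.writestr("metadata.rdf", "<!-- placeholder metadata -->")
        archive.writestr("model/demo.xml", "<!-- placeholder model -->")
    return buffer.getvalue()


entries = list_manifest_entries(build_demo_archive())
for location, fmt in entries:
    print(location, fmt)
```

Reading the manifest first is a natural entry point for any tool, because it states which files belong to the study and in which format each of them is encoded.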
Here we describe a fully featured COMBINE archive, which encodes an investigation of the syncytial mitotic cycles in Drosophila embryos 15 . The study published by Calzone et al. proposes a dynamical model for the molecular events underlying rapid, synchronous, syncytial nuclear division cycles in Drosophila embryos. This particular study was chosen for several reasons. Firstly, the paper, the documentation, and the related data are openly accessible. Secondly, the model is available in two standard formats: the CellML encoding is available from the Physiome Model Repository 16 at models.cellml.org/exposure/1a3f36d015121d5596565fe7d9afb332 and the SBML encoding is available from BioModels 17 at www.ebi.ac.uk/biomodels-main/BIOMD0000000144. Thirdly, both model files are already curated, which increases the level of trust. Fourthly, the model describes a common biological system (cell cycle). Thus, the basic mechanisms of the encoded biology should be familiar to many researchers, reducing the effort of understanding the example.
This archive contains files that are openly available for download, as well as previously unpublished files that were generated using COMBINE-compliant software tools (see Section Materials and methods). When executed, it reproduces the original findings by Calzone et al.

Materials and methods
The fully featured COMBINE archive was created in three subsequent steps. Firstly, all available materials relating to the study were automatically retrieved from an online resource (initial archive). Secondly, the data files were organised into subdirectories, following the different aspects of a simulation study (documentation, model, experiment, result). Thirdly, missing files were manually retrieved from web resources or created using COMBINE-compliant software tools. The three steps are described in the following.
Retrieving an initial COMBINE archive
The initial version of the COMBINE archive was generated using the web-based software tool M2CAT 18 Version 0.1 (m2cat.sems.uni-rostock.de). Among the suggested archives for the work by Calzone et al., we chose the simulation study containing a CellML model and a visualisation of the model in three different formats (PNG, SVG, AI). M2CAT automatically generated the initial COMBINE archive from these files. It also added metadata to the archive, such as annotations to creators, contributors, and modification times. M2CAT retrieved this metadata from the corresponding Git project in the Physiome Model Repository (git log).
Organising the COMBINE archive
For convenience, the files inside the COMBINE archive were structured in subfolders. The initial archive was therefore imported into the CombineArchiveWeb application (WebCAT 9 ) Version 0.4.13 (webcat.sems.uni-rostock.de). WebCAT is a web interface to display and modify the files contained in an archive, together with metadata and file structures. The files inside the archive were organised in four directories, which reflect the different aspects of a simulation study:
• documentation/: files that describe and document the model and/or experiment (empty)
• model/: files that encode and visualise the biological system (4 files)
• experiment/: files that encode the in silico setup of the experiment (empty)
• result/: files that result from running the experiment (empty)
All files in the initial archive were stored in the model/ directory. However, these files alone are not sufficient to reproduce the study.
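The directory layout can be checked programmatically. The sketch below groups archive entries by the four directories named above and reports how many files each one holds. The file names are hypothetical placeholders for the CellML model and its three visualisations, and group_by_aspect is an illustrative helper rather than part of WebCAT.

```python
from pathlib import PurePosixPath

# The four directories that reflect the aspects of a simulation study.
ASPECTS = ("documentation", "model", "experiment", "result")


def group_by_aspect(locations):
    """Group archive entry paths by their top-level directory."""
    groups = {aspect: [] for aspect in ASPECTS}
    for location in locations:
        parts = PurePosixPath(location).parts
        # Entries at the archive root (e.g. manifest.xml) are skipped.
        if len(parts) > 1 and parts[0] in groups:
            groups[parts[0]].append(location)
    return groups


# Hypothetical file names standing in for the four files of the initial archive.
initial_files = [
    "manifest.xml",
    "metadata.rdf",
    "model/demo_model.cellml",
    "model/demo_model.png",
    "model/demo_model.svg",
    "model/demo_model.ai",
]

groups = group_by_aspect(initial_files)
for aspect in ASPECTS:
    print(f"{aspect}/: {len(groups[aspect])} file(s)")
```

For the initial archive, such a check makes the gap visible at a glance: only model/ is populated, while the directories needed for documentation, experiment, and result are still empty.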

Extending the COMBINE archive
To make the encoded study reproducible, the COMBINE archive needs to be extended with additional files.
The article is typically the central object of a research study. For this study, the original publication by Calzone et al., together with available supplementary information, was retrieved from the homepage of the journal Molecular Systems Biology (msb.embopress.org/content/3/1/131). Using WebCAT, the files were uploaded to the documentation/ directory of the archive. The automatically added metadata was adjusted to attribute the authors of the publication and to state when and from where the files were downloaded. In the background, WebCAT encoded the metadata in RDF/XML and added it to the archive.
The model is not only available in CellML format, but also in SBML format. The SBML file was retrieved from BioModels (www.ebi.ac.uk/biomodels-main/download?mid=BIOMD0000000144, SBML Level 2 Version 1) and uploaded to the model/ directory. Again, the metadata was corrected to attribute the original authors, curators, and contributors, as stated on the BioModels website (www.ebi.ac.uk/biomodels-main/BIOMD0000000144) and in the model document.
The simulation description is essential to run the experiment. It defines the simulation environment and the output of the in silico execution. As no simulation description was found in any of the open repositories, an initial version was created using the SED-ML Web Tools (SWT) Version 2.1 (bqfbergmann.dyndns.org/SED-ML_Web_Tools). SWT takes the model files and creates a default simulation description with standard settings. For this study, a default SED-ML file encodes instructions to generate 66 plots and a data table. Each plot describes the change of concentration in one species of the model. The data table contains all numerical values. Based on the default script, a second SED-ML file (Calzone2007-simulation-figure-1B.xml) was generated to reassemble Figure 1B of the original publication. Using WebCAT, both SED-ML scripts were added to the experiment/ directory of the archive. The metadata for the new files was added.
The simulation results reflect the behaviour of a model under certain conditions. The script defined in Calzone2007-simulation-figure-1B.xml was loaded into SWT and into the stand-alone software program COPASI Version 4.15 Build 95 19 . The plots generated by both tools show that the developed in silico experiment reproduces the results from the paper. Using WebCAT, the figures produced by SWT and COPASI were uploaded and added to the result/ directory of the archive. Metadata, such as the versions of the software tools, was added accordingly.
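As an illustration of what a simulation description contains, the following sketch summarises a minimal SED-ML document: the referenced model, the time-course settings, and the number of plots and reports. The embedded document is a hypothetical example. Its model source, time range, number of points, and KiSAO term are placeholders and not the settings stored in Calzone2007-simulation-figure-1B.xml. Elements are matched by local name so that the sketch does not depend on a particular SED-ML level or version.

```python
import xml.etree.ElementTree as ET

# Hypothetical SED-ML document; all values are placeholders for illustration.
DEMO_SEDML = """
<sedML xmlns="http://sed-ml.org/" level="1" version="1">
  <listOfSimulations>
    <uniformTimeCourse id="sim1" initialTime="0" outputStartTime="0"
                       outputEndTime="200" numberOfPoints="1000">
      <algorithm kisaoID="KISAO:0000019"/>
    </uniformTimeCourse>
  </listOfSimulations>
  <listOfModels>
    <model id="model1" language="urn:sedml:language:sbml"
           source="../model/demo.xml"/>
  </listOfModels>
  <listOfTasks>
    <task id="task1" modelReference="model1" simulationReference="sim1"/>
  </listOfTasks>
  <listOfOutputs>
    <plot2D id="plot1" name="demo plot"/>
    <report id="report1" name="demo table"/>
  </listOfOutputs>
</sedML>
"""


def local_name(element):
    """Strip the '{namespace}' prefix from an element tag."""
    return element.tag.rsplit("}", 1)[-1]


def summarise_sedml(sedml_text):
    """Collect model sources, time-course settings, and output counts."""
    summary = {"models": [], "simulations": [], "plots": 0, "reports": 0}
    for element in ET.fromstring(sedml_text).iter():
        name = local_name(element)
        if name == "model":
            summary["models"].append(element.get("source"))
        elif name == "uniformTimeCourse":
            summary["simulations"].append((
                float(element.get("outputStartTime")),
                float(element.get("outputEndTime")),
                int(element.get("numberOfPoints")),
            ))
        elif name == "plot2D":
            summary["plots"] += 1
        elif name == "report":
            summary["reports"] += 1
    return summary


summary = summarise_sedml(DEMO_SEDML)
print(summary)
```

Applied to a default description of the kind SWT generates, a summary like this would show one plot per species plus one report, which is a quick way to tell the default script apart from a figure-specific one.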
The visualisation of a model helps to understand the encoded biological system. For this study, an SBGN-compliant visualisation of the model was created using SBGN-ED Version 1.5.1 20 together with VANTED Version 2.1.0 21 . SBGN-ED generated an automatic layout of the uploaded SBML model, which was then improved manually. The resulting Figure 1 was exported in different formats (GraphML 22 , GML (www.fim.uni-passau.de/index.php?id=17297&L=1), PNG image, PDF, and SBGN-ML 23 ). Using WebCAT, the files were uploaded to the model/sbgn directory and metadata was provided.

Data description
The archive consists of 21 files (Table 1). Among these files are the manifest.xml and the metadata.rdf, which form the skeleton of the archive.

Data validation
The COMBINE archive described in this data note reproduces the results of the study published by Calzone et al. To validate the reproducibility, we executed the archive in different simulation tools. For example, the encoded simulation study can be executed in COPASI, cf. Figure 2(b). The archive can also be loaded into SWT by opening a specific URL (bqfbergmann.dyndns.org/SED-ML_Web_Tools/Home/SimulateUrl?url=http://scripts.sems.uni-rostock.de/getshowcase.php). The simulation results are then shown immediately in the web browser, cf. Figure 2(c).

Conclusions
The presented COMBINE archive provides a reproducible simulation study for a previously published model on syncytial mitotic cycles in Drosophila embryos 15 . The archive contains several files that were collected from online resources, e.g. the CellML model from the Physiome Model Repository or the scientific publication from the publisher's website. It also provides new files that did not exist previously, e.g. a SED-ML file to encode the simulation setup for Figure 1B of the original publication.

Author contributions
MS generated the data files for the archive and designed the initial version. DW and MS wrote the manuscript.

Competing interests
No competing interests were disclosed.

To achieve their aim, they use established mark-up languages such as CellML, SBML and SED-ML, and package everything in a COMBINE archive. This COMBINE archive can then be used by anyone to reproduce the simulation experiment, as well as have access to the original paper, SBGN diagram, etc. This data note therefore shows us one possible, and viable, way to address the issue of reproducibility in computational biology and should, as such, be considered for indexing.

Major comments:
How do you envisage tackling the issue of provenance? Say that someone modifies your archive and makes it available to the community, how are we then supposed to know which one to use and what the differences are between the two versions? Considering that your COMBINE archive is on GitHub, I guess someone could always fork it, but maybe a better approach would be to take advantage of existing repositories in the community, such as BioModels.net and the Physiome Model Repository (PMR)? I am not sure about BioModels.net, but PMR inherently addresses the issue of provenance.
In their supplementary material, Calzone et al. provide some files that can be used to reproduce their different figures. So, in effect, they allow for their results to be reproduced (using XPPAUT). It would therefore make sense to have a comparison of their 'approach' to reproducibility compared with yours.
The two SED-ML files in your COMBINE archive refer to the SBML version of the Calzone model. Now, because you are also providing a CellML version of that model, it would be nice to also have a CellML-based version of your two SED-ML files.

Materials and methods:
Retrieving an initial COMBINE archive: It would be nice to know exactly what kind of search was done using M2CAT (probably one using the term "calzone"?) and, then, which COMBINE archive was used as an initial COMBINE archive.

Organising the COMBINE archive:
"The initial archive was therefor[e] imported…" (therefore). It would be nice to know what those 4 initial files were (those in the model folder).

Extending the COMBINE archive:
The article: "…to state when and [from] where…" (from).

The simulation description:
What is the point of having that default simulation experiment? To use it as a starting point for reproducing Figure 1B of the Calzone paper is fine, but I don't see the point of including that default simulation experiment in the COMBINE archive.
It would be nice to know what simulation parameters and solver (incl. its parameters) were used to reproduce Figure 1B of the Calzone paper. (I imagine they are the same as the ones used by the authors with XPPAUT?)
You might want to use sysbioapps.dyndns.org/SED-ML_Web_Tools as the URL for SED-ML Web Tools (rather than bqfbergmann.dyndns.org/SED-ML_Web_Tools)?
It would be nice to have all the other figures of the Calzone paper also encoded in SED-ML, just so that your COMBINE archive is not only fully featured, but also complete.

The visualisation of a model:
Are you sure about the version of VANTED you used? Version 2.1.0 is somewhat old compared with the latest version available (version 2.6.3). Actually, looking at the contents of your COMBINE archive, I can see that Calzone2007.gml was generated using VANTED 2.6.2. (FWIW, SBGN-ED 1.6 has just been released.)
The link for GML is broken (http://www.fim.uni-passau.de/index.php?id=17297&L=1).
Rather than referring to scripts.sems.uni-rostock.de/getshowcase.php (in this section and elsewhere in the manuscript), you might want to refer to your GitHub repository (github.com/SemsProject/CombineArchiveShowCase) and make use of GitHub's release feature? It might be safer in the long term.
Clicking on the bqfbergmann.dyndns.org/SED-ML_Web_Tools/Home/SimulateUrl?url=http://scripts.sems.uni-rostock.de/ link takes me to a page that reads "No model uploaded. You need to upload a model first prior to attempting to simulate it!"
"… shown in the web browser, cmp. Figure 2(c)."?
You might want to provide a URL for COPASI and Tellurium?

Data availability:
You might want to reference your very latest commit (i.e. 6c34cc4) rather than commit a469197?

Figure 1:
I am not sure how useful this figure is. To me, it doesn't bring anything to your data note. Not only that, but at its original size, one cannot read anything (I personally had to view it at 400% to be able to start reading the different labels).

Figure 2:
You might want to remove the "B" in panel (a). It's confusing.
You might want to mention that the range of the X and Y axes, as well as the colour of the different plots, cannot currently be specified in SED-ML, hence panels (b) and (c) don't perfectly match panel (a)?
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Competing Interests: No competing interests were disclosed.

The article describes a COMBINE archive of a previously published ODE model (Calzone et al.) of the oscillatory dynamics of cell cycle regulatory proteins in Drosophila embryos undergoing rapid syncytial nuclear division cycles.
The files are stored on GitHub in 4 subfolders, containing the original article, model files, the figures generated from the simulations, as well as metafiles. The XML and CellML files that were originally published along with the article were modified, by adding metadata on the authors.

Issues:
There is a clear advantage to such an archive for publication of such models, reproducibility etc., but the authors do not insist on it. This type of archive could be added to publications, and provided by the authors, with a full documentation. This more general aim of the article should be more forcefully and clearly stated.
We tried to open the model file available on GitHub (XML) using COPASI (4.16) but some errors appeared. We downloaded the initial model from BioModels and it worked. Same issues with CellML files.
The link pointing to the website "bqfbergmann.dyndns.org/SED-ML_Web_Tools/" is broken, but if copy-pasted the figures are correctly displayed.
We have read this submission. We believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
No competing interests were disclosed.

The data note describes how COMBINE is used to make available a mathematical model for simulation of syncytial mitotic cycles in Drosophila. Although well written, the aim of the article is somewhat unclear to me.
The way the introduction is written and the archive is named suggest that the purpose of making the archive is to serve as a showcase for how to use COMBINE archives. However, the way the rest of the article is written focuses very much on how this specific archive was made and how it is organized. If the goal is to showcase COMBINE archives, I feel the manuscript lacks a description of not only how the specific archive was made but also why this would be a good way to make such archives in general. If, on the other hand, the aim is to make the specific model available, it lacks some more background on the specific model and possibly suggestions or examples of how it can be used.

I am confused about what is to be considered the primary repository / access point. The article refers to a GitHub repository, an omex file on figshare, and a link to a php script on their own server where one can retrieve the latest version of the omex file. This leads to a number of questions: Am I right that GitHub is the primary place where development is done and improvements will be made? Is the omex file provided by the php script always up-to-date with the GitHub repository, or may the newest version on GitHub be even newer than the latest version made available as an omex file? Will the omex file on figshare be updated, or will there only be a v1 there? Considering that figshare has both versioning and an API for submission, I think the best solution would be to abandon the php script. This would eliminate redundancy, as the omex file would only be on figshare. Since figshare has both versioned and version-less DOIs for datafiles, this would allow the authors to provide a stable DOI that always points to the latest version of the omex file and at the same time allow users to always cite the specific versioned DOI of the omex file they used in their work. Given that figshare has a submission API, it should be possible to make sure that the omex file is automatically updated based on the GitHub repository whenever needed.
The URL "www.fim.uni-passau.de/index.php?id=17297&L=1" leads me to an error page irrespective of whether I copy it or click it.

Minor corrections:
The authors several times use the abbreviation "cmp." to refer to other parts/figures in the article. I have never encountered this abbreviation before and was unable to find any other articles that use it. Given the context, I suspect "cf." may be what is meant.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Competing Interests: No competing interests were disclosed.