Research Article

Platforms for publishing and archiving computer-aided research

[version 1; peer review: 2 approved with reservations]
PUBLISHED 24 Nov 2014
Konrad Hinsen, Centre de Biophysique Moléculaire (CNRS), France

Abstract

Computational models and methods take an ever more important place in modern scientific research. At the same time, they are becoming ever more complex, to the point that many such models and methods can no longer be adequately described in the narrative of a traditional journal article. Often they exist only as part of scientific software tools, which causes two important problems: (1) software tools are much more complex than the models and methods they embed, making the latter unnecessarily difficult to understand; and (2) software tools depend on minute details of the computing environment they were written for, making them difficult to deploy and often completely unusable after a few years. This article addresses the second problem, based on the experience gained from the development and use of a platform specifically designed to facilitate the integration of computational methods into the scientific record.

Keywords

Computational models, publishing, software tools, scientific record

Introduction

In the course of a few decades, computers have become essential tools in scientific research and have profoundly changed the way scientists work with data and with theoretical models. Until now, these changes have had little impact on the scientific record, which still consists mainly of narratives published in articles that are limited in size and type of contents, and linked to each other through citations. Some particularly data-intensive fields of research also have their own digital repositories. An early example is the Protein Data Bank1, which publishes and archives structures of biological macromolecules. However, for most domains of research, no such repositories exist, and datasets are mostly neither published nor archived.

While the technology used for publishing and archiving the scientific record has shifted from the printing press and libraries to PDF files and Web servers, the kind of information that is being stored has hardly changed. Some scientific journals offer the possibility of submitting “supplementary material” with articles, as a way to circumvent the habitual length restrictions and to provide unprintable information such as videos. In principle, the data underlying an article can be published as supplementary material as well, but this remains an exception and is in fact of little practical interest. The reasons are the various restrictions on file formats and file sizes imposed arbitrarily by different scientific journals, but also the often difficult access to these electronic resources, which usually requires a careful study of each journal’s Web site. Only the recent advent of Web repositories2–4 and peer-to-peer networks5 for scientific data has finally made the publication of scientific data accessible to any scientist willing to undertake it.

The increasing number of mistakes found in published scientific findings based on non-trivial computations6,7 has made evident the necessity of making computational science more transparent by publishing software and datasets along with any descriptions of the results obtained from them. While this is now technically possible, and initiatives have been started to create incentives for scientists to invest the additional effort required for making such material available8, much more work remains to be done to ensure that published software and datasets can actually be understood, verified, and reused by other scientists. This is particularly important because computational methods are becoming an essential aspect of all scientific research, including experimental and theoretical work in which computations are not the main focus of activity. It is therefore more appropriate to discuss these issues in the context of “computer-aided research” rather than the narrower specialty of computational science.

Most current efforts in this direction (see e.g. Refs. 9–11) start from the status quo of computation in science and propose small-step improvements in order to facilitate adoption by the scientific community. The work presented in this article takes the opposite approach of starting from the requirements of the scientific record and exploring how software and electronic datasets need to be prepared in order to become useful parts of this record. Both approaches are complementary: while ease of adoption is important for rapid improvement, it is also important to have a clear idea of the goal that should ultimately be reached, in order to avoid getting stuck in technological dead-ends.

The main contributions made by this work are the following insights:

  • The traditional distinction between “software” and “data” is not adapted to the needs of scientific communication. It should be replaced by a distinction between “computational tools” and “scientific contents”. Scientific contents are the information that is conserved permanently in the scientific record. It includes experimental data, theoretical models, and computational protocols.

  • Theoretical models and computational protocols include algorithms, whose permanent conservation requires a precise and stable representation with well-defined semantics.

  • Scientific contents consist of distinct information items linked by dependency and provenance information, which must be stored in the scientific record as well in order to ensure reproducibility. The same information can be used for attributing credit to everyone involved in producing the information.

  • A proof-of-concept implementation shows that these goals are attainable with the existing technology.

These insights are the result of developing a new computational framework, called ActivePapers12, and using it for several research projects in the field of biomolecular simulation. The ActivePapers framework is not the principal result of this work, and it will be described only insofar as its technical characteristics matter for the conclusions. In fact, one of the conclusions of this study is that the current ActivePapers framework does not satisfy all the requirements for integrating software and datasets into the scientific record. Nevertheless, ActivePapers is a useful tool in spite of its imperfections, and researchers interested in exploring the future of computational science are invited to use it and build on it.

The state of the art

Software and electronic datasets

The main consequence of the computerization of scientific research has been an enormous increase in the volume and complexity of scientific information. In the pre-computing era, experimental data rarely exceeded a couple of printed tables, and the description of experimental protocols and theoretical models rarely exceeded a few pages. This information could easily be recorded in text form, replicated on printed paper, and stored in libraries. Moreover, each such piece of information could be read, understood, and verified by an individual scientist with sufficient experience in the underlying domain of research.

Today’s computer-aided research is characterized by large datasets which can only be stored electronically, and processed by software that embodies theoretical models and methods. While there is no fundamental difference between a printed table and a computer file, or between an experimental protocol and a piece of software, there is a very important practical difference: the size and complexity of the electronic versions often put them beyond the limit of what an individual scientist can understand or verify. Moreover, neither datasets nor software have traditionally been published along with the articles describing a scientific study, making verification strictly impossible even in cases of moderate size and complexity.

This situation has led to numerous mistakes in published results based on computation, of which the identified and publicized cases6,7 are only the tip of the iceberg. In fact, these cases plus our daily experience with buggy computer software in other aspects of life should make us consider any computational result in science suspect, unless clear evidence for verification and quality assurance is provided by the authors. This evidence includes software testing, formal proof of correctness, comparison of the outcomes of independent computational studies, and proof of rigor in the non-automatic aspects of the application of computational protocols. However, the most basic requirement for building confidence in computational results is total transparency: the publication of all datasets and software used during a scientific study. This is the main goal of the Reproducible Research movement, which has been gaining traction over the last years13–15. The transition to more trustworthy computer-aided research requires changes of both technological and social nature. The computational tools that scientists use must make replication not only possible, but straightforward. The maintainers of the scientific record must integrate electronic datasets and software into their archiving and publication process and reject submissions based on computational work that is not made fully transparent. But most of all, scientists must adopt a much more critical attitude towards computational results. The current tacit convention in computational science is that published results are assumed correct unless there is clear evidence suggesting a mistake. As C.A.R. Hoare famously said16, “There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies”. This observation applies equally to computational science, where the vast majority of published results have no obvious deficiencies, but are obtained using software that is much too complicated for anyone to be certain about its correctness.

The scientific record

The term “scientific record” refers to the totality of published scientific findings in history. It started to become organized in 1665 with the creation of the first scientific journal, the Philosophical Transactions of the Royal Society. A scientific journal publishes articles, which are narratives explaining the motivation for a specific study, the methods being applied, the observations made, and the conclusions drawn. The exact observations are provided in the form of tables and figures. To this day, scientific journals are the most visible part of the scientific record, although digital repositories have become an important second pillar in several domains of research. These repositories store datasets that are too big to be included as tables or figures. In addition, repositories facilitate the reuse of data in other studies, including the application of data mining techniques in meta-analysis studies.

One of the main characteristics of the scientific record is its permanence. Once an article is published in a journal, or a dataset in a database, a permanent reference is attached to it, and the publisher or database accepts the moral obligation to make the information accessible through this reference for as long as possible. Traditionally, this reference is a citation taking a more or less standardized format. With the transition to electronic publishing, the role of the permanent reference is fulfilled by a Digital Object Identifier (DOI)17, which is defined by the international standard ISO 26324:2012. For published articles, permanence also applies to the contents: once published, they cannot change. Mistakes detected after publishing can be corrected only by publishing a separate short article called “erratum”. The situation is less clear for repositories, some of which, for example the Protein Data Bank (PDB)1, do correct formal mistakes in their electronic records, but do not allow changes that would modify their scientific interpretation. More recently, the idea of versioning has been applied by some publishers, e.g. F1000Research: a DOI refers permanently to a specific article, but multiple versions of this article remain accessible permanently in order to document the evolution of the publication.

The permanence of the scientific record applies only to the preservation of the original expression of each information item, but not to its semantics. A published article can become unintelligible because of changes in terminology and in the scientists’ education. A modern physicist would not recognize the theory of classical mechanics in Newton’s “Philosophiæ Naturalis Principia Mathematica”18 without prior training in Latin and in the history of science. The longevity of electronic datasets relies on a careful documentation of the data models and data formats being used, and on proper curation of submissions by the database managers to ensure adherence to these formats. Scientific journals generally use the PDF/A format for their articles, which is a variant of the popular PDF format designed specifically for long-term archiving, defined by the international standard ISO 19005:2005. The latest generation of uncurated Web repositories2–4 allows any computer file to enter the scientific record without even requiring a definition of the data format. It is to be expected that much of the information in these repositories will quickly become unusable.

Software presents a particular challenge because formal data models for executable computer code are rare and in particular non-existent for the programming languages commonly used in computational science today. This is one of the key problems addressed in this work. Many practitioners consider the idea of preserving scientific software for many years unrealistic, and some even argue that it is unnecessary because computational methods change so rapidly that their long-term conservation is of no interest. The latter argument is manifestly not valid in general. As an example, the DSSP method for defining secondary-structure elements in proteins19 was published more than thirty years ago and is still widely used today. The example of DSSP is also interesting in that the most widely used software implementing DSSP today20 does not in fact implement the exact method published in the original paper. The differences are not documented anywhere at this time, and scientific papers using the modern software systematically cite the original paper without further comment. It can thus be assumed that most DSSP users are not aware of the fact that they are using a modified method. If the original method had been published in executable form and preserved until today, such discrepancies could have been avoided.

Replication and reproduction of scientific results

One of the cornerstones of scientific research is the reproducibility of scientific findings: in principle, anyone applying a published research protocol to a sufficiently similar object of study should obtain very similar results. The notion of reproducibility is necessarily imprecise. In performing experiments, the samples and environments are never exactly the same. Moreover, the description of an experimental protocol is never fully complete, because the experimenters cannot know with certainty which parameters need to be recorded. The level of similarity required in the experimental setup and in the results for the latter to be considered a successful reproduction thus varies considerably across domains of research. A reproduction attempt, whether successful or not, always yields new scientific knowledge, because it explores the impact of variations in the protocols, environments, and samples.

In computational science, the protocols, input data, and computational environment can in principle be recorded exactly, being digital information. A sufficiently complete recording of this information makes a computational study replicable: another scientist can re-run the exact same computation and obtain exactly identical results. Reproducibility, like in experimental science, refers to the less precise idea of re-doing computations based on the description of the methods in the published narrative, which is supposed to describe the “important” aspects (e.g. a numerical algorithm) but not details generally considered to be irrelevant (e.g. compiler and library version numbers).

Replicability and reproducibility in computational science play different roles in the scientific process. Replicability is part of quality assurance. If scientist B can replicate scientist A’s computation, this shows that A has provided a sufficiently precise and complete record of the original work. This is far from trivial because a precise and complete description of a computational study is a dataset that is both large and complex. Moreover, it is often difficult to obtain in today’s computing environments. Reproducibility plays the same role as in other branches of science: it establishes which aspects of a computational protocol are important for reaching specific conclusions.

However, replicability and reproducibility are not completely independent. For all but the simplest computational methods, reproducibility requires replicability. If a reproduction attempt leads to significantly different results, the cause of the differences must be explored. This is in practice only possible if the original results can at least be replicated. Otherwise, the most probable explanation for the difference is that the original study was insufficiently documented, and is thus of very limited value. But exploring the differences requires more than replicability: the original computational protocol must also be understandable by the scientist who sets out to explore the failure of a reproduction attempt. The ultimate problem is that computational methods have become very complex. A summary in the narrative is not sufficiently detailed to allow reproduction. On the other hand, the actual executable computer code is precise and complete, but even more complex than the method because it also needs to take into account complex technical issues such as performance and resource management. I have outlined solutions to this problem in Ref. 21.

Different parts of the scientific process impose different criteria for replicability and reproducibility. Conducting a study requires only short-term local replicability for quality assurance, i.e. the authors must be able to re-run their own computation in their own computational environment. Collaboration requires short-term non-local replicability, because co-workers usually have somewhat different computational environments. Pre- or post-publication peer review requires more stringent short-term non-local replicability, because reviewers are likely to have significantly different computing environments at their disposal. Peer review also requires a minimal level of reproducibility in that reviewers must at least be convinced that they could reproduce the findings if they tried, even though for lack of time they rarely do. Publication and archiving as part of the scientific record add the requirements of long-term reproducibility and thus long-term replicability. All archived data must remain usable and understandable for as long as the study remains of scientific interest, which is usually a few decades.

Tools for replicable and reproducible research

Since replicability is a purely technical aspect of computer-aided research, it can in principle be guaranteed by software tools. Many ongoing research and development projects aim to create such tools, although for pragmatic reasons the goal is rarely complete replicability, but rather some weaker requirement that is easier to achieve with existing technology. Reproducibility implies human understanding and thus cannot be ensured or even verified automatically. Nevertheless, software tools can help in the process of documenting computational methods in a way that is both replicable and understandable. In the following, I will summarize the currently most popular approaches.

An approach that has already been widely adopted in domains of research that make heavy use of computation is the conservation of datasets in digital repositories issuing permanent references. A scientific narrative is separately published in a journal, and uses the permanent reference to establish a link with the data. This link between narrative and datasets can be tightened to the point where articles no longer contain any data in the form of tables, but only permanent references to repository entries. A good example of this strategy in chemistry is described in Ref. 22. In that particular work, the software does not enter the scientific record at all, and appears only in metadata stored along with the computational results, where it identifies the software packages, version numbers, and other details of the computational environment. Other initiatives (see e.g. 23, 24) advocate storing a snapshot of the program source code in a digital repository as a dataset, in order to provide at least an archive of the exact version of the software that was used, even if it is difficult or impossible to re-run that software later. This ambivalent attitude towards software stems from the recognition of its fundamental importance on one hand and from the practical impossibility of fully integrating today’s scientific software into the scientific record on the other hand.

The computational notebook approach11, pioneered by Mathematica25 and recently popularized by the Jupyter project (formerly known as the IPython notebook)26, builds on earlier developments in literate programming27, which have also been applied to computational science directly28,29. It aims to integrate computational methods expressed as working code with input/output data and the scientific narrative. It permits a seamless transition from interactive exploratory work to a documented computational method that can be shared and published. Compared to traditional scripts, computational notebooks represent an important advance in improving reproducibility through improving human understanding. However, none of the existing notebook implementations improve on scripts in replicability, which remains local and short-term. Like a script, a notebook depends on the computational environment in which it was generated. This environment is neither conserved nor even documented in the notebook. A few years later, a notebook still provides a human-readable and rather detailed description of the method, but re-running it is likely to be difficult or impossible. Moreover, the notebook approach does not take into account datasets, unless they are small enough to be included as literal data in the notebook itself. All other data is accessed through references that are usually not permanent, such as filenames or Uniform Resource Locators (URLs).

Similar remarks apply to workflow management systems such as Kepler30 or VisTrails31. In fact, workflows, scripts, and notebooks all refer to the same basic concept: the outer algorithmic layer that defines a specific computational study in terms of more generic components. The differences lie in the user interface and in the kind of components that can be used (libraries, executables, Web services, etc.). Some workflow managers can archive these components partially, and also some kinds of datasets, but such support is neither complete nor exhaustive, because the technology on which today’s scientific software is built does not allow this.

The most comprehensive approach to archiving scientific software in an executable form is based on virtual machine technology32–34. The authors of a computational study produce a virtual machine image that contains their complete computational environment, starting with the operating system, in addition to the problem-specific data and workflows. The resulting archives are in general too big to be archived in today’s general-purpose digital repositories. Moreover, it is not possible to refer to or re-use individual pieces of software or data inside a virtual machine image, nor is it straightforward in general to analyze the software or data except by the tools explicitly provided by the authors. But most importantly, the longevity of archived virtual machine images is uncertain. Executing such an image requires complex and sophisticated software, which for the moment is produced and maintained by non-scientific organizations for reasons completely unrelated to science. Once technological progress makes these efforts obsolete, it must be expected that computations archived as virtual machine images will become unusable.

A conclusion that can be drawn from the approaches summarized above is that today’s computational scientists cannot publish their work in a form that is at the same time executable, understandable by human readers, and reusable. Of the three basic kinds of information in computational science, software, data, and narrative, it is clearly the software that is at the root of the difficulties.

Platforms and contents

For understanding the difficulties caused by software, and thus to identify the possible solutions, it is useful to introduce the concept of a platform for scientific computing. A platform defines the interface between a computational infrastructure and computational contents. In particular, the platform defines the exact data formats that the contents must respect, and specifies how each data item will be interpreted. For example, the MP3 standard defines a platform for handling music in computers. It defines a file format for storing sound samples, and defines how an MP3 player interprets the data in such files. Any MP3 player is an implementation of the MP3 platform. Any MP3 file is a piece of contents for the MP3 platform. In general, a platform can be more complex and define formats for many different kinds of data. There are also customizable platforms that define some basic features of their contents but also a mechanism for adding more specific features. The best-known example is the XML platform, which allows working with generic structured text data, including the definition of more specific subformats through DTDs or schemas.

For a piece of software, the platform required to run it varies considerably as a function of how the software is presented. A compiled executable for the Microsoft Windows platform has very specific requirements concerning the instruction set of the real or virtual processor used to run it, but also concerning how the operating system services are accessed. In addition, its correct function may depend on specific versions of specific dynamically loadable libraries. Although most aspects of the Microsoft Windows platform are documented to some degree, there is no comprehensive documentation of the platform as a whole, and its complexity makes it unlikely that such documentation will ever be produced. In practice, only a test run can establish whether a given program works on a given machine. For software published in source code form, the requirements are very different but equally complex. Typical dependencies include a compiler or interpreter for a specific programming language, specific versions of specific libraries, and sometimes even specific files being accessible in specific locations. None of these details are documented comprehensively, which is why installation and deployment of software are so difficult. Even standardized programming languages are not defined with precise semantics, a situation that has already caused many serious problems (see e.g. Ref. 35 for the languages C and C++). This lack of a precisely defined and stable platform for executable code is also the root cause of non-replicability in computer-aided research.

Of the various attempts to remedy this situation, the best known and most successful one is the Java Virtual Machine (JVM)36, originally defined as a support for running software written in the Java language. Today the JVM hosts a variety of languages and ensures a high level of interoperability between them. The goal of the JVM developers was to enable the distribution of executable code via the Web, which users could run in their browsers without any prior installation or configuration. This goal has overall been reached successfully, and with remarkable stability: Java code written in 1995 can still be run without modification. There are only two aspects in which the JVM platform failed to attain universal portability: (1) interfacing to certain operating-system services, such as user interface layers or concurrency management, and (2) floating-point computations. The latter failure is due to a deliberate decision to give up the precise initial specification of the JVM in favor of a less rigid one that leaves more room for performance optimizations. It is still possible to use the original precise floating-point semantics, but in practice this feature is hardly used because most computer users give a higher priority to performance than to replicability.

The reasons for the JVM’s success in establishing a stable software platform are various, and to a significant degree due to the interplay of the commercial strategies of the major companies in the computing market. Among the technical reasons, the main one is the choice of a data model for executable code that is situated at a higher level of abstraction than machine code, but at a lower level than typical programming languages. Machine code evolves rapidly because of progress in processor design, and programming languages evolve, somewhat less rapidly, because of advances in software engineering. Stability can only be found in between these two extremes. Other stable software platforms have adopted the same fundamental approach. In particular, the ECMA standard CLI37 can be considered a more modern implementation of the basic JVM idea. Google’s much more recent Portable Native Client (PNaCl) platform38 chose a more low-level code representation defined by the LLVM project39 and a less precisely defined computational environment, in order to facilitate the adaptation of software written in traditional programming languages. It is too early to say whether this approach will turn out to be successful.

In summary, the JVM experience proves that a significant aspect of the software portability problem can be solved: it is possible to define a stable platform for executable code. The difficulties encountered with the portability of JVM code can be avoided by limiting oneself to the important subset of pure computations, i.e. software that transforms input data into output data but does not interact with its environment in any other way. This observation is important because the scientific aspects of software are always pure computations.

ActivePapers

The goal of the ActivePapers research and development project is to define a platform for publishing and archiving computer-aided research. Such a platform should ideally meet all of the following requirements:

  • A published electronic dataset, in the following called an ActivePaper, should contain all the data, code, and narrative related to a research project, with internal links among all the pieces of information that indicate dependencies and provenance.

  • An ActivePaper should be able to refer to data items in previously published ActivePapers. Such references should allow both re-use and attribution of scientific credit.

  • An ActivePaper should support large datasets by ensuring compact storage and high-performance data access.

  • The representation of executable code inside an ActivePaper should be well-defined, stable, and sufficiently simple to allow implementation on future computing systems with minimal effort. The execution of any piece of code from an ActivePaper should always produce exactly the same results at the bit level.

  • Any code stored in an ActivePaper should be safe to execute, i.e. it should not be able to cause any harm to the computing environment it is executed on.

  • An ActivePaper should contain metadata for provenance tracking and reproducibility.

It is important to note that it is not required that all software used for a computational study be stored in ActivePapers. On the contrary, it is to be expected that important software tools remain forever outside of the ActivePaper universe and work on ActivePapers as data. This includes everything requiring user interaction, from authoring tools to data visualization programs, and also highly machine-specific software such as batch execution managers. It is also possible to write external code accelerators that take code from an ActivePaper and execute it after optimization and/or parallelization, guaranteeing identical results. While the current state of the art does not provide techniques for making such code accelerators both general and efficient, it is possible and even straightforward to write problem-specific code accelerators, which are simply efficient reimplementations (in a language like Fortran or C) of algorithms stored in an ActivePaper, with the equivalence of the results verified by extensive tests.
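
To make the idea of a problem-specific code accelerator concrete, the following sketch (in Python, using a toy mean-square-displacement computation that is not taken from any published ActivePaper) shows a clear but slow reference implementation next to a vectorized reimplementation standing in for optimized Fortran or C code, with a test asserting that both produce the same results within floating-point tolerance.

    import numpy as np

    def msd_reference(positions):
        # Clear but slow reference implementation of the mean-square
        # displacement of a trajectory, written with explicit loops.
        n = len(positions)
        msd = np.zeros(n)
        for lag in range(n):
            terms = [np.dot(positions[t + lag] - positions[t],
                            positions[t + lag] - positions[t])
                     for t in range(n - lag)]
            msd[lag] = sum(terms) / len(terms)
        return msd

    def msd_accelerated(positions):
        # Hypothetical "accelerator": a vectorized reimplementation that
        # stands in for an optimized Fortran or C version.
        n = len(positions)
        msd = np.empty(n)
        for lag in range(n):
            d = positions[lag:] - positions[:n - lag]
            msd[lag] = np.mean(np.sum(d * d, axis=1))
        return msd

    # Equivalence test: the fast version must reproduce the reference
    # results within floating-point tolerance.
    trajectory = np.random.default_rng(0).standard_normal((50, 3))
    assert np.allclose(msd_reference(trajectory), msd_accelerated(trajectory))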

The ActivePapers JVM edition

The original ActivePapers architecture40, which was subsequently implemented in the “ActivePapers JVM edition”, was a proof-of-concept design intended to show that it is possible with existing technology to meet all these requirements. The key design and implementation choices were:

  • An ActivePaper is a file in HDF5 format41. The HDF5 format ensures flexibility, compactness, and high-performance data access.

  • HDF5 dataset attributes are used to store metadata, including a dataflow graph that records provenance.

  • Any data item inside a published ActivePaper can be referenced by the combination of the ActivePaper’s DOI and the HDF5 path to the dataset.

  • Executable code is stored as JVM bytecode. Any other code representation, in particular human-readable source code in any language, is admissible if a compiler or interpreter exists in the form of JVM bytecode.

  • The JVM security model is used to prevent executable code in an ActivePaper from accessing any data outside of the ActivePapers platform. This ensures both security (the user’s computing environment is protected) and the absence of unrecorded dependencies.

  • Individual programs inside an ActivePaper can be declared as data importers, in which case they have unrestricted read access to anything, including local files and network resources. They share the write restrictions of all other code, meaning that they cannot modify anything outside of the ActivePapers platform. Moreover, they are never run automatically, but only on explicit user request.

An implementation of the original ActivePapers platform is available from the ActivePapers Web site12. Its only dependencies are (1) a Java Virtual Machine implementation, (2) the HDF5 library, and (3) JHDF542, a Java interface to the HDF5 library. The ActivePapers software provides a command-line interface for creating ActivePapers, inspecting their contents and metadata, and for running the embedded executable code. This is clearly a minimal working environment. For production use by a wide community of computational scientists, many convenience functions would have to be added: a code and data editor, data visualization, data management, etc.
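
As an illustration of the storage conventions listed above, the following sketch creates a small ActivePaper-like HDF5 file. It uses h5py rather than the JVM tools, purely to illustrate the HDF5 layout; the group names, attribute keys, and the placeholder DOI are illustrative assumptions, not the exact conventions used by the ActivePapers implementations.

    import h5py
    import numpy as np

    with h5py.File("example-paper.h5", "w") as paper:
        # A raw input dataset.
        traj = paper.create_dataset("data/trajectory",
                                    data=np.random.random((100, 3)))
        traj.attrs["ap-type"] = "data"

        # A program stored inside the file (here as plain source text;
        # the JVM edition stores bytecode instead).
        code = paper.create_dataset("code/analysis",
                                    data="result = trajectory.mean(axis=0)")
        code.attrs["ap-type"] = "program"

        # A computed dataset carrying provenance metadata: which program
        # produced it and which datasets it was computed from.
        result = paper.create_dataset("data/result", data=np.zeros(3))
        result.attrs["ap-created-by"] = "code/analysis"
        result.attrs["ap-dependencies"] = ["data/trajectory"]

        # A reference to a data item in another published ActivePaper,
        # combining a DOI (placeholder here) with an HDF5 path.
        ref = paper.create_dataset("data/force_field", shape=(), dtype="f8")
        ref.attrs["ap-type"] = "reference"
        ref.attrs["ap-doi"] = "10.5281/zenodo.0000000"
        ref.attrs["ap-path"] = "/data/force_field"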

An important design decision is related to the management of the metadata that tracks dependencies and provenance. The ActivePapers platform creates and updates this metadata automatically during program execution. From the user’s point of view, an ActivePaper is a collection of datasets and programs, of which the latter can be run individually just like traditional executables or scripts. The ActivePapers platform tracks all data accesses from programs and generates the dependency graph from them. When a program is re-run, typically after modification, all the datasets it generated earlier are deleted automatically. Moreover, when a program reads data generated by another program which has been modified since it was last run, the modified program is re-run automatically to ensure coherence of all data. This automatic dependency handling has worked well in practice. It is the exact opposite of the approach taken by automation tools such as make43, which execute programs according to a manually prepared definition of the dependencies between their results.
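
The following simplified sketch illustrates the kind of dependency handling described above, assuming that the dependency graph and modification times have already been extracted from the recorded metadata; the function names and data structures are hypothetical and do not correspond to the actual ActivePapers implementation.

    # 'deps' maps each computed dataset to the program that created it and
    # the datasets that program read; 'mtime' holds the last-modification
    # times recorded in the metadata.

    def stale(dataset, deps, mtime):
        # A dataset is stale if its creating program, or anything it
        # depends on (recursively), was modified after the dataset itself.
        program, inputs = deps[dataset]
        if mtime[program] > mtime[dataset]:
            return True
        for d in inputs:
            if mtime[d] > mtime[dataset] or (d in deps and stale(d, deps, mtime)):
                return True
        return False

    def ensure_current(dataset, deps, mtime, run_program):
        # Bring a dataset up to date by first updating its inputs and then
        # re-running its creating program if necessary.
        program, inputs = deps[dataset]
        for d in inputs:
            if d in deps:                  # d is itself a computed dataset
                ensure_current(d, deps, mtime, run_program)
        if stale(dataset, deps, mtime):
            run_program(program)           # regenerates 'dataset', updates mtime

    # Example: 'data/result' was computed by 'code/analysis' from
    # 'data/trajectory', which was modified after the result was written.
    deps = {"data/result": ("code/analysis", ["data/trajectory"])}
    mtime = {"data/result": 1.0, "code/analysis": 0.5, "data/trajectory": 2.0}
    ensure_current("data/result", deps, mtime,
                   run_program=lambda prog: mtime.update({"data/result": 3.0}))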

The main difficulty with the original ActivePapers platform is the lack of scientific software compatible with its constraints. All code running inside of the ActivePapers platform must exist as JVM bytecode. All code storing data in an ActivePaper must use the HDF5 library. All code that falls into both categories, which includes in particular the workflow of a specific scientific project, must exist as JVM bytecode accessing the HDF5 library. There is almost no publicly available code that meets these requirements, mostly due to the lack of popularity of the JVM in scientific computing.

The ActivePapers Python edition

In order to gain experience with the ActivePapers approach in practice, a second implementation was developed for the Scientific Python ecosystem44. Its dependencies are the Python language45, the HDF5 library, the h5py library60 for interfacing HDF5 to Python, and the NumPy library46 which is a dependency of h5py. For the Python edition of ActivePapers, all executable code must exist in the form of Python scripts, which access the datasets through the h5py library. Libraries that contain compiled code, which are very common, cannot be placed inside an ActivePaper, but can be declared as an external dependency. This effectively means that the platform required for using an ActivePaper with such a dependency includes that library in addition to the packages listed above. Adding external dependencies is clearly not desirable from a replicability point of view, but it provides a short-term workaround to the fundamental problem that most scientific software is not ready for long-term replicability.
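
As an illustration of the kind of code that fits these constraints, the sketch below is a pure Python/NumPy computation whose only input and output go through HDF5 datasets; it continues the illustrative file used in the earlier sketch and deliberately uses plain h5py, since the actual ActivePapers Python API wraps data access in its own interface.

    import h5py

    # A script fitting the Python edition's constraints: pure Python plus
    # NumPy, with all input and output going through HDF5 datasets.
    with h5py.File("example-paper.h5", "r+") as paper:
        trajectory = paper["data/trajectory"][...]     # read an input dataset
        center = trajectory.mean(axis=0)               # the scientific computation
        out = paper.require_dataset("data/center", shape=(3,), dtype="f8")
        out[...] = center                              # store the result
        out.attrs["ap-created-by"] = "code/center_of_geometry"  # hypothetical name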

The Scientific Python ecosystem provides a large choice of libraries that can be used within these constraints, and the Python language is already very popular for scientific computing, making the ActivePapers Python edition a good vehicle for testing the ActivePapers approach on real research projects. On the other hand, the Python edition cannot fulfill all the requirements listed above. In particular, the Python language lacks sufficiently strong security mechanisms to implement a useful level of user protection. A more subtle problem is the stability of the platform itself. The Python language has no formal specification and in fact evolves together with its principal implementation. The scientific libraries, in particular NumPy, also evolve rather rapidly, with only moderate efforts to maintain compatibility with older versions. The ActivePapers platform records the version of all libraries that were used in the preparation of an ActivePaper, but the long-term usability of these versions is questionable, as in general only the current versions can be expected to work in current computing environments.

The Python edition of the ActivePapers platform has been used for several research projects in the field of biomolecular simulation, some of which have already been published47–49. Each publication has one or more ActivePaper files deposited as supplementary material, but all the files are also available in digital repositories with DOIs. Among the published ActivePapers, there are software libraries50,51, a database of protein structures52, and combinations of datasets and code that document computational studies53–55. Additional published ActivePapers contain obsolete versions of the pyMosaic library56–58. These files remain permanently available because other ActivePapers depend on them. They also remain usable for as long as the underlying platform remains compatible.

One problem encountered in the course of these research projects is the relatively low size limit that today’s digital repositories impose on archived files. Zenodo4 provides the most generous limit of 2 GB per file. However, the input data for one study55 contains ten Molecular Dynamics (MD) trajectories for lysozyme in solvent, and requires 10 GB of storage even in compressed form. Since these data were not essential for the subsequent analysis step, which requires only the rigid-body motion of the protein, they were removed from the published files. The alternative would have been to publish each MD trajectory separately as an ActivePaper, and use DOI-based references in the analysis step to refer to this data. Because such references are nearly transparent to the user (the dependencies are downloaded automatically when needed), file size limits apply in practice only to individual HDF5 datasets.

ActivePapers proposes another mechanism to reduce file sizes: the deletion of recomputable datasets. Any dataset that was generated by a program stored in an ActivePaper can be replaced by a dummy dataset that retains only the dependency metadata. The full dataset can be re-computed on demand, or automatically when another program tries to read it. Recomputation consists in rerunning the program that generated the dataset initially. This mechanism makes sense only if the replicability of a dataset is guaranteed. In practice, this applies to any program that does not use floating-point operations. The latter are insufficiently specified in most of today’s programming languages, including Python, and therefore floating-point computations can produce different results when the same program is run on two different computers.
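
A minimal sketch of this mechanism, again using plain h5py and the illustrative attribute names from the earlier sketches rather than the real ActivePapers conventions, replaces a computed dataset by a small placeholder that retains the provenance metadata needed for recomputation.

    import h5py

    def make_dummy(paper, name):
        # Replace a recomputable dataset by a small placeholder that keeps
        # only the provenance metadata.
        saved_attrs = dict(paper[name].attrs)   # keep dependency/provenance info
        del paper[name]                         # drop the bulky data
        dummy = paper.create_dataset(name, data=0)
        for key, value in saved_attrs.items():
            dummy.attrs[key] = value
        dummy.attrs["ap-dummy"] = True          # marks it as recomputable on demand

    with h5py.File("example-paper.h5", "r+") as paper:
        make_dummy(paper, "data/result")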

Future developments

Experience with the two current implementations of the ActivePapers idea has shown that all of the requirements defined at the outset can be fulfilled and that the approach works well in practice. In particular, the ActivePapers project has shown that installation-free software deployment and long-term software conservation are possible, contrary to a common belief in the scientific computing community. As mentioned earlier, ActivePapers can achieve these goals because the science part of software takes the form of pure computations, which can be performed identically in all computational environments.

The existence of two distinct ActivePapers platforms is an historical accident that is clearly not desirable. The envisaged solution is a split of the ActivePapers platform into two parts: a data publishing system, which defines the HDF5 conventions for ActivePapers, in particular the metadata, and a code execution system that defines how specific datasets in ActivePapers are interpreted as executable code. Only the second part would differ between the current two implementations, and its separation also opens the way for additional execution systems for other code representations. There is in fact no fully satisfying code representation for scientific computations at this time, which is a strong argument for flexibility in the platform definition.

One line of future development is the integration of a narrative into the computational methods stored in an ActivePaper. Work on integrating the ActivePapers Python edition with the Jupyter project26 (formerly the IPython notebook) is underway. Unfortunately, non-fundamental technical issues make this a non-trivial project: the various components (HDF5, Python, Jupyter) have different and conflicting requirements and restrictions concerning concurrency. Aside from these software engineering issues, the main question to be solved is how to reconcile the interactivity of the notebook approach with the permanence requirements of the scientific record. The coherence of code and results in a notebook is guaranteed only if it has been executed linearly from start to end. Any interactive manipulation generally results in a non-replicable state. Two solutions are currently being explored. The first solution marks notebooks as non-replicable except when executed linearly. No ActivePaper containing such non-replicable notebooks should be accepted by a digital repository. The second solution is to record all interactive code execution in a log, which can then be replayed. After a complete linear execution of the notebook code, the log of interactive executions is deleted.

Another direction for future developments explores how to provide a realistic transition from today’s scientific computing environments to future ones that take into account the needs for publishing and archiving computations. One important advantage of the ActivePapers approach in this context is that the minimal requirements for adopting it are modest: any software tool that can work with the ActivePapers file format, which is HDF5 plus a small set of conventions, can read and write publishable datasets. With a very small additional effort, software tools can be adapted to handle ActivePapers metadata and thus ensure dependency and provenance tracking. None of this requires that the software live inside the ActivePapers platform. The challenge for future ActivePapers developments is to facilitate the transition of computational methods from subroutines hidden inside software tools to precise specifications that become part of the scientific record.

Conclusion

As I have pointed out earlier21, today’s scientific software fulfills two distinct roles: it is a tool that permits doing computations, but also the only precise and complete description of the models and methods applied in these computations. This situation is the result of the growing complexity of computational methods in science, which makes the documentation of these methods in the traditional narrative of a journal article impossible. Understanding and evaluating computational science requires the possibility both to read the source code of all software and to run it on suitable input data. A useful documentation of computational science in the scientific record thus requires archiving all software parts that have an influence on computational results, in a form that can be both inspected and executed, for as long as the study remains relevant for science, which is typically several decades. The ActivePapers project has shown that these goals are achievable in principle using existing technology. It has defined two variants of a platform that gives computational methods the status of publishable content with well-defined data formats that guarantee long-term replicability. However, it has also shown that the vast majority of today’s scientific software is not easily integrated into such a platform. The main reason is that most of the computing technology used by scientists was developed outside of scientific research, for domains of application where replicability is not important.

A key ingredient in the transition from the current state of the art, in which scientific software cannot be fully archived in the scientific record, is a clear distinction between scientific models and methods on one hand, and software tools on the other hand. It is only the models and methods that need to be archived, not the tools. The long-term usability of the models and methods is guaranteed by a complete and precise specification of their data formats, rather than by a preservation of the tools that work on them. Computational tools must in fact evolve with the progress of technology in order to remain useful to the communities that develop and apply them59. This distinction is completely analogous to how other digital content is handled. We archive articles in PDF format, movies in MPEG format, or protein structures in mmCIF format because these formats are well documented and allow anyone, at any point in time, to interpret the archived contents, even if today’s software tools are no longer usable because of the inherent instability of computational environments.

What distinguishes computational models and methods from articles, movies, and protein structures is their algorithmic nature, which makes them look like “software” rather than “data”. However, this distinction between software and data, although deeply ingrained in the habits of computational scientists, is not fundamental: software is just a specific kind of data, defined by the existence of some mechanism to execute it. It is very common to use software that treats other software as data, e.g. compilers, interpreters, workflow managers, debuggers, etc. In the context of scientific communication, we should treat software exactly like other kinds of data. The fundamental distinction is not “software” vs. “data”, but “computational tool” vs. “scientific content”.

Publishing and archiving scientific results has always involved an additional effort compared to keeping personal records. Experimentalists do not publish their raw lab notebooks with the user manual of their scientific equipment as an appendix. Theoreticians do not submit scans of their hand-written notes for publication. Publication always implies presenting the work that has been done and its results in a form that is understandable to and usable by other scientists. The same principle applies to computational science, whose practitioners need to be prepared to invest additional effort to make their libraries, programs, and scripts suitable for publishing.
