Ten recommendations for organising bioimaging data for archival

Paul K. Korir; Andrii Iudin; Sriram Somasundharam; Simone Weyand; Osman Salih; Matthew Hartley; Ugis Sarkans; Ardan Patwardhan; Gerard J. Kleywegt

doi:10.12688/f1000research.129720.1

Opinion Article

[version 1; peer review: 4 approved with reservations]

Paul K. Korir¹, Andrii Iudin¹, Sriram Somasundharam¹, Simone Weyand¹, Osman Salih¹, Matthew Hartley¹, Ugis Sarkans¹, Ardan Patwardhan¹, Gerard J. Kleywegt¹

PUBLISHED 23 Oct 2023

¹ EMBL-EBI, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK

Paul K. Korir
Roles: Conceptualization, Formal Analysis, Methodology, Software, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

Andrii Iudin
Roles: Data Curation, Writing – Review & Editing

Sriram Somasundharam
Roles: Writing – Review & Editing

Simone Weyand
Roles: Data Curation, Writing – Review & Editing

Osman Salih
Roles: Data Curation, Writing – Review & Editing

Matthew Hartley
Roles: Writing – Review & Editing

Ugis Sarkans
Roles: Writing – Review & Editing

Ardan Patwardhan
Roles: Funding Acquisition, Project Administration, Supervision, Writing – Review & Editing

Gerard J. Kleywegt
Roles: Funding Acquisition, Project Administration, Supervision, Writing – Review & Editing

This article is included in the ELIXIR gateway.

Abstract

Organised data is easy to use, but the growth of bioimaging, driven by improvements in instrumentation, detectors, software and experimental techniques, has resulted in an explosion in the volumes of data being generated, making good organisation an elusive goal. This guide offers a handful of recommendations whose implementation would contribute towards better organised data in preparation for archival. Based on our experience archiving large image datasets in EMPIAR, the BioImage Archive and BioStudies, we propose a number of strategies that we believe would make future data depositions more useful to the bioimaging community and that may also find use in other data-intensive disciplines. To facilitate the process of analysing data organisation, we present bandbox, a Python package that provides users with an assessment of their data by flagging potential issues that could be addressed before archival.

Keywords

Organising data, public archiving, data deposition, open data, bioimaging, EMPIAR, BioImage Archive, BioStudies

Corresponding authors: Ardan Patwardhan, Gerard J. Kleywegt

Competing interests: No competing interests were disclosed.

Grant information: This work was supported by UKRI-MRC and UKRI-BBSRC (grants MR/L007835/1 and MR/P019544/1), the Wellcome Trust (grant 221371/Z/20/Z), and EMBL through contributions from its member states.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2023 Korir PK et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Korir PK, Iudin A, Somasundharam S et al. Ten recommendations for organising bioimaging data for archival [version 1; peer review: 4 approved with reservations]. F1000Research 2023, 12(ELIXIR):1391 (https://doi.org/10.12688/f1000research.129720.1) First published: 23 Oct 2023, 12(ELIXIR):1391 (https://doi.org/10.12688/f1000research.129720.1) Latest published: 27 Feb 2024, 12(ELIXIR):1391 (https://doi.org/10.12688/f1000research.129720.2)

Introduction

Resources such as EMPIAR (Iudin et al., 2016; Iudin et al., 2023) and the BioImage Archive (Ellenberg et al., 2018; Hartley et al., 2022) provide a valuable service to the life-science community by supporting the archival and reuse of imaging data, often acquired at considerable cost, in line with the aspirations of the FAIR Guiding Principles (Wilkinson et al., 2016). There are numerous advantages and benefits to reusing bioimaging data, including more economical use of limited resources such as instrumentation and highly skilled technical staff. Moreover, specimens may be unique, costly to acquire, or difficult to reproduce, meaning that such data may only be accessible via archives. Archived data can be mined for reanalysis, verification and validation, and for development of new analytical techniques and software tools, such as machine learning model training. Reuse of such data may also lead to improvements in how it is produced, both technologically and methodologically. As practitioners in bioimaging data archiving, it is our experience that handling large datasets presents several data-management challenges, particularly in recent years with the rapidly increasing volumes of bioimaging data (Ellenberg et al., 2018). For example, it took eight years for EMPIAR to archive a total of one petabyte of data, but the second petabyte took only 14 months (Iudin et al., 2023). Bioimaging datasets may comprise numerous and sometimes very large files in a variety of, sometimes proprietary, formats. Individual files may include multiple channels and time points and data and metadata from several specimens. Besides the raw image data, there may also be a need to archive processed data, reconstructed 3D volumes, segmentations, particle stacks and other derived or related data.

There are two related but distinct avenues for organising data: labelling (metadata) and arranging data items (order). Metadata are essential to make the data useful even though metadata standards are difficult to enforce. Therefore, metadata standardisation has received a lot of attention with initiatives such as Bioschemas, an effort to improve findability of datasets via standardised textual annotations, and MIAME (Brazma et al., 2001), recommendations for minimal metadata describing a microarray experiment, and the overarching FAIR Guiding Principles (Wilkinson et al., 2016). For bioimaging, REMBI (Sarkans et al., 2021) provides community-supported recommendations on how to describe all aspects of bioimaging experiments including sample preparation, data processing and analysis. Whereas there are several ongoing efforts towards standardising bioimaging data formats (OME-NGFF (Moore et al., 2021), DVID (Katz & Plaza, 2019), HDF5 (Pietzsch et al., 2015), etc.), we know of no efforts towards harmonising how to organise datasets for maximum usefulness with archival in mind. The organisation (order) of data is usually taken for granted and it falls upon refinements of the metadata to bear the burden of meaningfully describing the data.

Motivation

Good organisation (order) of data improves its usefulness and is the responsibility of the data depositors. Depositors are best placed to present data in a way that adequately captures the experimental design and outcomes. Organising a dataset to minimally convey a structure in line with the actual experimental output can improve its usability while the bulk of meaningful attributes can be expressed in the metadata. The degree of usefulness depends directly on the quality of organisation, and thoughtful consideration of users’ needs improves that usefulness. Good organisation also gives a dataset transparency and understandability: users are able to immediately distinguish the various experimental categories as well as plan how to analyse the data. Therefore, it helps to have a clear perspective of the various classes of users.

In general, we consider three classes of users: intra-domain scientists, inter-domain scientists and extra-domain scientists (Datta et al., 2021). (For the purpose of this article, we will refer to any such user of a dataset as a ‘scientist’, interested in extracting some knowledge from the archived data.) Intra-domain scientists are familiar with key attributes of the data and may be able to quickly assess the usefulness of a dataset. An example would be a structural biologist mining an electron cryo-tomogram to extract sub-volumes that have not been previously studied. Inter-domain scientists may want to mine the data for purposes tangential to some other domain. For example, a genomicist may want to include structural analyses in a genomics study and may turn to raw imaging data to accomplish this. Extra-domain scientists are only interested in data for its technical properties, i.e., for some purpose completely unrelated to the original purpose of the data’s collection. A computer scientist, for example, may want to assess the performance of a learning algorithm on fluorescence microscopy images when performing some classification task. It is likely to be a challenge to optimise the organisation of data for all classes of users simultaneously. In practice, organising the data to be useful to scientists with the least familiarity with the domain will most likely advance its usefulness for all classes of scientists and can thus be a good aspiration.

The task of organising data consists of making trade-offs in the use of ‘ways and means’ of effecting the organisation. We will refer to these ‘ways and means’ as organisational resources. A simple example would be the use of alphabetic ordering when organising a set of strings; the natural ordering according to some alphabet is our organisational resource and we exploit the fact that most users will perceive this as a convenience when traversing the data. In less trivial organisational tasks, we need to express complex relationships between the entities to be organised. For instance, a dataset that consists of the experimental measurements resulting from a sequence of treatments on a set of specimens measured at various points in time requires the use of specimen, treatment and time point identifiers as well as other experimental aspects (data formats, alternative perspectives, transformations of the data such as changes in units, etc.) to be captured in such a way as to preserve the main experimental relationships. In that case, we can expand our set of organisational resources to include a folder hierarchy and file formats in addition to the set of symbols (letters, numerals, punctuation, literal symbols, uppercase and lowercase and so on) used to create the various identifiers. Ideally, we would like to keep repetition to a minimum so that the nature of the experiment can be readily discerned.

The manner in which organisational resources are used affects the usability of the resulting organisation: using too few of them will obscure the meaning of the organisation while using too many will overwhelm potential users. For example, including redundant folders along any part of the hierarchy (folders that contain only a single folder which in turn contains the actual data) makes it tedious to navigate through a dataset. On the other hand, dumping all files into one folder will make it difficult for the end user to distinguish between groups of semantically related files, especially when thousands of files are present. Similarly, naming files and folders by referring to entities inaccessible to their intended users (e.g., private machine names or private accession codes that external users will not have access to or even fathom) consumes precious ‘name space’ without conveying any useful information. Organising data is thus an investment of time and effort with the ultimate aim of improving the usefulness of the data.

We can therefore formulate the organisation task as follows: given a set of related data items associated with an experiment, how may they be organised to best convey their relationships using as few organisational resources as possible while maximising their usability?

To achieve this, we define the term facet to refer to the various attributes germane to the experiment which may be included in the folder and file names. A non-exhaustive list of facets includes: specimen names (organism, tissue, cell type/line), experimental roles (treatments vs. controls), time (developmental status, date, elapsed time), processing status (raw data, by algorithm, procedure), generally available equipment (microscopes, detectors, preparation equipment model names), replicates, file types (3D volumes, particle stacks), names of software used for processing, and so on.

This guide attempts to solve the organisation task by providing 10 recommendations that arise from our experience of handling hundreds of large image datasets in the public archives EMPIAR, BioImage Archive and BioStudies (Sarkans et al., 2018). Ideally, we would like to organise potentially numerous and voluminous data to maximise ease of use so as to facilitate the user’s ability to:

1. quickly identify the suitability of (subsets of) the data;
2. clearly distinguish between the various facets of the data;
3. quickly verify the usefulness of the data (e.g., thumbnails, previews, summaries, READMEs, LICENCE files);
4. retrieve only relevant subsets of the data.

This guide does not offer any recommendations for a detailed schema to describe experimental and analytical procedures; those may be captured in metadata for the various archives. Neither does it describe how to decide which experimental facets are appropriate (these are part of the experimental design), nor does it attempt to describe how to achieve organisation for automated analysis (we assume that the resulting organisation will be consumed by humans). It also ignores the universe of image formats in use and mainly includes examples from our experience archiving bioimaging data, but we anticipate it may be useful across other imaging disciplines. Good organisation improves data structure and format predictability and may facilitate automated processing. Therefore, our guide is intended to lead towards best practices rather than serve as a framework. Finally, this guide does not aim to achieve standardisation. We believe it is more practical to have a set of best practices and leave it up to the data authors to decide how best to apply them.

We believe that the recommendations outlined here may be of value to two principal groups of users: 1) data depositors, who need to design and prepare their data to improve its usability to the community, and 2) technologists (hardware, software and methods developers), who, by considering these recommendations in their designs, can greatly facilitate data organisation at the source.

To make our recommendations practical, we have developed bandbox (Korir et al., 2022), an open-source command-line interface (CLI) tool to help users understand how they can improve the organisation of their data in preparation for archival. The program offers two CLI commands: view and analyse. The view command displays a tree of a directory and all its contents; for every non-empty directory with files, bandbox provides a summary of the number of files in it, including a list of all the file formats encountered. The analyse command provides a listing of possible issues grouped into categories in line with those specified in the Recommendations section of this article. bandbox examines the tree associated with the nested hierarchy of files and folders in a dataset and then concurrently runs various heuristics on the tree which are controlled by configurations that the user may modify. The results produced by the analyse command are only suggestions for improvement; we understand that there may be practical limitations to implementing some of the suggested improvements as well as good reasons for keeping the data as is. Figures 1 and 2 show screenshots of the results of running bandbox on two different datasets.
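To give a sense of the kind of per-directory summary described above, the following is a minimal Python sketch. It is not bandbox's implementation: the function names `summarise_paths` and `list_files` are hypothetical and are used here only for illustration, and the report format is an assumption.

```python
from collections import defaultdict
from pathlib import Path, PurePosixPath


def summarise_paths(file_paths):
    """Map each folder to (number of files, sorted list of extensions seen).

    file_paths are POSIX-style relative paths such as 'raw/a.tif'.
    Only folders that directly contain files appear in the result.
    """
    by_folder = defaultdict(list)
    for path in map(PurePosixPath, file_paths):
        # Files without an extension are reported as '(none)'.
        by_folder[str(path.parent)].append(path.suffix.lower() or "(none)")
    return {
        folder: (len(extensions), sorted(set(extensions)))
        for folder, extensions in sorted(by_folder.items())
    }


def list_files(root):
    """Collect POSIX-style relative paths of all files below root."""
    root = Path(root)
    return [p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()]


def print_summary(root):
    """Print one line per folder: file count and formats encountered."""
    for folder, (count, extensions) in summarise_paths(list_files(root)).items():
        print(f"{folder}: {count} file(s) [{', '.join(extensions)}]")
```

Calling `print_summary("path/to/dataset")` on a local copy of a dataset prints one line per folder that holds files, which is often enough to spot folders with unexpectedly many files or an unexpected mix of formats.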

Figure 1. Example of a dataset with organisational red flags identified by `bandbox` such as use of spaces or non-ASCII characters, redundant directories and so on with an indication of the number of such entities found.

The actual example dataset is provided with the bandbox source code (ASCII – American Standard Code for Information Interchange).

Figure 2. Example of a dataset with no red flags as inferred by `bandbox`.

The actual example dataset is provided with the bandbox source code.

Recommendations

We will motivate our guide by referring to a fictitious EMPIAR dataset. This dataset has a clear structure, but we propose that it can be further improved following the recommendations in the guide below.

Our goal is to improve the file/folder structure shown in Figure 3 to better convey the relationships between the experimental facets while economising the organisational resources available. For clarity, we have refrained from listing several thousand raw TIFF files in the folders designated ‘Raw’.

Figure 3. Illustration of some of the ways in which subtle features of data organisation impact its usability.

Issues include: 1) long file/folder names with spaces, non-ASCII characters (ö) and redundant directories (ASCII – American Standard Code for Information Interchange); 2) obscure sequences with inconsistent spelling; 3) inconsistency in folder hierarchy; 4) obscurity through meaningless symbol sequences; 5) verbosity in names; 6) subtle differences in spelling (in this case, a hyphen); and 7) inconsistency in typography due to character case and inclusion of different separator characters, e.g., spaces vs hyphens. See text for more details.

The example dataset illustrates several properties of its organisation that undermine the goal of being usable:

• Verbosity, typically presented by repetition of references which may be resolved using the file hierarchy, such as:
  ○ Folders containing only a single folder which in turn contains the folder with the actual data. The child folder of ‘data’ only has the folder ‘A U Thör et al …’ in it that contains the folder ‘A folder with an overall description’ which has the actual data.
  ○ Very long names of files/folders. The full path of the file ‘data/A U Thör et al - A very long relevant title that has most of the keywords in your paper/A Folder with an overall description/0923480928 - Treatement Tr1-323 Tissue/0923480928_Treatement_Tr1323_Organelle1-topology1.zip’ is ‘0923480928 Treatement Tr1-323 Segmentation/0923480928_Treatement_Tr1323_Organelle1-topology1.zip’, which might be outside the limits of legacy software; e.g., IMOD (Mastronarde, 2006) has a limit of 320 characters for input file names.
  ○ Repetition of identifiers along the path. In the previous example, half of the files repeat the identifier ‘0923480928’ that conveys no meaningful information and which, if required at all, should only appear in the appropriate parent folder name.
• Ambiguity occurs through incomplete identifiers either due to typos or non-standard characters.
  ○ Is ‘Tr1-323’ the same as ‘Tr1323’?
  ○ Use of spaces and non-ASCII characters can make processing the data complicated because of how software may handle path names with spaces. ASCII stands for the American Standard Code for Information Interchange; it defines 128 characters, including the unaccented Latin letters, the digits and common punctuation marks.
• Inconsistency is perhaps the most common issue and is usually the result of manually introduced errors such as changes in spelling, e.g., naming similar folders ‘tomo’ and ‘tomogram’ for related files. In the above example we have:
  ○ ‘Topology’ and ‘topology’
  ○ ‘Treatment’ vs ‘Treatement’
  ○ ‘Tr1-323’ and ‘Tr1323’
  ○ Inconsistency may also be observed in folder structure. For example, only one of the treatment folders (the one with ‘3738932082’ in the name) has an extra child folder, breaking the trend of the others.
• Obscurity tends to occur through the use of identifiers with no obvious meaning, e.g., references to external resources such as figure numbers in a related paper, machine identifiers, script names, etc.
  ○ The numerical identifiers such as ‘0923480928’ have no obvious meaning in the context of the dataset.
  ○ ‘Tr1-323’ may be an external reference but its meaning is unclear.

Understandably, in certain cases it may be useful to keep seemingly obscure identifiers because they convey additional information. For example, in electron cryo-microscopy (cryo-EM) pipelines, the dataset may consist of multiple subsets obtained with different open-source software, e.g. particle picking by EMAN2 (Tang et al., 2007), beam-induced motion-correction by MotionCorr (Li et al., 2013), contrast-transfer function (CTF) correction by gCTF (Zhang, 2016), classification by RELION (Scheres, 2012), reconstruction by cryoSPARC (Punjani et al., 2017), etc.
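Some of the ambiguity and inconsistency issues listed above can be flagged automatically. The sketch below is one possible heuristic, not the method used by bandbox: it splits names into tokens and groups tokens that become identical once case and separators are ignored. The function names `normalise` and `find_variants` are hypothetical, and the third name in the usage example is an illustrative variant rather than a file from the example dataset.

```python
import re
from collections import defaultdict


def normalise(token):
    """Lowercase a token and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", token.lower())


def find_variants(names):
    """Group tokens that differ only by character case or separator characters.

    Names are split on whitespace, underscores and slashes; hyphens are kept
    inside tokens so that 'Tr1-323' and 'Tr1323' can be compared.
    Returns {normalised_token: sorted list of distinct spellings}.
    """
    groups = defaultdict(set)
    for name in names:
        for token in re.split(r"[\s_/]+", name):
            key = normalise(token)
            if key:  # skip tokens made up of punctuation only
                groups[key].add(token)
    return {key: sorted(spellings) for key, spellings in groups.items() if len(spellings) > 1}


if __name__ == "__main__":
    example = [
        "0923480928 - Treatement Tr1-323 Tissue",
        "0923480928_Treatement_Tr1323_Organelle1-topology1.zip",
        "Organelle1-Topology1.zip",
    ]
    for key, spellings in find_variants(example).items():
        print(key, spellings)
```

This heuristic does not catch spelling variants such as ‘Treatment’ vs ‘Treatement’; those would need an approximate string comparison, for example with the standard-library `difflib` module.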

The 10 recommendations we present below are divided into four groups: planning (one recommendation), structure (three recommendations), naming (three recommendations) and miscellaneous (three recommendations). We have provided further guidance within each group for related concepts.

Planning

(1) Design before data collection. Plan beforehand, if possible, how the data may be structured.

a. If the experimental facets are known prior to data collection, the organisation suggestions that follow below will be easier to apply once and for all; it is harder to reorganise data, especially voluminous data on multiple networked drives or in a cloud resource after collection. At a minimum, consider organising the few top-level directories in terms of the experimental facets prior to archival.
b. Consider employing a naming convention within a research group to ensure that data is consistent between creators of the data. This can even be specified in the microscope’s software to include imaging parameters in the file names automatically such as a base name, date and/or time, imaging parameters (e.g., resolution, section size) or even free text, among many others. We invite software vendors/creators who have not already done so to consider incorporating organisational concerns into their software that take these recommendations into account.

Structure

This section contains recommendations to address the hierarchical organisation of files and folders only.

(2) Containing folder. Consider having one parent folder in which all sub-datasets are located. Such a container folder is also a good location for auxiliary data that apply to the whole collection, such as README or integrity (see recommendation 10) files, which provide users with the context of the data organisation.

(3) Folder depth.

a. Consider limiting the folder depth to a reasonable maximum. As a rule of thumb, three to four directory levels is adequate for most applications but the fewer the better.
b. Consider excluding folders which do not convey any additional information. For example, consider a dataset having only TIFF files. Including a folder called tiff in the path <condition>/tiff/files*.tif is redundant. By contrast, if the file format is instrumental then <condition>/<format1>/<files_of_format1> and <condition>/<format2>/<files_of_format2> and so on is meaningful.
c. Consider an upper limit on the number of files in a folder and if necessary split large directories so they do not contain more than a certain maximum number of files (e.g., 10,000). If, for instance, a folder has one million files then it may instead be organised as a folder (parent_folder) with 100 sub-folders (child00 to child99), each containing 10,000 files. This is important because different file systems have different tolerances for handling large numbers of files. For example, the Second Extended Filesystem (ext2) has a ‘soft’ upper limit of roughly 10,000 to 15,000 files per directory because of the extra overhead when processing such large folders (The Second Extended Filesystem — The Linux Kernel Documentation). While modern file systems are capable of handling larger numbers of files, the re-usability of the data will increase when taking into account systems with more modest resources, such as web browsers that may need to list or process all files in a directory.
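The split described in 3c can be planned before any file is moved. The following is a minimal sketch under stated assumptions: the function name `plan_split` and the `child` prefix are illustrative, files are assigned in sorted order, and the function only returns a mapping from old to new relative paths without touching the file system.

```python
def plan_split(file_names, max_per_folder=10_000, prefix="child"):
    """Assign file names to numbered sub-folders of at most max_per_folder files.

    Returns {original_name: 'childNN/original_name'}. The sub-folder index is
    zero-padded (at least two digits) so that the sub-folders sort correctly.
    """
    ordered = sorted(file_names)
    # Ceiling division: number of sub-folders needed (at least one).
    n_folders = max(1, (len(ordered) + max_per_folder - 1) // max_per_folder)
    width = max(2, len(str(n_folders - 1)))
    return {
        name: f"{prefix}{index // max_per_folder:0{width}d}/{name}"
        for index, name in enumerate(ordered)
    }
```

With one million file names and the default limit of 10,000, the plan uses the sub-folders child00 to child99, as in the example above. Reviewing the returned mapping before applying it (for instance with `pathlib.Path.rename`) keeps the operation reversible.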

(4) Folder contents.

a. Consider grouping related files unless it is instrumental to keep them separated. For example, group files by specimen, filetype, experimental purpose (treatment, control), etc. It may be instrumental to separate different data types into different folders (e.g., one for micrographs and one for particle stacks). Further sub-folders may be necessary for single- and multi-frame micrographs, unaligned and aligned micrographs, etc.
b. Consider depositing data from different experimental techniques/sub-techniques as separate archive entries (e.g., single-particle data in one, tomography data in another). Most archives allow multiple separate entries to be linked or grouped.

Naming

In this section, we provide some suggestions to improve the naming of files and folders.

(5) Meaningful names.

a. Consider naming files and folders using meaningful identifiers without specifying external references. For instance, while the name ‘Figure 5’ probably refers to a paper describing (some of) the data, users will require access to that manuscript, which may be behind a paywall. The names of files and folders should exclude any references that are tied to the instrument or your organisation, which are at best unhelpful for external users.
b. Consider avoiding ambiguous attributes such as dates and times particularly in folder names. Mass renaming of files with dates and times can become non-trivial particularly if such attributes have subtle variations for related files (e.g., as date/time stamps are incremented) and is therefore best avoided.

(6) Naming symbols.

a. Consider confining names to lowercase letters and numerals and replacing all spaces with underscores or hyphens at meaningful word (group) boundaries, as this makes the data substantially easier to work with. Preferably, use underscores only for word boundaries and reserve hyphens for keywords or other key attributes, such as specimen names, that are themselves written with a hyphen, e.g., covid-19. Consistent use of case also improves readability (Deissenboeck & Pizka, 2006).
b. Consider avoiding certain characters which could lead to unintended consequences during processing such as ampersands (&), spaces, exclamation marks (!) and question marks (?). In general, stick to printable alphanumeric ASCII characters and avoid non-ASCII characters (e.g., ü, å or non-Roman scripts).
c. Consider avoiding periods in names as this can lead to unpredictable behaviour for instance when attempting to determine formats. For example, while it is generally well known that the file file.tar.gz has two standard extensions, it may not be as widely known that file.ome.tiff, file.ome.tf2, file.ome.tf8 and file.ome.btf are all valid multi-extension bioimaging formats (OME-TIFF Specification — OME Data Model and File Formats 6.2.2 Documentation).
d. Consider an upper limit on the length of file and folder names. We propose a working upper limit of 50 characters. Even though modern file systems allow considerably longer names (commonly up to 255 characters or bytes per name), end users will still struggle to type very long names, which increases the likelihood of transcription errors. In some cases, software that is widely used by the bioimaging community imposes limits on the number of characters for file paths, e.g., IMOD (Mastronarde, 2006) imposes a file path limit of 320 characters. Bear in mind that, increasingly, users will interact with datasets via a web browser, which also has a practical limit (based on the device’s memory) on the number of files that can be selected in the browser’s select dialog.
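The naming suggestions in recommendation 6 lend themselves to a simple automated check. The sketch below is illustrative only and is not how bandbox performs its checks: the function names `check_name` and `suggest_name` are hypothetical, the list of special characters is limited to the three named in 6b, and periods (6c) are deliberately not checked because extensions need them.

```python
import re
import unicodedata

MAX_NAME_LENGTH = 50  # working upper limit proposed in recommendation 6d


def check_name(name):
    """Return a list of issues found in a single file or folder name."""
    issues = []
    if not name.isascii():
        issues.append("non-ASCII character")
    if re.search(r"\s", name):
        issues.append("whitespace")
    if re.search(r"[&!?]", name):
        issues.append("special character")
    if name != name.lower():
        issues.append("uppercase letter")
    if len(name) > MAX_NAME_LENGTH:
        issues.append("too long")
    return issues


def suggest_name(name):
    """Propose a lowercase ASCII name with underscores instead of spaces."""
    # Decompose accented letters (e.g. 'ö' becomes 'o' plus a combining mark),
    # then drop everything that is not ASCII.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    cleaned = re.sub(r"\s+", "_", ascii_name.strip().lower())
    # Keep only lowercase letters, digits, underscores, periods and hyphens.
    cleaned = re.sub(r"[^a-z0-9_.-]", "", cleaned)
    return re.sub(r"_{2,}", "_", cleaned)
```

A suggested name is only a starting point: dropping characters can make two different names collide, so the output should be reviewed before any renaming is carried out.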

(7) Identity.

a. Ensure consistency when naming files and folders so that similar folders at different depths have the same names.
b. Do not include personal identifiers in folder or file names.
c. Some words to consider for exclusion in the names of internal files/folders: ‘files’, ‘data’, ‘images’ etc. or other words that convey no additional meaningful information.
d. Think of folder names as applying to all the folders and files they contain as well: there should be no repetition in nested folder names, as there is in, e.g., data/control.a/control.a.1/control.a.1.value/data/.
e. When providing 3D data as slices, consider zero-padding the slice identifiers, which facilitates correct assembly. For example, consider an image stack with 1000 images at a resolution of MxN representing sections/slices of some volume; splitting this file should result in files of the form file0001.tif to file1000.tif. If zero-padding is missing or done incorrectly (file1.tif to file1000.tif), the order of slices will be lost on operating systems that apply lexicographic rather than numerical sorting. This can be fixed using the rename shell utility from util-linux, e.g., rename file file00 file??.tif will rename file10.tif to file99.tif as file0010.tif to file0099.tif; analogous commands handle the names with one and three digits. This version of rename is available on most Linux distributions (some distributions ship a Perl-based rename with a different syntax) and may be installed on macOS using Homebrew or from the source code. On Windows systems the Bulk Rename Utility can be used.
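As a platform-independent alternative to the rename utility, the zero-padding in 7e can be sketched in Python. The function name `zero_pad` is hypothetical; the sketch assumes that the slice number is the last run of digits before the extension and, like the other examples, only returns a mapping instead of renaming anything.

```python
import re

# prefix (lazy), the last run of digits, and an optional single extension
_TRAILING_NUMBER = re.compile(r"^(.*?)(\d+)(\.[^.]+)?$")


def zero_pad(names, width=None):
    """Map each name to a version whose trailing number is zero-padded.

    If width is not given, the length of the longest number found is used.
    Names without a trailing number are mapped to themselves.
    """
    matches = {name: _TRAILING_NUMBER.match(name) for name in names}
    if width is None:
        width = max((len(m.group(2)) for m in matches.values() if m), default=0)
    return {
        name: f"{m.group(1)}{int(m.group(2)):0{width}d}{m.group(3) or ''}" if m else name
        for name, m in matches.items()
    }
```

For the file names file1.tif to file1000.tif the padded names run from file0001.tif to file1000.tif, so lexicographic and numerical order agree. The mapping can then be applied with `os.rename` or `pathlib.Path.rename` once it has been checked.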

Miscellaneous

Finally, this section includes some tips on how to handle other aspects of organisation not covered in the previous sections.

(8) Friendly file formats.

a. Consider providing images in widely used file formats. If you are demonstrating a novel file format, it may be necessary to first get in touch with the archive to plan the deposition. The archive may request additional information that gives users guidelines on how to use and visualise files in the new format, including any available conversion tools, and may ask for the same data to be provided in a widely used file format as well.
b. Even for file types that are widely used, it may be helpful to stick to open formats to ensure that users without access to proprietary software will have access to the data.

(9) Document your data.

a. Consider including a README text file which provides an overview of how the data is organised.
b. Consider testing the usability of your data by asking a colleague to peruse your data to assess whether the organisation is clear. This can be achieved by asking the tester to describe their understanding of what the data presents.

(10) Integrity. If possible, consider including checksums, parity codes or hashes for each data file in a separate file, e.g., md5-sums.txt, imageset01.par2 or sha512-hashes.txt to facilitate content verification. These will allow users to verify that the data has not been corrupted during the deposition or download process. Each of these ways to verify file integrity has corresponding tools available for all operating systems, but their operation is beyond the scope of this article (Chi & Zhu, 2017).
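Although the operation of dedicated tools is out of scope, a hash manifest of the kind mentioned in recommendation 10 can be produced with the Python standard library alone. The sketch below is one possible approach: the function names are hypothetical, the manifest name sha512-hashes.txt follows the example above, and the line layout (hash, two spaces, relative path) is an assumption chosen to resemble the output of common checksum tools.

```python
import hashlib
from pathlib import Path


def sha512_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so that large images need not fit in memory."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(root, manifest_name="sha512-hashes.txt"):
    """Write '<hash>  <relative path>' for every file below root."""
    root = Path(root)
    manifest = root / manifest_name
    files = sorted(p for p in root.rglob("*") if p.is_file() and p != manifest)
    lines = [f"{sha512_of(p)}  {p.relative_to(root).as_posix()}" for p in files]
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest


def verify_manifest(root, manifest_name="sha512-hashes.txt"):
    """Return the relative paths whose current hash differs from the manifest."""
    root = Path(root)
    failed = []
    for line in (root / manifest_name).read_text(encoding="utf-8").splitlines():
        if not line:
            continue  # tolerate empty lines
        expected, relative = line.split("  ", 1)
        if sha512_of(root / relative) != expected:
            failed.append(relative)
    return failed
```

Placing the manifest in the containing folder (recommendation 2) lets a user run `verify_manifest` once after download; an empty list means that every listed file matches its recorded hash.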

Applying the recommendations above, we may revise the path:

data/A U Thör et al - A very long relevant title that has most of the keywords in your paper/A Folder with an overall description/0923480928 - Treatement Tr1-323 Tissue/0923480928_Treatement_Tr1323_Organelle1-topology1.zip’ is ‘0923480928 Treatement Tr1-323 Segmentation/0923480928_Treatement_Tr1323_Organelle1-topology1.zip

to:

data/brief_description/treatment3_tissue/segmentation/organelle1_topology1.zip,

a reduction from 328 to 79 characters for the full path. The new organisation is presented in Figure 4.

Figure 4. Tree representation of the data from Figure 3 reorganised by applying some of the 10 recommendations proposed.

Conclusion

We hope that these 10 recommendations will only be the beginning of a broader discussion on how to organise bioimaging data in particular and experimental data in general for maximum usefulness, not just to the bioimaging community, but to the wider scientific community. Given the breadth of applications of bioimaging techniques, good organisation would go a long way to helping scientists from other disciplines to benefit from using bioimaging data. There is still considerable scope to develop better ways of not only organising data, but also representing it to enable automated data analysis.

Data availability

No data are associated with this article.

Software availability

Software available from: https://pypi.org/project/bandbox

Source code available from: https://github.com/emdb-empiar/bandbox

Archived source code at time of publication: https://doi.org/10.5281/zenodo.7807541 (Korir et al., 2022).

License: Apache License 2.0

Acknowledgements

The authors are grateful to Alex J. Noble and Christopher J. Peddie for helpful feedback on the manuscript. This work aligns with the recommendations of the Euro-BioImaging/ELIXIR Joint Strategy (https://elixir-europe.org/system/files/euro-bioimaging_elixir_image_data_strategy.pdf), in particular the need for standards and approaches for the organisation of image data storage in established and emerging reference image domains. We acknowledge the key roles of both ELIXIR and Euro-BioImaging in highlighting the importance of the effective organisation of biological image data.

References

Brazma A, Hingamp P, Quackenbush J, et al.: Minimum information about a microarray experiment (MIAME)—toward standards for microarray data. Nat. Genet. 2001; 29(4): 365–371.
Chi L, Zhu X: Hashing Techniques. ACM Computing Surveys (CSUR). 2017.
Datta S, Lakdawala R, Sarkar S: Understanding the Inter-Domain Presence of Research Topics in the Computing Discipline. IEEE Trans. Emerg. Top. Comput. 2021; 9(1): 366–378.
Deissenboeck F, Pizka M: Concise and consistent naming. Softw. Qual. J. 2006; 14(3): 261–282.
Ellenberg J, Swedlow JR, Barlow M, et al.: A call for public archives for biological image data. Nat. Methods. 2018; 15(11): 849–854.
Hartley M, Kleywegt GJ, Patwardhan A, et al.: The BioImage Archive - Building a Home for Life-Sciences Microscopy Data. J. Mol. Biol. 2022; 434: 167505.
Iudin A, Korir PK, Salavert-Torres J, et al.: EMPIAR: a public archive for raw electron microscopy image data. Nat. Methods. 2016; 13(5): 387–388.
Iudin A, Korir PK, Somasundharam S, et al.: EMPIAR: the Electron Microscopy Public Image Archive. Nucleic Acids Res. 2023; 51: D1503–D1511.
Katz WT, Plaza SM: DVID: Distributed Versioned Image-Oriented Dataservice. Front. Neural Circuits. 2019; 13.
Korir PK, Iudin A, Somasundharam S, et al.: bandbox (v0.2.1). Zenodo. 2022.
Li X, Mooney P, Zheng S, et al.: Electron counting and beam-induced motion correction enable near-atomic-resolution single-particle cryo-EM. Nat. Methods. 2013; 10(6): 584–590.
Mastronarde D: Tomographic Reconstruction with the IMOD Software Package. Microsc. Microanal. 2006; 12(S02): 178–179.
Moore J, Allan C, Besson S, et al.: OME-NGFF: a next-generation file format for expanding bioimaging data-access strategies. Nat. Methods. 2021; 18(12): 1496–1498.
Pietzsch T, Saalfeld S, Preibisch S, et al.: BigDataViewer: visualization and processing for large image data sets. Nat. Methods. 2015; 12(6): 481–483.
Punjani A, Rubinstein JL, Fleet DJ, et al.: cryoSPARC: algorithms for rapid unsupervised cryo-EM structure determination. Nat. Methods. 2017; 14(3): 290–296.
Sarkans U, Chiu W, Collinson L, et al.: REMBI: Recommended Metadata for Biological Images—enabling reuse of microscopy data in biology. Nat. Methods. 2021; 18(12): 1418–1422.
Sarkans U, Gostev M, Athar A, et al.: The BioStudies database-one stop shop for all data supporting a life sciences study. Nucleic Acids Res. 2018; 46(D1): D1266–D1270.
Scheres SHW: A Bayesian View on Cryo-EM Structure Determination. J. Mol. Biol. 2012; 415(2): 406–418.
Tang G, Peng L, Baldwin PR, et al.: EMAN2: An extensible image processing suite for electron microscopy. J. Struct. Biol. 2007; 157(1): 38–46.
Wilkinson MD, Dumontier M, Aalbersberg IJJ, et al.: The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data. 2016; 3: 160018.
Zhang K: Gctf: Real-time CTF determination and correction. J. Struct. Biol. 2016; 193(1): 1–12.


Competing interests

No competing interests were disclosed.

Grant information

This work was supported by UKRI-MRC and UKRI-BBSRC (grants MR/L007835/1 and MR/P019544/1), the Wellcome Trust (grant 221371/Z/20/Z), and EMBL through contributions from its member states.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Article Versions (2)

version 2 (revised). Published: 27 Feb 2024, 12:1391. https://doi.org/10.12688/f1000research.129720.2

version 1. Published: 23 Oct 2023, 12:1391. https://doi.org/10.12688/f1000research.129720.1

© 2023 Korir PK et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


how to cite this article

Korir PK, Iudin A, Somasundharam S et al. Ten recommendations for organising bioimaging data for archival [version 1; peer review: 4 approved with reservations]. F1000Research 2023, 12(ELIXIR):1391 (https://doi.org/10.12688/f1000research.129720.1)

NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review


Reviewer Report 21 Nov 2023

William T. Katz, Howard Hughes Medical Institute’s Janelia Research Campus, Ashburn, USA

Virginia Scarlett, Howard Hughes Medical Institute’s Janelia Research Campus, Ashburn, Virginia, USA

Approved with Reservations

https://doi.org/10.5256/f1000research.142422.r217552

In this opinion article, the authors tackle an important but often-overlooked aspect of biomedical data archives: how best to organize data folders and files to maximize ease of use. Ten recommendations are provided as well as a lightweight command-line tool for inspecting datasets. Since there continues to be an acceleration in both the number and size of these datasets accessible through various repositories, both the recommendations and tool from experienced archivists are useful and should be published, though we feel some revision of the document is warranted.

The introduction describes the broader context of bioimaging data management before focusing on the contributions of the article. There could be clearer differentiation of efforts to standardize bioimaging metadata (REMBI, QUAREP-LiMi), file formats and associated libraries (OME-TIFF, OME-NGFF, Zarr, n5), and local or cloud-based services that provide Data APIs (DVID, BossDB) with some level of abstraction in how data is actually stored. The recommendations and tool mainly apply to file-based solutions though some of the recommendations, such as naming, would be applicable to other forms of big data repositories. We suggest that the authors clarify the scope of their contributions.

It should be noted that some of the efforts to standardize data and its distribution also have recommendations for organization of data. For example, OME-NGFF requires segmentation to be in a directory called “labels/”.

In the third paragraph of Motivation, the terms “ways and means” and “organisational resources” are unclear though some of your examples (folder hierarchy, file formats, identifiers) show how data can be organized. We suggest you start with some examples and then introduce “organizational resources” as a term.

If standardization is not an aim, can bandbox be configured to remove warnings not agreed upon by a user? In Figure 2, the printing of the word “warning” for datasets with no red flags seems odd. We would suggest using “check” as in “name check” or “structure check” if no warnings exist.

Given recommendation (8)b and the article’s bioimaging focus, the bandbox tool should work by default with well-known, large-scale formats like Zarr and N5. In testing, it appears that bandbox doesn’t recognize file extensions used by such formats like .json and .zarr. The configurability of bandbox is a nice feature and should be mentioned in the article. This would allow other tool builders to contribute configurations for validating common formats and it seems like the regex capability could allow folder hierarchy requirements.

The command-line bandbox tool should limit warning output to some maximum number of lines by default. This is particularly true for massive, chunked datasets consisting of many files and folders. We would suggest adding a “verbose” flag to allow full results to be output perhaps to a file.

Some minor points:

The description of the bandbox tool could be moved out of the Motivation section and after listing the recommendations.

Figure 1 has too small font sizes and would not be readable for printed copies as well as expending quite a bit of black ink.

The phenomenon described in the first sub-bullet under 'Verbosity' is an interesting point that seems to deserve its own name. Maybe something like, 'redundant nesting' or 'over-nesting'. A name would also make it easier to connect to solution 3A, which is conceptually related. Also, the second half of this bullet point is in monospace font, but it should be Times New Roman or whatever.

Some recommendations are more universally advisable than others. For those points, we’d recommend dropping “Consider” for stronger language.

In (4)b, “Most archives allow multiple separate entries to be linked or grouped,” it’s not clear what qualifies as an archive since data could be made available through cloud providers’ object stores and other facilities.

In (5)b, could you clarify in what ways dates and times are “ambiguous attributes”?

In the sentence, “bandbox examines the tree associate with the nested hierarchy…” the word bandbox should be in monospace font.

What is the rationale for limiting folder depth to 3 or 4 levels?

In (7)b, “Do not include personal identifiers in folder names.” Personal identifiers should be clarified.

For (7)e, zero-padding should be considered for any sequentially ordered set of files. A good case is 2D slices of a 3D volume as described.

For (8)b, consider citing OME-NGFF and OME-TIFF as recommended community formats.

For (9)a, the recommendation for an overview could explicitly suggest listing the facets used to organize the data.

In Figure 4, is the single “brief_description” folder at that level recommended instead of adding the descriptive information to a README file? Perhaps a real description should be used in the example to make it clear why recommendation (3)b doesn’t apply.

Is the topic of the opinion article discussed accurately in the context of the current literature? Yes

Are all factual statements correct and adequately supported by citations? Partly

Are arguments sufficiently supported by evidence from the published literature? Partly

Are the conclusions drawn balanced and justified on the basis of the presented arguments? Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Data engineering; biomedical image processing and analysis.

We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above.


Author Response 22 Mar 2024

Gerard J Kleywegt, EMBL-EBI, Wellcome Genome Campus, Hinxton, CB10 1SD, UK

Response to Reviewer #4 (in italics)

Specific comments:

In this opinion article, the authors tackle an important but often-overlooked aspect of biomedical data archives: how best to organize data folders and files to maximize ease of use. Ten recommendations are provided as well as a lightweight command-line tool for inspecting datasets. Since there continues to be an acceleration in both the number and size of these datasets accessible through various repositories, both the recommendations and tool from experienced archivists are useful and should be published, though we feel some revision of the document is warranted.

The introduction describes the broader context of bioimaging data management before focusing on the contributions of the article. There could be clearer differentiation of efforts to standardize bioimaging metadata (REMBI, QUAREP-LiMi), file formats and associated libraries (OME-TIFF, OME-NGFF, Zarr, n5), and local or cloud-based services that provide Data APIs (DVID, BossDB) with some level of abstraction in how data is actually stored. The recommendations and tool mainly apply to file-based solutions though some of the recommendations, such as naming, would be applicable to other forms of big data repositories. We suggest that the authors clarify the scope of their contributions.

It should be noted that some of the efforts to standardize data and its distribution also have recommendations for organization of data. For example, OME-NGFF requires segmentation to be in a directory called “labels/”.

In the third paragraph of Motivation, the terms “ways and means” and “organisational resources” are unclear though some of your examples (folder hierarchy, file formats, identifiers) show how data can be organized. We suggest you start with some examples and then introduce “organizational resources” as a term.

We accept this correction and have updated the text to better reflect this point.

If standardization is not an aim, can bandbox be configured to remove warnings not agreed upon by a user? In Figure 2, the printing of the word “warning” for datasets with no red flags seems odd. We would suggest using “check” as in “name check” or “structure check” if no warnings exist.

We have released an updated version (bandbox v0.2.2) where these have been amended.

Given recommendation (8)b and the article’s bioimaging focus, the bandbox tool should work by default with well-known, large-scale formats like Zarr and N5. In testing, it appears that bandbox doesn’t recognize file extensions used by such formats like .json and .zarr. The configurability of bandbox is a nice feature and should be mentioned in the article. This would allow other tool builders to contribute configurations for validating common formats and it seems like the regex capability could allow folder hierarchy requirements.

We have clarified in the text that bandbox is configurable.

The command-line bandbox tool should limit warning output to some maximum number of lines by default. This is particularly true for massive, chunked datasets consisting of many files and folders. We would suggest adding a “verbose” flag to allow full results to be output perhaps to a file.

This has been updated in bandbox v0.2.2. Instead of printing all results by default, we have substituted the -S/--summarise flag with a -a/--all flag so that by default users don’t get overwhelmed. The instruction to use the new flag is now highlighted in yellow text beneath each section with more than a certain number of results.

Some minor points:

The description of the bandbox tool could be moved out of the Motivation section and after listing the recommendations.

We have now included a detailed description of bandbox in the Software Availability section.

Figure 1 has too small font sizes and would not be readable for printed copies as well as expending quite a bit of black ink.

We accept the suggestion and have changed all images to have a light background.

The phenomenon described in the first sub-bullet under 'Verbosity' is an interesting point that seems to deserve its own name. Maybe something like, 'redundant nesting' or 'over-nesting'. A name would also make it easier to connect to solution 3A, which is conceptually related. Also, the second half of this bullet point is in monospace font, but it should be Times New Roman or whatever.

We have given the section the name ‘Verbosity/Redundancy’.

The use of monospace font here is intentional to distinguish between literal text and computer text (file/folder names, commands, tools).

Some recommendations are more universally advisable than others. For those points, we’d recommend dropping “Consider” for stronger language.

In (4)b, “Most archives allow multiple separate entries to be linked or grouped,” it’s not clear what qualifies as an archive since data could be made available through cloud providers’ object stores and other facilities.

We have provided a definition of ‘archive’ in the opening paragraph of the article.

In (5)b, could you clarify in what ways dates and times are “ambiguous attributes”?

We have provided an explanation on this in the text.

Dates and times are ambiguous to the extent that they do not provide meaningful attributes associated with the experiment. While it can be assumed that dates on file names refer to the date of collection, this is not instrumental to the actual data i.e. knowing the date of collection adds no scientific value. Furthermore, having every single image file with the same date consumes precious ‘naming space’ of files, which can either be provided once in the name of the parent folder or as part of the metadata, where it would be expected to convey useful information to users.

In the sentence, “bandbox examines the tree associate with the nested hierarchy…” the word bandbox should be in monospace font.

This has been corrected in the text.

What is the rationale for limiting folder depth to 3 or 4 levels?

We have argued this point based on the ISA framework.

In (7)b, “Do not include personal identifiers in folder names.” Personal identifiers should be clarified.

We accept this point and have provided some examples of what is meant by ‘personal identifiers’.

For (7)e, zero-padding should be considered for any sequentially ordered set of files. A good case is 2D slices of a 3D volume as described.

We have included ‘sequential ordering’ as another example of this phenomenon.

For (8)b, consider citing OME-NGFF and OME-TIFF as recommended community formats.

We accept the suggestion and have amended the text as requested.

For (9)a, the recommendation for an overview could explicitly suggest listing the facets used to organize the data.

We have included a sentence outlining what may be included in the README file.

In Figure 4, is the single “brief_description” folder at that level recommended instead of adding the descriptive information to a README file? Perhaps a real description should be used in the example to make it clear why recommendation (3)b doesn’t apply.

This term is purely illustrative as are the names of the files and folders.

Competing Interests: No competing interests were disclosed.

COMMENTS ON THIS REPORT

Author Response 22 Mar 2024

Gerard J Kleywegt, EMBL-EBI, Wellcome Genome Campus, Hinxton, CB10 1SD, UK

22 Mar 2024

Author Response

Response to Reviewer #4 (in italics)

Specific comments:

In this opinion article, the authors tackle an important but often-overlooked aspect of biomedical data archives: how best to organize ... Continue reading Response to Reviewer #4 (in italics)

Specific comments:

In this opinion article, the authors tackle an important but often-overlooked aspect of biomedical data archives: how best to organize data folders and files to maximize ease of use. Ten recommendations are provided as well as a lightweight command-line tool for inspecting datasets. Since there continues to be an acceleration in both the number and size of these datasets accessible through various repositories, both the recommendations and tool from experienced archivists are useful and should be published, though we feel some revision of the document is warranted.

The introduction describes the broader context of bioimaging data management before focusing on the contributions of the article. There could be clearer differentiation of efforts to standardize bioimaging metadata (REMBI, QUAREP-LiMi), file formats and associated libraries (OME-TIFF, OME-NGFF, Zarr, n5), and local or cloud-based services that provide Data APIs (DVID, BossDB) with some level of abstraction in how data is actually stored. The recommendations and tool mainly apply to file-based solutions though some of the recommendations, such as naming, would be applicable to other forms of big data repositories. We suggest that the authors clarify the scope of their contributions.

It should be noted that some of the efforts to standardize data and its distribution also have recommendations for organization of data. For example, OME-NGFF requires segmentation to be in a directory called “labels/”.

In the third paragraph of Motivation, the terms “ways and means” and “organisational resources” are unclear though some of your examples (folder hierarchy, file formats, identifiers) show how data can be organized. We suggest you start with some examples and then introduce “organizational resources” as a term.

We accept this correction and have updated the text to better reflect this point.

If standardization is not an aim, can bandbox be configured to remove warnings not agreed upon by a user? In Figure 2, the printing of the word “warning” for datasets with no red flags seems odd. We would suggest using “check” as in “name check” or “structure check” if no warnings exist.

We have released an updated version (bandbox v0.2.2) where these have been amended.

Given recommendation (8)b and the article’s bioimaging focus, the bandbox tool should work by default with well-known, large-scale formats like Zarr and N5. In testing, it appears that bandbox doesn’t recognized file extensions used by such formats like .json and .zarr. The configurability of bandbox is a nice feature and should be mentioned in the article. This would allow other tool builders to contribute configurations for validating common formats and it seems like the regex capability could allow folder hierarchy requirements.

We have clarified in the text that bandbox is configurable.

The command-line bandbox tool should limit warning output to some maximum number of lines by default. This is particularly true for massive, chunked datasets consisting of many files and folders. We would suggest adding a “verbose” flag to allow full results to be output perhaps to a file.

This has been updated in bandbox v0.2.2. Instead of printing all results by default, we have substituted the -S/--summarise flag with a -a/--all flag so that by default users don’t get overwhelmed. The instruction to use the new flag is now highlighted in yellow text beneath each section with more than a certain number of results.

Some minor points:

The description of the bandbox tool could be moved out of the Motivation section and after listing the recommendations.

We have now included a detailed description of bandbox in the Software Availability section.

Figure 1 has too small font sizes and would not be readable for printed copies as well as expending quite a bit of black ink.

We accept the suggestion and have changed all images to have a light background.

The phenomenon described in the first sub-bullet under 'Verbosity' is an interesting point that seems to deserve its own name. Maybe something like, 'redundant nesting' or 'over-nesting'. A name would also make it easier to connect to solution 3A, which is conceptually related. Also, the second half of this bullet point is in monospace font, but it should be Times New Roman or whatever.

We have given the section the name ‘Verbosity/Redundancy’.

The use of monospace font here is intentional to distinguish between literal text and computer text (file/folder names, commands, tools).

Some recommendations are more universally advisable than others. For those points, we’d recommend dropping “Consider” for stronger language.

In (4)b, “Most archives allow multiple separate entries to be linked or grouped,” it’s not clear what qualifies as an archive since data could be made available through cloud providers’ object stores and other facilities.

We have provided a definition of ‘archive’ in the opening paragraph of the article.

In (5)b, could you clarify in what ways dates and times are “ambiguous attributes”?

We have provided an explanation on this in the text.

Dates and times are ambiguous to the extent that they do not provide meaningful attributes associated with the experiment. While it can be assumed that dates in file names refer to the date of collection, this is not instrumental to the actual data, i.e. knowing the date of collection adds no scientific value. Furthermore, repeating the same date in the name of every single image file consumes precious ‘naming space’; the date can instead be provided once, either in the name of the parent folder or as part of the metadata, where it would be expected to convey useful information to users.
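
As an illustration of the point about repeated dates, the following minimal sketch checks whether every file name in a folder carries the same date token and, if so, shows how the date could be stated once in the parent folder name instead. The date pattern, the function name `shared_date` and the example file names are assumptions made for illustration; this is not how bandbox works.

```python
import re

# Assumed convention: ISO-like dates such as 2023-10-23 or 20231023.
DATE_RE = re.compile(r"(?<!\d)(\d{4})-?(\d{2})-?(\d{2})(?!\d)")


def shared_date(filenames):
    """Return the date token if every file name carries the same one, else None."""
    tokens = set()
    for name in filenames:
        match = DATE_RE.search(name)
        if match is None:
            return None  # at least one name has no date at all
        tokens.add(match.group(0))
    return tokens.pop() if len(tokens) == 1 else None


# Hypothetical file names that all repeat the same collection date.
names = ["2023-10-23_cell01.tif", "2023-10-23_cell02.tif", "2023-10-23_cell03.tif"]
date = shared_date(names)
if date:
    # Move the date to the parent folder and drop it from each file name.
    stripped = [n.replace(date, "").lstrip("_-") for n in names]
    print(f"{date}/", stripped)
```

The sketch only reports a date when it is identical across all names, since differing dates may carry information worth keeping at file level.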

In the sentence, “bandbox examines the tree associate with the nested hierarchy…” the word bandbox should be in monospace font.

This has been corrected in the text.

What is the rationale for limiting folder depth to 3 or 4 levels?

We have argued this point based on the ISA framework.
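
For readers who want to check their own data against such a limit, a minimal sketch of measuring folder depth is given below. The function name `max_folder_depth`, the example folder names and the threshold of 4 are assumptions for illustration only and do not reproduce the bandbox implementation.

```python
import os
import tempfile


def max_folder_depth(root):
    """Return the deepest level of nested folders below `root` (0 if none)."""
    root = os.path.abspath(root)
    deepest = 0
    for dirpath, _dirnames, _filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else len(rel.split(os.sep))
        deepest = max(deepest, depth)
    return deepest


# Build a throwaway tree with hypothetical folder names, five levels deep.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "investigation", "study", "assay", "raw", "extra"))
    depth = max_folder_depth(tmp)
    print(depth, "exceeds limit" if depth > 4 else "ok")
```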

In (7)b, “Do not include personal identifiers in folder names.” Personal identifiers should be clarified.

We accept this point and have provided some examples of what is meant by ‘personal identifiers’.

For (7)e, zero-padding should be considered for any sequentially ordered set of files. A good case is 2D slices of a 3D volume as described.

We have included ‘sequential ordering’ as another example of this phenomenon.
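
The effect of zero-padding on sequential ordering can be sketched as follows. The prefix `slice_`, the `.tif` suffix and the function name `zero_pad_names` are hypothetical and chosen only for illustration.

```python
def zero_pad_names(count, prefix="slice_", suffix=".tif"):
    """Return `count` file names whose index is padded to a common width."""
    width = len(str(count - 1))  # digits needed for the largest index
    return [f"{prefix}{i:0{width}d}{suffix}" for i in range(count)]


unpadded = [f"slice_{i}.tif" for i in range(12)]
padded = zero_pad_names(12)

# Plain string sorting misorders unpadded names but keeps padded ones in sequence.
print(sorted(unpadded)[:4])  # ['slice_0.tif', 'slice_1.tif', 'slice_10.tif', 'slice_11.tif']
print(sorted(padded) == padded)  # True
```

Most file browsers and shell tools list names in string order, which is why padded indices keep 2D slices of a 3D volume in their intended sequence.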

For (8)b, consider citing OME-NGFF and OME-TIFF as recommended community formats.

We accept the suggestion and have amended the text as requested.

For (9)a, the recommendation for an overview could explicitly suggest listing the facets used to organize the data.

We have included a sentence outlining what may be included in the README file.

In Figure 4, is the single “brief_description” folder at that level recommended instead of adding the descriptive information to a README file? Perhaps a real description should be used in the example to make it clear why recommendation (3)b doesn’t apply.

This term is purely illustrative as are the names of the files and folders.

Competing Interests: No competing interests were disclosed.

Reviewer Report 21 Nov 2023

Sylvia Emmanuelle Le Dévédec, Division of Drug Discovery and Safety, Leiden Academic Centre of Drug Research, Universiteit Leiden, Leiden, South Holland, The Netherlands

Approved with Reservations

https://doi.org/10.5256/f1000research.142422.r217556

The guideline presented by Korir and colleagues, who are recognized experts in data structure, data organization, and FAIRification, marks an important step toward fostering a comprehensive discussion on the management of bioimaging data. The authors, primarily developers and bioinformaticians dealing with intricate datasets, have assembled a set of recommendations that, while valuable, may be perceived as overly abstract, potentially posing challenges for experimentalists who serve as the primary data producers.

One critical aspect that emerges is the need for greater clarity regarding the intended audience for this guideline. Currently, it appears somewhat ambiguous, leading to potential misalignment with the individuals it should be primarily targeting. It is recommended that the authors explicitly define the community they aim to address at the outset of the manuscript. If, indeed, the target audience is the data producers, particularly experimentalists, then a comprehensive revision of the recommendations may be necessary. Consideration should be given to conveying the guidelines in a more accessible language, ensuring that the practical implications for experimentalists are clearly delineated. Additionally, the authors might explore the possibility of tailoring specific sets of guidelines for distinct roles, such as data managers and data producers, to enhance relevance and utility.

Below are listed some specific points of attention:

Abstract:

The abstract lacks clarity on the intended audience of these recommendations. It is essential to specify whether the guidelines primarily target core facility managers, data managers/stewards, bioinformaticians, or experimentalists.
If the guidelines are intended for experimentalists, the current manuscript may not align with the needs of this non-expert audience. The language and content may need to be adapted to cater to individuals with limited knowledge in data management.
The phrase "make future data depositions more useful" needs clarification. Who benefits from this increased usefulness, and in what way? Is the goal to enhance practicality, efficiency, or accessibility? A more specific explanation would enhance the abstract's clarity.
The term "bioimaging community" is used in the abstract, but its specific meaning in the context of this manuscript is unclear. Defining this community will provide readers with a better understanding of the scope and relevance of the guidelines.
The abstract mentions that Bandbox is designed "to facilitate the process of analyzing data organization." It would be beneficial to elaborate on how the analyzing functionality of Bandbox directly benefits the bioimaging community. Specific examples or scenarios demonstrating its advantages would enhance the abstract's informativeness.

Introduction

What does data ‘archiving’ mean exactly in this specific manuscript?
Objective of the guideline: Harmonising how to organise datasets for maximum usefulness with archival in mind?
Where should this organisation occur in the data life cycle: before, during or after generation? And in which physical storage space should it occur?
Organisation = order of the data. Organisation or order of the data should be implicitly connected to the related metadata and even contained somewhere in the metadata.
‘Good organisation (order) of data improves its usefulness and is the responsibility of the data depositors.’ Do you mean here the data generator or specifically the data depositor? Based on the description it seems like the data depositor is implicitly the data generator.
‘Users can immediately distinguish the various experimental categories’: should you not refer to (p)ISA to clarify what is meant by ‘experimental categories’ (https://doi.org/10.1038/s41597-022-01805-5)?
‘Facet refers to the various attributes germane to the experiment which may be included in the folder and file names’. Should ‘facet’ not be called ‘key’? If not then explain the differences between both terms.

Recommendations:

The potential users for these recommendations lack clear definition, and depending on the proposed users, the guide should be tailored for optimal understanding. Data depositors and generators often have different levels of familiarity compared to program developers or data stewards, employing distinct languages. Addressing these differences is crucial for ensuring accessibility and effectiveness.
Open-source command-line interfaces can be intimidating, particularly for experimentalists who serve as the primary data generators and often act as data depositors. As a cell biologist and experimentalist, I find the proposed CLI tool, while impressive and useful, potentially challenging to navigate comfortably. Enhancements in user-friendliness or alternative interfaces might significantly benefit experimentalists who are integral to both data generation and deposition.
Given the recommendation for data producers to pre-define structures before data collection, it becomes apparent that the target audience of this guide is experimentalists with limited knowledge of data management and programming. Including guidelines or tips on naming conventions would be particularly valuable for such users, enhancing the practicality and applicability of the recommendations.
The suggestion regarding folder contents description appears somewhat vague and may not be universally suitable for various experiment types. A more nuanced approach that considers the diversity of experiments would enhance the guide's usability.
The concept of "meaningful names" for folders raises questions about subjectivity and human sensitivity, which may not align with the precision required for effective data management structures. Establishing a clear naming convention that is objectively applicable across various contexts would contribute to the robustness and reliability of the guide.

Is the topic of the opinion article discussed accurately in the context of the current literature?
Partly

Are all factual statements correct and adequately supported by citations?
Yes

Are arguments sufficiently supported by evidence from the published literature?
Partly

Are the conclusions drawn balanced and justified on the basis of the presented arguments?
Yes

References

1. Petek M, Zagorščak M, Blejec A, Ramšak Ž, et al.: pISA-tree - a data management framework for life science research projects using a standardised directory tree. Scientific Data. 2022; 9 (1). https://doi.org/10.1038/s41597-022-01805-5

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Biology; image-based phenotypic profiling; microscopist; data generator; core facility management; FAIR metadata

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.


Author Response 22 Mar 2024

Gerard J Kleywegt, EMBL-EBI, Wellcome Genome Campus, Hinxton, CB10 1SD, UK

Response to Reviewer #3 (in italics)

Specific comments:

The guideline presented by Korir and colleagues, who are recognized experts in data structure, data organization, and FAIRification, marks an important step toward fostering a comprehensive discussion on the management of bioimaging data. The authors, primarily developers and bioinformaticians dealing with intricate datasets, have assembled a set of recommendations that, while valuable, may be perceived as overly abstract, potentially posing challenges for experimentalists who serve as the primary data producers.

One critical aspect that emerges is the need for greater clarity regarding the intended audience for this guideline. Currently, it appears somewhat ambiguous, leading to potential misalignment with the individuals it should be primarily targeting. It is recommended that the authors explicitly define the community they aim to address at the outset of the manuscript. If, indeed, the target audience is the data producers, particularly experimentalists, then a comprehensive revision of the recommendations may be necessary. Consideration should be given to conveying the guidelines in a more accessible language, ensuring that the practical implications for experimentalists are clearly delineated. Additionally, the authors might explore the possibility of tailoring specific sets of guidelines for distinct roles, such as data managers and data producers, to enhance relevance and utility.

Below are listed some specific points of attention:

Abstract:

The abstract lacks clarity on the intended audience of these recommendations. It is essential to specify whether the guidelines primarily target core facility managers, data managers/stewards, bioinformaticians, or experimentalists.

We welcome the suggestion to clarify the intended audience and have updated the abstract to clarify this.

If the guidelines are intended for experimentalists, the current manuscript may not align with the needs of this non-expert audience. The language and content may need to be adapted to cater to individuals with limited knowledge in data management.

We aim to address a wide and varied audience, so the language and terminology need to strike a balance for the content to be accessible and digestible by different groups. We hope we have managed a reasonable balance, especially following the many constructive suggestions of all the reviewers. If this reviewer has specific comments on sections in the revised manuscript that could be improved further in this respect, we would be happy to attempt to do so.

The phrase "make future data depositions more useful" needs clarification. Who benefits from this increased usefulness, and in what way? Is the goal to enhance practicality, efficiency, or accessibility? A more specific explanation would enhance the abstract's clarity.

We accept the correction and have spelled out in more precise terms what ‘more useful’ means and to whom this applies.

The term "bioimaging community" is used in the abstract, but its specific meaning in the context of this manuscript is unclear. Defining this community will provide readers with a better understanding of the scope and relevance of the guidelines.

We accept the correction and have amended the text to reflect this.

The abstract mentions that Bandbox is designed "to facilitate the process of analyzing data organization." It would be beneficial to elaborate on how the analyzing functionality of Bandbox directly benefits the bioimaging community. Specific examples or scenarios demonstrating its advantages would enhance the abstract's informativeness.

We accept the correction and have included, in the text, some examples of what bandbox is capable of doing.

Introduction

What does data ‘archiving’ mean exactly in this specific manuscript?

We have included a definition of ‘archiving’ in the opening paragraph of the article.

Objective of the guideline: Harmonising how to organise datasets for maximum usefulness with archival in mind?

Yes.

Where should this organisation occur in the data life cycle: before, during or after generation? And in which physical storage space should it occur?

The earlier the better. Recommendation #1 (Design before data collection) highlights the impact of data planning before collection commences. The remaining recommendations outline various suggestions on how to improve the usability of the data. The organisation typically would happen on the storage device but can be done either through consoles or the appropriate graphical user interfaces.

Organisation = order of the data. Organisation or order of the data should be implicitly connected to the related metadata and even contained somewhere in the metadata.

We have edited the text for clarity.

‘Good organisation (order) of data improves its usefulness and is the responsibility of the data depositors.’ Do you mean here the data generator or specifically the data depositor? Based on the description it seems like the data depositor is implicitly the data generator.

‘Data depositors’ here refers to the individual(s) responsible for making the submission to the archive (previously defined), and this may or may not be the generator of the data. In many cases, the depositor is familiar with the data because they performed the analyses, which implies familiarity with handling the data.

‘Users can immediately distinguish the various experimental categories’: should you not refer to (p)ISA to clarify what is meant by ‘experimental categories’ (https://doi.org/10.1038/s41597-022-01805-5)?

We are grateful to the reviewer for pointing out this reference which is now referred to in the text.

‘Facet refers to the various attributes germane to the experiment which may be included in the folder and file names’. Should ‘facet’ not be called ‘key’? If not then explain the differences between both terms.

We used the term ‘facet’ in the same sense as in multifaceted, implying that a dataset may be viewed from various perspectives to discern distinct properties much in the same way as a gem. The reviewer’s proposal of ‘key’ does not fit this sense.

Recommendations:

The potential users for these recommendations lack clear definition, and depending on the proposed users, the guide should be tailored for optimal understanding. Data depositors and generators often have different levels of familiarity compared to program developers or data stewards, employing distinct languages. Addressing these differences is crucial for ensuring accessibility and effectiveness.

We have addressed the specificity of the audience in the amendments to the abstract (above).

Open-source command-line interfaces can be intimidating, particularly for experimentalists who serve as the primary data generators and often act as data depositors. As a cell biologist and experimentalist, I find the proposed CLI tool, while impressive and useful, potentially challenging to navigate comfortably. Enhancements in user-friendliness or alternative interfaces might significantly benefit experimentalists who are integral to both data generation and deposition.

We accept this comment and are only constrained by our capacity to extend the CLI tool to achieve the desired usability.

Given the recommendation for data producers to pre-define structures before data collection, it becomes apparent that the target audience of this guide is experimentalists with limited knowledge of data management and programming. Including guidelines or tips on naming conventions would be particularly valuable for such users, enhancing the practicality and applicability of the recommendations.

Recommendations 5, 6 and 7 go into considerable detail about what names to choose, which symbols to use in names and matters relating to identity. We are willing to revise any of the provided recommendations which remain unclear.
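
To make the idea of checking symbols in names concrete, here is a minimal sketch that flags file or folder names containing characters outside a restricted set. The choice of letters, digits, underscore, hyphen and dot as the ‘safe’ set, the function name `flag_names` and the example names are all assumptions for illustration; they are not a restatement of the recommendations or of the bandbox checks.

```python
import re

# Assumed "safe" alphabet: letters, digits, underscore, hyphen and dot.
SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")


def flag_names(names):
    """Return the names containing characters outside the assumed safe set."""
    return [name for name in names if not SAFE_NAME.fullmatch(name)]


# Hypothetical names: the last two contain spaces, brackets or '#'.
print(flag_names(["tomogram_001.mrc", "final version (2).tif", "grid#3.tif"]))
# ['final version (2).tif', 'grid#3.tif']
```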

The suggestion regarding folder contents description appears somewhat vague and may not be universally suitable for various experiment types. A more nuanced approach that considers the diversity of experiments would enhance the guide's usability.

We appreciate that the authorship of this article does not represent the universe of experimental methods in imaging. We do point out various facets that may be relevant but leave it up to depositors (generators) who are in the best position to judge which to use when structuring/naming folders. We also point out in the abstract that we offer these recommendations to start discussions in various data-rich communities.

The concept of "meaningful names" for folders raises questions about subjectivity and human sensitivity, which may not align with the precision required for effective data management structures. Establishing a clear naming convention that is objectively applicable across various contexts would contribute to the robustness and reliability of the guide.

As stated above, we do not think it necessary to specify exactly how data should be organised given the vast variety of experiments that can be carried out. We do state in the article (Motivation, paragraph 8) that “...our guide is intended to lead towards best practices rather than serve as a framework. …this guide does not aim to achieve standardisation. We believe it is more practical to have a set of best practices and leave it up to the data authors to decide how best to apply them.”
Response to Reviewer #3 (in italics)

Specific comments:

The guideline presented by Korir and colleagues, who are recognized experts in data structure, data organization, and FAIRification, marks an important step toward fostering a comprehensive discussion on the management of bioimaging data. The authors, primarily developers and bioinformaticians dealing with intricate datasets, have assembled a set of recommendations that, while valuable, may be perceived as overly abstract, potentially posing challenges for experimentalists who serve as the primary data producers.

One critical aspect that emerges is the need for greater clarity regarding the intended audience for this guideline. Currently, it appears somewhat ambiguous, leading to potential misalignment with the individuals it should be primarily targeting. It is recommended that the authors explicitly define the community they aim to address at the outset of the manuscript. If, indeed, the target audience is the data producers, particularly experimentalists, then a comprehensive revision of the recommendations may be necessary. Consideration should be given to conveying the guidelines in a more accessible language, ensuring that the practical implications for experimentalists are clearly delineated. Additionally, the authors might explore the possibility of tailoring specific sets of guidelines for distinct roles, such as data managers and data producers, to enhance relevance and utility.

Below are listed some specific points of attentions:

Abstract:

The abstract lacks clarity on the intended audience of these recommendations. It is essential to specify whether the guidelines primarily target core facility managers, data managers/stewards, bioinformaticians, or experimentalists.

We welcome the suggestion to clarify the intended audience and have updated the abstract to clarify this.

If the guidelines are intended for experimentalists, the current manuscript may not align with the needs of this non-expert audience. The language and content may need to be adapted to cater to individuals with limited knowledge in data management.

We aim to address a wide and varied audience, so the language and terminology needs to strike a balance for the content to be accessible and digestible by different groups. We hope we have managed a reasonable balance, especially following the many constructive suggestions of all the reviewers. If this reviewer has specific comments on sections in the revised manuscript that could be improved further in this respect we would be happy to attempt to do so.

The phrase "make future data depositions more useful" needs clarification. Who benefits from this increased usefulness, and in what way? Is the goal to enhance practicality, efficiency, or accessibility? A more specific explanation would enhance the abstract's clarity.

We accept the correction and have spelled out in more precise terms what ‘more useful’ means and to whom this applies.

The term "bioimaging community" is used in the abstract, but its specific meaning in the context of this manuscript is unclear. Defining this community will provide readers with a better understanding of the scope and relevance of the guidelines.

We accept the correction and have amended the text to reflect this.

The abstract mentions that Bandbox is designed "to facilitate the process of analyzing data organization." It would be beneficial to elaborate on how the analyzing functionality of Bandbox directly benefits the bioimaging community. Specific examples or scenarios demonstrating its advantages would enhance the abstract's informativeness.

We accept the correction and have included, in the text, some examples of what bandbox is capable of doing.

Introduction

What does data ‘archiving’ means exactly in this specific manuscript?

We have included a definition of ‘archiving’ in the opening paragraph of the article.

Objective of the guideline: Harmonising how to organise datasets for maximum usefulness with archival in mind?

Yes.

Where should this organisation occur in the data life cycle: before, during or after generation? And where should it take place, i.e. in which physical storage space?

The earlier the better. Recommendation #1 (Design before data collection) highlights the impact of data planning before collection commences. The remaining recommendations outline various suggestions on how to improve the usability of the data. The organisation typically would happen on the storage device but can be done either through consoles or the appropriate graphical user interfaces.

Organisation = order of the data. Organisation or order of the data should be implicitly connected to the related metadata and even contained somewhere in the metadata.

We have edited the text for clarity.

‘Good organisation (order) of data improves its usefulness and is the responsibility of the data depositors.’ Do you mean here the data generator or specifically the data depositor? Based on the description it seems like the data depositor is implicitly the data generator.

‘Data depositors’ here refers to the individual(s) responsible for making the submission to the archive (previously defined), who may or may not be the generator of the data. In many cases, the depositor is familiar with handling the data because they performed the analyses.

‘Users can immediately distinguish the various experimental categories’: should you not refer to (p)ISA to clarify what is meant by ‘experimental categories’ (https://doi.org/10.1038/s41597-022-01805-5)?

We are grateful to the reviewer for pointing out this reference which is now referred to in the text.

‘Facet refers to the various attributes germane to the experiment which may be included in the folder and file names’. Should ‘facet’ not be called ‘key’? If not then explain the differences between both terms.

We used the term ‘facet’ in the same sense as in multifaceted, implying that a dataset may be viewed from various perspectives to discern distinct properties much in the same way as a gem. The reviewer’s proposal of ‘key’ does not fit this sense.

Recommendations:

The potential users for these recommendations lack clear definition, and depending on the proposed users, the guide should be tailored for optimal understanding. Data depositors and generators often have different levels of familiarity compared to program developers or data stewards, employing distinct languages. Addressing these differences is crucial for ensuring accessibility and effectiveness.

We have addressed the specificity of the audience in the amendments to the abstract (above).

Open-source command-line interfaces can be intimidating, particularly for experimentalists who serve as the primary data generators and often act as data depositors. As a cell biologist and experimentalist, I find the proposed CLI tool, while impressive and useful, potentially challenging to navigate comfortably. Enhancements in user-friendliness or alternative interfaces might significantly benefit experimentalists who are integral to both data generation and deposition.

We accept this comment and are only constrained by our capacity to extend the CLI tool to achieve the desired usability.

Given the recommendation for data producers to pre-define structures before data collection, it becomes apparent that the target audience of this guide is experimentalists with limited knowledge of data management and programming. Including guidelines or tips on naming conventions would be particularly valuable for such users, enhancing the practicality and applicability of the recommendations.

Recommendations 5, 6 and 7 go into considerable detail about what names to choose, which symbols to use in names and matters relating to identity. We are willing to revise any of the provided recommendations which remain unclear.

The suggestion regarding folder contents description appears somewhat vague and may not be universally suitable for various experiment types. A more nuanced approach that considers the diversity of experiments would enhance the guide's usability.

We appreciate that the authorship of this article does not represent the universe of experimental methods in imaging. We do point out various facets that may be relevant but leave it up to depositors (generators) who are in the best position to judge which to use when structuring/naming folders. We also point out in the abstract that we offer these recommendations to start discussions in various data-rich communities.

The concept of "meaningful names" for folders raises questions about subjectivity and human sensitivity, which may not align with the precision required for effective data management structures. Establishing a clear naming convention that is objectively applicable across various contexts would contribute to the robustness and reliability of the guide.

As stated above, we do not think it necessary to specify exactly how data should be organised given the vast variety of experiments that can be carried out. We do state in the article (Motivation, paragraph 8) that “...our guide is intended to lead towards best practices rather than serve as a framework. …this guide does not aim to achieve standardisation. We believe it is more practical to have a set of best practices and leave it up to the data authors to decide how best to apply them.”
Competing Interests: No competing interests were disclosed.


Reviewer Report 08 Nov 2023

Kenneth H. L. Ho, Advanced Light Microscopy, The Francis Crick Institute, London, England, UK

Approved with Reservations

https://doi.org/10.5256/f1000research.142422.r217555

The article is timely as we are facing a deluge of bioimaging data with higher resolutions and automation. It is therefore an area that needs more discussion and sharing of good practices.

I mostly agree with all the recommendations given, although I feel that the authors may need to make a good argument for some recommendations. Some choices seem arbitrary and I would like to see the rationale behind them.

After reading the paper, I am a bit confused about the article’s intended audience.
Are the article’s recommendations aimed at the majority of biologists, who archive their bioimaging data mainly for the purposes of peer review and reference? Or are they aimed at curators and producers of bioimaging databases, e.g. IDR (https://idr.openmicroscopy.org/), SSBD (https://ssbd.riken.jp/database/), GDC (https://portal.gdc.cancer.gov/), etc.?

On page 4 under the heading ‘Motivation’, “We believe that recommendations outlined here maybe of value to two principal groups of users: 1) data depositors, who need to design and prepare their data to improve its usability to the community”.
Does this include most biologists? I believe that most biologists archive their data to provide a record of their studies. Do the ‘data depositors’ in the article, and its intended audience, refer to bioimaging database curators/producers rather than bench biologists?

I believe that the ten recommendations would apply equally to most biologists, even though their aim is to provide a record of their studies, and that the recommendations would help database curators to organise their bioimage data in more meaningful ways.

I would like that part to state its intended audience more clearly.

With regard to the recommendations, on page 8, ‘Naming’, (5) Meaningful names, (b): “Consider avoiding ambiguous attributes such as dates and times.”
The argument that they have “subtle variations” is not obvious to me. Is it because of variations in the date formats used in different countries? Would it be solved if the ISO 8601 (https://en.wikipedia.org/wiki/ISO_8601) date format were used? Would that be a better recommendation? If not, would the authors care to expand their argument, as dates are used frequently in filenames?
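
To illustrate the ISO 8601 question raised above, here is a minimal Python sketch. It is not taken from the article or from bandbox; the `dated_name` and `parse_day` helpers and the `embryo` file stem are hypothetical. It shows that a `YYYY-MM-DD` prefix can be parsed back unambiguously and that plain alphabetical sorting of such names coincides with chronological order.

```python
from datetime import date


def dated_name(day: date, stem: str, suffix: str = ".tif") -> str:
    """Build a hypothetical file name with an ISO 8601 calendar-date prefix."""
    return f"{day.isoformat()}_{stem}{suffix}"


def parse_day(name: str) -> date:
    """Recover the date from the first ten characters (YYYY-MM-DD)."""
    return date.fromisoformat(name[:10])


# Three made-up acquisition dates, deliberately out of order.
acquisitions = [date(2023, 11, 8), date(2023, 3, 4), date(2022, 12, 31)]
names = [dated_name(d, "embryo") for d in acquisitions]

# Plain string sorting matches chronological order for ISO 8601 dates.
chronological = [dated_name(d, "embryo") for d in sorted(acquisitions)]
print(sorted(names) == chronological)
```

By contrast, a string such as `03-04-2023` could be read as 3 April or 4 March depending on the reader's locale, which may be the kind of ambiguity the recommendation has in mind.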

On page 9, (6) Naming symbols, (a): consider confining to lowercase letters.
It seems rather arbitrary to confine names to lowercase; why would all uppercase letters not work instead?

Similarly, in (b), avoid non-ASCII characters: shouldn’t we be more inclusive of other languages that use non-ASCII characters, e.g. European characters, or double-byte Japanese, Korean and Chinese characters?

From a computer coding point of view, I intuitively understand the rationale for choosing ASCII, but the article doesn’t seem to provide a valid argument for it. May I suggest that the authors instead use the international standard POSIX, the Portable Operating System Interface (IEEE 1003, ISO/IEC 9945) (ref: https://en.wikipedia.org/wiki/POSIX; https://www.ibm.com/docs/en/zos/2.2.0?topic=locales-posix-portable-file-name-character-set)? Choosing an international standard makes more sense than creating a separate standard specifically for bioimaging data. If the authors would like to keep their recommendations, I would like to see more justification for doing so.
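
For readers unfamiliar with it, the POSIX portable filename character set consists of the ASCII letters, the digits, and the period, underscore and hyphen-minus characters, and POSIX also advises against a leading hyphen. The following is a hedged sketch of such a check; the function names are hypothetical and this is not a description of how bandbox works.

```python
import re

# POSIX portable filename character set:
# A-Z, a-z, 0-9, period, underscore and hyphen-minus.
_PORTABLE = re.compile(r"[A-Za-z0-9._-]+")


def is_portable_name(name: str) -> bool:
    """True if a single file or folder name uses only portable characters."""
    # A leading hyphen is rejected because it can be mistaken
    # for a command-line option.
    return _PORTABLE.fullmatch(name) is not None and not name.startswith("-")


def is_lowercase_portable_name(name: str) -> bool:
    """Stricter variant that additionally rejects uppercase letters."""
    return is_portable_name(name) and name == name.lower()


for candidate in ["tomogram_01.mrc", "Tomogram_01.mrc", "Übersicht.tif", "my file.tif"]:
    print(candidate, is_portable_name(candidate), is_lowercase_portable_name(candidate))
```

The stricter variant corresponds to combining the portable set with the lowercase-only suggestion discussed under (a).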

On (d), the upper limit on the length of file and folder names: the authors proposed a working upper limit of 50 characters. Again, this seems arbitrary; why not 80 characters, i.e. one line length on an old CRT terminal? The browser limit is a good reason, but I would like to see a more robust argument that a 50-character length is a good compromise.

The authors used an example of a file path limit of 320 characters in the same paragraph. I believe readers may confuse this with the filename length limit, which for most computer systems is only 255 characters (ref: https://en.wikipedia.org/wiki/Filename). Since the authors also provide recommendation (3) on folder depth and give an example of path length problems on page 7 (“Very long names of files/folders”), perhaps they could discuss these recommendations together under one section, “Filename length, path length and folder depth”. That may make it easier for the reader to appreciate the choices that the authors make.
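
The distinction between the length of a single name, the length of the whole path and the folder depth can be made concrete with a short sketch. This is an illustration under stated assumptions, not the article's or bandbox's implementation: the `length_report` function is hypothetical, the default of 50 is the working limit quoted above, and 255 is the per-name limit mentioned for most systems.

```python
from pathlib import PurePosixPath


def length_report(relative_path: str,
                  working_limit: int = 50,
                  filesystem_limit: int = 255) -> dict:
    """Summarise name lengths, path length and folder depth for one path."""
    parts = PurePosixPath(relative_path).parts
    return {
        # Number of folders above the file itself.
        "folder_depth": max(len(parts) - 1, 0),
        # Length of the whole path, separators included.
        "path_length": len(relative_path),
        # Individual names longer than the proposed working limit.
        "over_working_limit": [p for p in parts if len(p) > working_limit],
        # Individual names longer than the typical per-name limit.
        # len() counts characters; many file systems count bytes, which
        # is the same thing only for ASCII names.
        "over_filesystem_limit": [p for p in parts if len(p) > filesystem_limit],
    }


# A made-up path whose file name is 64 characters long.
report = length_report("raw/session_01/" + "x" * 60 + ".tif")
print(report["folder_depth"], report["path_length"], report["over_working_limit"])
```

A single name can therefore breach the working limit while the full path remains well within any path limit, and vice versa for deeply nested folders with short names.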

On recommendation (8), Friendly file formats: maybe “Widely used file formats” is more applicable? I would prefer “Openly accessible file formats”, i.e. formats that are readable by open-source tools. I guess widely used file formats would fit that description too and reflect more closely what the authors want to convey. Proprietary software tools for accessing proprietary file formats may cause problems in the long run, as companies often change hands, e.g. Olympus is now Evident, LaVision is now under Brunker. It is difficult to ensure that companies will keep supporting certain formats in their software tools in the future, while funding bodies (in the UK) require data to be archived for 10 to 20 years.

Is the topic of the opinion article discussed accurately in the context of the current literature?
Yes

Are all factual statements correct and adequately supported by citations?
Yes

Are arguments sufficiently supported by evidence from the published literature?
Partly

Are the conclusions drawn balanced and justified on the basis of the presented arguments?
Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Bioimage informatics


Author Response 22 Mar 2024

Gerard J Kleywegt, EMBL-EBI, Wellcome Genome Campus, Hinxton, CB10 1SD, UK


Response to Reviewer #2 (in italics)

Specific comments:

The article is timely as we are facing a deluge of bioimaging data with higher resolutions and automation. It is therefore an area that needs more discussion and sharing of good practices.

I mostly agree with all the recommendations given, although I feel that the authors may need to make a good argument for some recommendations. Some choices seem arbitrary and I would like to see the rationale behind them.

After reading the paper, I am a bit confused about the article’s intended audience.
Are the article’s recommendations aimed at the majority of biologists, who archive their bioimaging data mainly for the purposes of peer review and reference? Or are they aimed at curators and producers of bioimaging databases, e.g. IDR (https://idr.openmicroscopy.org/), SSBD (https://ssbd.riken.jp/database/), GDC (https://portal.gdc.cancer.gov/), etc.?

On page 4 under the heading ‘Motivation’, “We believe that recommendations outlined here maybe of value to two principal groups of users: 1) data depositors, who need to design and prepare their data to improve its usability to the community”.
Does this include most biologists? I believe that most biologists archive their data to provide a record of their studies. Do the ‘data depositors’ in the article, and its intended audience, refer to bioimaging database curators/producers rather than bench biologists?

I believe that the ten recommendations would apply equally to most biologists, even though their aim is to provide a record of their studies, and that the recommendations would help database curators to organise their bioimage data in more meaningful ways.

I would like that part to state its intended audience more clearly.

We accept the correction and have expanded the introductory paragraphs to outline specific audiences as well as clarified the type of user that ‘user’ refers to.

With regard to the recommendations, on page 8, ‘Naming’ (5) Meaningful names (b) “Consider avoiding ambiguous attributes such as dates and times.”
The argument that they have “subtle variations” is not obvious to me. Is it because of variations in the date formats used in different countries? Would it be solved if the ISO 8601 (https://en.wikipedia.org/wiki/ISO_8601) date format were used? Would that be a better recommendation? If not, would the authors care to expand their argument, as dates are used frequently in filenames?

We accept the correction and have edited the text to better reflect the intended meaning.

On page 9, (6) Naming symbols, (a) consider confining to lowercase letters.
It seems rather arbitrary to confine names to lowercase; why would all uppercase letters not work instead?

We accept the correction and include arguments why we think it is preferable for file and folder names to be defined using lowercase letters.

Similarly, in (b) avoid non-ASCII characters. Shouldn’t we be more inclusive of other languages that are non-ASCII, e.g., European characters, or double-byte Japanese, Korean and Chinese characters?

From a computer coding point of view, I intuitively understand the rationale for choosing ASCII, but the article doesn’t seem to provide a valid argument for it. May I suggest that the authors use the international POSIX standard (Portable Operating System Interface; IEEE 1003, ISO/IEC 9945) instead (refs: https://en.wikipedia.org/wiki/POSIX; https://www.ibm.com/docs/en/zos/2.2.0?topic=locales-posix-portable-file-name-character-set)? Choosing an international standard makes more sense than creating a separate standard specifically for bioimaging data. If the authors would like to keep their recommendations, I would like to see more justification for doing so.

We accept the correction and now refer to POSIX as the standard to adhere to, and provide reasons for doing so.
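
A minimal sketch of how a single file or folder name could be checked against the POSIX portable filename character set (A-Z, a-z, 0-9, period, underscore and hyphen). The function names are hypothetical and are not taken from bandbox; the lowercase check reflects recommendation (6a) and is an additional constraint, not part of POSIX.

```python
import re

# POSIX portable filename character set: A-Z, a-z, 0-9, period, underscore, hyphen.
_PORTABLE = re.compile(r"[A-Za-z0-9._-]+")


def is_portable_name(name: str) -> bool:
    """Return True if a single file or folder name uses only portable characters."""
    # POSIX additionally advises that a name should not start with a hyphen.
    return bool(_PORTABLE.fullmatch(name)) and not name.startswith("-")


def is_lowercase_portable_name(name: str) -> bool:
    """Portable name that also contains no uppercase letters (assumed extra rule)."""
    return is_portable_name(name) and name == name.lower()


if __name__ == "__main__":
    for candidate in ["tomo_001.tiff", "Tomo_001.tiff", "my file.tif", "données.tif"]:
        print(candidate, is_portable_name(candidate), is_lowercase_portable_name(candidate))
```

The check is applied per name rather than per path, so that path separators do not need special handling.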

On (d) upper limit on the length of file and folder names. The authors proposed a working upper limit of 50 characters. Again, this seems arbitrary: why not 80 characters, i.e. one line length on an old CRT terminal? The browser limit is a good reason, but I would like to see a more robust argument that a length of 50 characters is a good compromise.

The reviewer’s comment raises a valid point. However, it is important to bear in mind that file and folder names add to one another: with a limit of 80 characters per name, a depth of three folders admits paths of up to 240 characters. It is hard to determine precisely what would be reasonable: 20-30 characters may be too short in many cases. One option would be to examine the distribution of file name lengths in current archives, but if the objective is to follow good rather than current practice this may not be sound.

The authors propose the above limits to start a conversation with the community on what would be a sensible value or range.
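
The arithmetic behind this argument can be sketched as follows. This is an illustration only: the function names are hypothetical, the 240-character figure above ignores path separators (hence the optional flag), and 50 is the working limit proposed in the article.

```python
from pathlib import PurePosixPath


def max_path_length(name_limit: int, levels: int, count_separators: bool = False) -> int:
    """Worst-case path length when every name at every level reaches the limit."""
    total = name_limit * levels
    if count_separators and levels > 0:
        total += levels - 1  # one '/' between consecutive names
    return total


def names_over_limit(path: str, limit: int = 50) -> list:
    """Return the components of a path whose length exceeds the per-name limit."""
    return [part for part in PurePosixPath(path).parts if len(part) > limit]


if __name__ == "__main__":
    print(max_path_length(80, 3))  # three names of 80 characters each
    print(max_path_length(50, 3))  # the same depth with the proposed limit
    print(names_over_limit("raw/" + "x" * 60 + "/image_001.tiff"))
```

Comparing the two limits at the same depth shows how quickly per-name allowances accumulate into long paths.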

The authors used an example of a file path limit of 320 characters in the same paragraph. I believe this may lead readers to confuse it with the filename length limit, which on most computer systems is only 255 characters (ref: https://en.wikipedia.org/wiki/Filename). Since the authors also provide recommendation (3) on folder depth and give an example of path length problems on page 7 (“Very long names of files/folders”), maybe they could discuss and recommend these together, under one section “filename length, path length and folder depth”. It may then be easier for the reader to appreciate the choices that the authors make.

We accept the correction and have restructured the article as suggested.

On recommendation (8) Friendly file formats. Maybe “Widely used file formats” is more applicable? I would prefer “Openly accessible file formats”, i.e., formats that are readable by open-source tools. I guess widely used file formats would fit that description too and reflect more closely what the authors want to convey. Proprietary software tools for accessing proprietary file formats may cause problems in the long run as companies often change hands, e.g., Olympus is now Evident, LaVision is now under Brunker. It is difficult to ensure that companies will keep supporting certain formats in their software tools in the future while funding bodies (in the UK) require archiving data for 10 to 20 years.

We have revised the section title to simply ‘File formats’. We appreciate that some proprietary file formats are unavoidable (e.g., those produced by microscopes), but our emphasis is on the openness of formats because openness encourages the availability of tools that can reliably read the data. We have updated 8(b) to reflect this point.

Competing Interests: No competing interests were disclosed.

Reviewer Report 01 Nov 2023

Sjors Scheres, Medical Research Council Laboratory of Molecular Biology, Cambridge, England, UK

Approved with Reservations

https://doi.org/10.5256/f1000research.142422.r217553

This paper describes recommendations for organizing imaging data from the life sciences for archival purposes. Coming from the EBI, which is responsible for a large proportion of image archiving in the field, this advice is important and worthy of dissemination to the wider scientific community. I am therefore, in principle, enthusiastic about its publication in F1000Research. However, I do think that the manuscript and the explicit recommendations can be improved, as the phrasing is often vague and some of the recommendations are ignored by the authors themselves. I would therefore recommend a careful re-think and re-write, especially of the 10 recommendations, for a revised version.

Specific comments:

Abstract:
p1: The first sentence does not make sense: is 'organised data' an elusive goal?

Motivation:
p3: Would you not consider non-scientists looking at these images?

p3: "in the use of 'ways and means' of effecting the organisation"
-> I have no clue what this means.

p4: "To achieve this ... and so on"
-> These vague statements need rephrasing (e.g. 'we define [..] to the *various* attributes'). Also, what is 'generally available equipment'?

p5: it is not entirely clear to me from reading the paper what the bandbox program does. The paper states that it is based on the 10 recommendations that follow, but as explained below the re-organisation in Figure 4 still violates several recommendations... Perhaps some pseudo-code may be useful? Also, wouldn't it make more sense to first describe the recommendations and then introduce this program?

Recommendations:

Except for recommendation (7), all recommendations start with the word 'consider'. Given these are recommendations, that is superfluous. It may be clearer to use an imperative to directly state the recommendation (like done in 7).

p6: How is a "raw" TIFF file defined?

(3a) "the fewer the better" means a depth of 1 is best. This is probably not what the authors intended.

(3b) Having a subfolder called 'tiff' is often a good idea, e.g. when there is also a file with metadata describing those tiff images (which is typically the case). In fact, the recommended Figure 4 has a 'raw' folder, which has exactly the same meaning, thus contradicting this recommendation.

(5a) What are "any references that are tied to the instrument" and why should they be excluded? If these are references to the microscope used, they may be relevant to the user?

(5b) Why would dates in filenames be ambiguous and why should they be avoided? Many data acquisition software packages write files with dates and times in their names. Renaming these would, as the authors themselves point out, indeed be complicated and possibly lead to errors.

(7a) I have no clue what this means: "similar folders at different depths have the same names"

(7b) What are "personal identifiers"?

(7c) The name 'data' is actually used in the line below and in the recommended Figure 4. Also, I don't see why 'images' won't be an excellent name for a folder that contains images?

(7e) This recommendation may not be limited to slices of 3D data, which seems an arbitrarily narrow example for such broad recommendations. I personally thought of zero-padding images when I first read this (apparently not carefully enough!). Perhaps using a term like "leading zeros" may be less ambiguous? Although it may be useful to some readers, this is the only recommendation that has an explicit explanation of how to do this on two specific computer systems. Wouldn't this be something that could simply be implemented in the bandbox program, so it could be used on any computer?

(8) What are "friendly file formats?". Also, the term "widely used file formats" is not well defined.

(9a) The example in Figure 4 does not have a README file...

(9b) "This can be achieved ... data presents" -> This sounds superfluous and condescending.

p11: The proposed path "data/brief_description/treatment3..." violates at least recommendations 3 (unused subfolder 'brief_description') and 7 (use of word 'data')
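
To make the "leading zeros" reading of (7e) concrete, a platform-independent sketch in Python is given below. The function name and the naming pattern (stem, underscore, index, extension) are assumptions made for illustration; this is not how bandbox or the article implements it.

```python
def padded_name(stem: str, index: int, max_index: int, ext: str) -> str:
    """Build a name whose numeric index carries leading zeros.

    The width is derived from the largest index so that every name in the
    series has the same length and plain alphabetical sorting matches
    numerical order.
    """
    width = len(str(max_index))
    return f"{stem}_{index:0{width}d}{ext}"


if __name__ == "__main__":
    names = [padded_name("slice", i, 120, ".tif") for i in (10, 2, 100, 1)]
    print(sorted(names))
```

Without the padding, a lexicographic sort would place `slice_10` and `slice_100` before `slice_2`.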

Is the topic of the opinion article discussed accurately in the context of the current literature?
Yes

Are all factual statements correct and adequately supported by citations?
Partly

Are arguments sufficiently supported by evidence from the published literature?
Yes

Are the conclusions drawn balanced and justified on the basis of the presented arguments?
Partly

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Structural biologist; software developer

Author Response 22 Mar 2024

Gerard J Kleywegt, EMBL-EBI, Wellcome Genome Campus, Hinxton, CB10 1SD, UK

Response to Reviewer #1 (in italics)

Specific comments:

Abstract:
p1: The first sentence does not make sense: is 'organised data' an elusive goal?

Organised data is not in itself an elusive goal. However, when the volume and variety of data increase by orders of magnitude, maintaining organisation and coherence in the data becomes difficult, which in turn makes the data difficult to use. Therefore, organised data, in the context of large heterogeneous datasets, is an elusive goal.

Motivation:
p3: Would you not consider non-scientists looking at these images?

In the article we use the term ‘scientist’ for anyone who aims to use data for some end. The claim is not that only scientists look at data; rather, anyone (formal scientist or not) who uses the data is referred to as a scientist. The order of terminology is important.

p3: "in the use of 'ways and means' of effecting the organisation"
-> I have no clue what this means.

p4: "To achieve this ... and so on"
-> These vague statements need rephrasing (e.g. 'we define [..] to the *various* attributes'). Also, what is 'generally available equipment'?

We have rephrased vague statements in line with this remark.

p5: it is not entirely clear to me from reading the paper what the bandbox program does. The paper states that it is based on the 10 recommendations that follow, but as explained below the re-organisation in Figure 4 still violates several recommendations... Perhaps some pseudo-code may be useful? Also, wouldn't it make more sense to first describe the recommendations and then introduce this program?

We have moved the description of what bandbox does to the Software Availability section.

Recommendations:

Except for recommendation (7), all recommendations start with the word 'consider'. Given these are recommendations, that is superfluous. It may be clearer to use an imperative to directly state the recommendation (as is done in 7).

p6: How is a "raw" TIFF file defined?

We have replaced this with the phrase ‘uncompressed TIFF files’.

(3a) "the fewer the better" means a depth of 1 is best. This is probably not what the authors intended.

We accept the correction and have clarified the argument based on the ISA (investigation, study, assay) framework.

(3b) Having a subfolder called 'tiff' is often a good idea, e.g. when there is also a file with metadata describing those tiff images (which is typically the case). In fact, the recommended Figure 4 has a 'raw' folder, which has exactly the same meaning, thus contradicting this recommendation.

We accept the correction and have revised the text for clarity that we are referring to intermediate folders where none are required.

(5a) What are "any references that are tied to the instrument" and why should they be excluded? If these are references to the microscope used, they may be relevant to the user?

We accept the correction and have revised the phrase.

(5b) Why would dates in filenames be ambiguous, and why should they be avoided? Many data acquisition software packages write files with dates and times in their names. Renaming these would, as the authors themselves point out, indeed be complicated and possibly lead to errors.

The emphasis in the article is on having dates in folder names, not file names. In (1b) we mention dates in file names as a possibility. Nevertheless, we do caution that date-time stamps in file names can include subtle variations, such as differing seconds, so that numerous related files become non-trivial to work with.

(7a) I have no clue what this means: "similar folders at different depths have the same names"

We have revised the recommendation and included an example with reference to Figure 3.

(7b) What are "personal identifiers"?

We have included a parenthetical remark with examples to illustrate what personal identifiers are.

(7c) The name 'data' is actually used in the line below and in the recommended Figure 4. Also, I don't see why 'images' won't be an excellent name for a folder that contains images?

These examples are purely for illustration purposes but are inspired by the actual structure used in EMPIAR in which the ‘data’ directory sits beside an XML file e.g. https://ftp.ebi.ac.uk/empiar/world_availability/10002/, which we have omitted here. They were generated from the examples provided in the git repository.

We believe it is better to have descriptive folder names as opposed to generic names, which provide no meaningful information. The name ‘images’ does not convey any meaningful information. Better would be something like ‘tomograms’ or ‘particles’. Nevertheless, this is configurable in bandbox using the bandbox/obvious_files option in the configuration file.

(7e) This recommendation may not be limited to slices of 3D data, which seems an arbitrarily narrow example for such broad recommendations. I personally thought of zero-padding images when I first read this (apparently not careful enough!). Perhaps using a term like "leading zeros" may be less ambiguous? Albeit perhaps useful to some of the readers, this is the only recommendation that has an explicit explanation of how to do this on two specific computer systems. Wouldn't this be something that could only be implemented in the bandbox program, so it could be used on any computer?

We accept the correction and have rewritten the recommendation to clarify the context. We agree that implementing this in bandbox would enable a cross-platform solution and will plan this for a future release.
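To illustrate what a cross-platform approach to leading zeros might look like, here is a minimal Python sketch. It is not bandbox code and does not reflect any planned bandbox feature; the function names (`last_number`, `pad_name`, `plan_renames`) are hypothetical, and the sketch assumes bare file names in which the last run of digits is the sequence number.

```python
import re
from pathlib import Path

DIGITS = re.compile(r"\d+")


def last_number(stem):
    """Return the regex match for the last run of digits in stem, or None."""
    matches = list(DIGITS.finditer(stem))
    return matches[-1] if matches else None


def pad_name(name, width):
    """Pad the last number in a bare file name with leading zeros up to width."""
    path = Path(name)
    match = last_number(path.stem)
    if match is None:
        return name  # no digits in the stem, so nothing to pad
    padded = match.group().zfill(width)
    return path.stem[:match.start()] + padded + path.stem[match.end():] + path.suffix


def plan_renames(names):
    """Map each name that needs padding to its padded form; no files are touched."""
    found = [last_number(Path(n).stem) for n in names]
    # Use the longest sequence number in the set as the common width.
    width = max((len(m.group()) for m in found if m), default=0)
    plan = {}
    for name in names:
        new_name = pad_name(name, width)
        if new_name != name:
            plan[name] = new_name
    return plan


print(plan_renames(["slice_1.tif", "slice_10.tif", "slice_100.tif"]))
```

The sketch only proposes new names rather than renaming anything, so the plan can be reviewed first; once the names are padded, a plain alphabetical sort of the files matches their numeric order on any operating system.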

(8) What are "friendly file formats"? Also, the term "widely used file formats" is not well defined.

This has been revised to simply ‘File formats’.

(9a) The example in Figure 4 does not have a README file…

A README file has been added in the updated figure.

(9b) "This can be achieved ... data presents" -> This sounds superfluous and condescending.

This sentence has been deleted in the article.

p11: The proposed path "data/brief_description/treatment3..." violates at least recommendations 3 (unused subfolder 'brief_description') and 7 (use of word 'data')

As mentioned above, the example used here is purely illustrative and omits other content which would otherwise not violate this recommendation.
Competing Interests: No competing interests were disclosed.


Version 2

VERSION 2 PUBLISHED 23 Oct 2023

Open Peer Review

Reviewer Reports

Invited reviewers: 1, 2, 3, 4
Version 2 (revision), 27 Feb 24: reports from reviewers 1, 2, 3 and 4
Version 1, 23 Oct 23: reports from reviewers 1, 2, 3 and 4

Sjors Scheres, Medical Research Council Laboratory of Molecular Biology, Cambridge, UK
Kenneth H. L. Ho, The Francis Crick Institute, London, UK
Sylvia Emmanuelle Le Dévédec, Universiteit Leiden, Leiden, The Netherlands
William T. Katz, Howard Hughes Medical Institute’s Janelia Research Campus, Ashburn, USA

Virginia Scarlett, Howard Hughes Medical Institute’s Janelia Research Campus, Ashburn, USA

Reviewer Report

09 May 2024 | for Version 2

William T. Katz, Howard Hughes Medical Institute’s Janelia Research Campus, Ashburn, USA

Virginia Scarlett, Howard Hughes Medical Institute’s Janelia Research Campus, Ashburn, Virginia, USA

Approved With Reservations

We feel that the article remains at the status of 'Approved with Reservations'. We are grateful to the authors for the revisions that have been implemented, including a clearer explanation of 'organisational resources', improved readability of the figures, and a limit on the output of the bandbox program. However, the article references REMBI but not QUAREP-LiMi, and while it cites OME projects (OME-NGFF and OME-TIFF), it does not mention the impacts or limitations of those important efforts. Also, the article remains vague with respect to its audience. The authors should clarify to which biomedical imaging storage approaches (e.g., flat files, chunked formats, cloud-based APIs, etc.) their recommendations and tooling apply. If the article is intended for users of flat formats (such as TIFF) working on file systems, then the authors should clarify this because the title suggests a very broad applicability to archiving bioimaging data.

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Data engineering; biomedical image processing and analysis.

Reviewer Report

04 Apr 2024 | for Version 2

Kenneth H. L. Ho, Advanced Light Microscopy, The Francis Crick Institute, London, England, UK

Approved

The authors have addressed all my previous comments, and I am satisfied with their revision. I have no further comments to make.

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Bioimage informatics

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer Report

18 Mar 2024 | for Version 2

Sylvia Emmanuelle Le Dévédec, Division of Drug Discovery and Safety, Leiden Academic Centre of Drug Research, Universiteit Leiden, Leiden, South Holland, The Netherlands

Approved

The revised version has effectively addressed the majority of my comments. Clarity regarding the intended audience and the importance of data archiving has been improved, and ambiguities have been identified and addressed.

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Biology; image-based phenotypic profiling; microscopist; data generator; core facility management; FAIR metadata

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer Report

13 Mar 2024 | for Version 2

Sjors Scheres, Medical Research Council Laboratory of Molecular Biology, Cambridge, England, UK

Approved

Most of my comments have been addressed in the revised version.

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Structural biologist; software developer

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer Report

21 Nov 2023 | for Version 1

William T. Katz, Howard Hughes Medical Institute’s Janelia Research Campus, Ashburn, USA

Virginia Scarlett, Howard Hughes Medical Institute’s Janelia Research Campus, Ashburn, Virginia, USA

Approved With Reservations

Is the topic of the opinion article discussed accurately in the context of the current literature? Yes

Are all factual statements correct and adequately supported by citations? Partly

Are arguments sufficiently supported by evidence from the published literature? Partly

Are the conclusions drawn balanced and justified on the basis of the presented arguments? Yes

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Data engineering; biomedical image processing and analysis.

Author Response

22 Mar 2024

Gerard J Kleywegt, EMBL-EBI, Wellcome Genome Campus, Hinxton, CB10 1SD, UK

Response to Reviewer #4 (in italics)

Specific comments:

In this opinion article, the authors tackle an important but often-overlooked aspect of biomedical data archives: how best to organize data folders and files to maximize ease of use. Ten recommendations are provided as well as a lightweight command-line tool for inspecting datasets. Since there continues to be an acceleration in both the number and size of these datasets accessible through various repositories, both the recommendations and tool from experienced archivists are useful and should be published, though we feel some revision of the document is warranted.

The introduction describes the broader context of bioimaging data management before focusing on the contributions of the article. There could be clearer differentiation of efforts to standardize bioimaging metadata (REMBI, QUAREP-LiMi), file formats and associated libraries (OME-TIFF, OME-NGFF, Zarr, n5), and local or cloud-based services that provide Data APIs (DVID, BossDB) with some level of abstraction in how data is actually stored. The recommendations and tool mainly apply to file-based solutions though some of the recommendations, such as naming, would be applicable to other forms of big data repositories. We suggest that the authors clarify the scope of their contributions.

It should be noted that some of the efforts to standardize data and its distribution also have recommendations for organization of data. For example, OME-NGFF requires segmentation to be in a directory called “labels/”.

In the third paragraph of Motivation, the terms “ways and means” and “organisational resources” are unclear though some of your examples (folder hierarchy, file formats, identifiers) show how data can be organized. We suggest you start with some examples and then introduce “organizational resources” as a term.

We accept this correction and have updated the text to better reflect this point.

If standardization is not an aim, can bandbox be configured to remove warnings not agreed upon by a user? In Figure 2, the printing of the word “warning” for datasets with no red flags seems odd. We would suggest using “check” as in “name check” or “structure check” if no warnings exist.

We have released an updated version (bandbox v0.2.2) where these have been amended.

Given recommendation (8)b and the article’s bioimaging focus, the bandbox tool should work by default with well-known, large-scale formats like Zarr and N5. In testing, it appears that bandbox doesn’t recognize file extensions used by such formats, like .json and .zarr. The configurability of bandbox is a nice feature and should be mentioned in the article. This would allow other tool builders to contribute configurations for validating common formats, and it seems like the regex capability could allow folder hierarchy requirements.

We have clarified in the text that bandbox is configurable.

The command-line bandbox tool should limit warning output to some maximum number of lines by default. This is particularly true for massive, chunked datasets consisting of many files and folders. We would suggest adding a “verbose” flag to allow full results to be output perhaps to a file.

This has been updated in bandbox v0.2.2. Instead of printing all results by default, we have substituted the -S/--summarise flag with a -a/--all flag so that by default users don’t get overwhelmed. The instruction to use the new flag is now highlighted in yellow text beneath each section with more than a certain number of results.

Some minor points:

The description of the bandbox tool could be moved out of the Motivation section and after listing the recommendations.

We have now included a detailed description of bandbox in the Software Availability section.

Figure 1 has too small font sizes and would not be readable for printed copies as well as expending quite a bit of black ink.

We accept the suggestion and have changed all images to have a light background.

The phenomenon described in the first sub-bullet under 'Verbosity' is an interesting point that seems to deserve its own name. Maybe something like, 'redundant nesting' or 'over-nesting'. A name would also make it easier to connect to solution 3A, which is conceptually related. Also, the second half of this bullet point is in monospace font, but it should be Times New Roman or whatever.

We have given the section the name ‘Verbosity/Redundancy’.

The use of monospace font here is intentional to distinguish between literal text and computer text (file/folder names, commands, tools).

Some recommendations are more universally advisable than others. For those points, we’d recommend dropping “Consider” for stronger language.

In (4)b, “Most archives allow multiple separate entries to be linked or grouped,” it’s not clear what qualifies as an archive since data could be made available through cloud providers’ object stores and other facilities.

We have provided a definition of ‘archive’ in the opening paragraph of the article.

In (5)b, could you clarify in what ways dates and times are “ambiguous attributes”?

We have provided an explanation on this in the text.

Dates and times are ambiguous to the extent that they do not provide meaningful attributes associated with the experiment. While it can be assumed that dates on file names refer to the date of collection, this is not instrumental to the actual data i.e. knowing the date of collection adds no scientific value. Furthermore, having every single image file with the same date consumes precious ‘naming space’ of files, which can either be provided once in the name of the parent folder or as part of the metadata, where it would be expected to convey useful information to users.

In the sentence, “bandbox examines the tree associate with the nested hierarchy…” the word bandbox should be in monospace font.

This has been corrected in the text.

What is the rationale for limiting folder depth to 3 or 4 levels?

We have argued this point based on the ISA framework.
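As an illustration of the folder-depth question raised above, the following is a minimal sketch of how nesting depth could be measured for a list of paths. It is not taken from bandbox; the function names `max_depth` and `too_deep`, the example paths and the default limit of 4 are assumptions chosen only to mirror the "3 or 4 levels" under discussion.

```python
from pathlib import PurePosixPath


def folder_depth(relative_path):
    """Number of folders above a file, for a path given relative to the dataset root."""
    # A file directly in the root has depth 0; each parent folder adds one level.
    return len(PurePosixPath(relative_path).parts) - 1


def max_depth(relative_paths):
    """Deepest folder nesting found among the given paths (0 for an empty list)."""
    return max((folder_depth(p) for p in relative_paths), default=0)


def too_deep(relative_paths, limit=4):
    """Paths whose folder depth exceeds the (hypothetical) limit."""
    return [p for p in relative_paths if folder_depth(p) > limit]


# Hypothetical example paths, loosely following an investigation/study/assay layout.
paths = [
    "README.txt",
    "investigation/study/assay/image_001.tif",
    "a/b/c/d/e/image_002.tif",
]
print(max_depth(paths))
print(too_deep(paths))
```

A check of this kind only reports depth; whether a given depth is justified still depends on the experiment being described.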

In (7)b, “Do not include personal identifiers in folder names.” Personal identifiers should be clarified.

We accept this point and have provided some examples of what is meant by ‘personal identifiers’.

For (7)e, zero-padding should be considered for any sequentially ordered set of files. A good case is 2D slices of a 3D volume as described.

We have included ‘sequential ordering’ as another example of this phenomenon.
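To make the zero-padding point concrete, here is a small sketch showing why unpadded sequence numbers sort incorrectly and how padding the last run of digits to a common width fixes this. The helper `zero_pad` and the file names are hypothetical and are not part of bandbox or of the article.

```python
import os
import re

# Matches the last run of digits in a string (no further digit follows it).
LAST_DIGITS = re.compile(r"(\d+)(?!.*\d)")


def zero_pad(names):
    """Pad the last number in each file stem with leading zeros to a common width."""
    split = [os.path.splitext(name) for name in names]
    matches = [LAST_DIGITS.search(stem) for stem, _ in split]
    width = max((len(m.group(1)) for m in matches if m), default=0)
    padded = []
    for (stem, ext), m in zip(split, matches):
        if m:
            stem = stem[:m.start()] + m.group(1).zfill(width) + stem[m.end():]
        padded.append(stem + ext)
    return padded


# Hypothetical 2D slices of a 3D volume.
names = ["slice_1.tif", "slice_2.tif", "slice_10.tif", "slice_100.tif"]
print(sorted(names))            # lexical order puts slice_10 and slice_100 before slice_2
print(sorted(zero_pad(names)))  # after padding, lexical and numeric order agree
```

The sketch only computes new names; applying them to files on disk (for example with `os.rename`) is deliberately left out, since renaming should be done with care.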

For (8)b, consider citing OME-NGFF and OME-TIFF as recommended community formats.

We accept the suggestion and have amended the text as requested.

For (9)a, the recommendation for an overview could explicitly suggest listing the facets used to organize the data.

We have included a sentence outlining what may be included in the README file.

In Figure 4, is the single “brief_description” folder at that level recommended instead of adding the descriptive information to a README file? Perhaps a real description should be used in the example to make it clear why recommendation (3)b doesn’t apply.

This term is purely illustrative as are the names of the files and folders.

Competing Interests

No competing interests were disclosed.

Reviewer Report

21 Nov 2023 | for Version 1

Sylvia Emmanuelle Le Dévédec, Division of Drug Discovery and Safety, Leiden Academic Centre of Drug Research, Universiteit Leiden, Leiden, South Holland, The Netherlands

Approved With Reservations

The abstract lacks clarity on the intended audience of these recommendations. It is essential to specify whether the guidelines primarily target core facility managers, data managers/stewards, bioinformaticians, or experimentalists.
If the guidelines are intended for experimentalists, the current manuscript may not align with the needs of this non-expert audience. The language and content may need to be adapted to cater to individuals with limited knowledge in data management.
The phrase "make future data depositions more useful" needs clarification. Who benefits from this increased usefulness, and in what way? Is the goal to enhance practicality, efficiency, or accessibility? A more specific explanation would enhance the abstract's clarity.
The term "bioimaging community" is used in the abstract, but its specific meaning in the context of this manuscript is unclear. Defining this community will provide readers with a better understanding of the scope and relevance of the guidelines.
The abstract mentions that Bandbox is designed "to facilitate the process of analyzing data organization." It would be beneficial to elaborate on how the analyzing functionality of Bandbox directly benefits the bioimaging community. Specific examples or scenarios demonstrating its advantages would enhance the abstract's informativeness.

Introduction

What does data ‘archiving’ mean exactly in this specific manuscript?
Objective of the guideline: Harmonising how to organise datasets for maximum usefulness with archival in mind?
Where in the data life cycle should this organisation occur: before, during or after generation? And where should it take place, i.e. in which physical storage space?
Organisation = order of the data. Organisation or order of the data should be implicitly connected to the related metadata and even contained somewhere in the metadata.
‘Good organisation (order) of data improves its usefulness and is the responsibility of the data depositors.’ Do you mean here the data generator or specifically the data depositor? Based on the description it seems like the data depositor is implicitly the data generator.
‘Users can immediately distinguish the various experimental categories’: should you not refer to (p)ISA to clarify what is meant by ‘experimental categories’ (https://doi.org/10.1038/s41597-022-01805-5)?
‘Facet refers to the various attributes germane to the experiment which may be included in the folder and file names’. Should ‘facet’ not be called ‘key’? If not then explain the differences between both terms.

Recommendations:

The potential users for these recommendations lack clear definition, and depending on the proposed users, the guide should be tailored for optimal understanding. Data depositors and generators often have different levels of familiarity compared to program developers or data stewards, employing distinct languages. Addressing these differences is crucial for ensuring accessibility and effectiveness.
Open-source command-line interfaces can be intimidating, particularly for experimentalists who serve as the primary data generators and often act as data depositors. As a cell biologist and experimentalist, I find the proposed CLI tool, while impressive and useful, potentially challenging to navigate comfortably. Enhancements in user-friendliness or alternative interfaces might significantly benefit experimentalists who are integral to both data generation and deposition.
Given the recommendation for data producers to pre-define structures before data collection, it becomes apparent that the target audience of this guide is experimentalists with limited knowledge of data management and programming. Including guidelines or tips on naming conventions would be particularly valuable for such users, enhancing the practicality and applicability of the recommendations.
The suggestion regarding folder contents description appears somewhat vague and may not be universally suitable for various experiment types. A more nuanced approach that considers the diversity of experiments would enhance the guide's usability.
The concept of "meaningful names" for folders raises questions about subjectivity and human sensitivity, which may not align with the precision required for effective data management structures. Establishing a clear naming convention that is objectively applicable across various contexts would contribute to the robustness and reliability of the guide.

Is the topic of the opinion article discussed accurately in the context of the current literature?

Partly
Are all factual statements correct and adequately supported by citations?

Yes
Are arguments sufficiently supported by evidence from the published literature?

Partly
Are the conclusions drawn balanced and justified on the basis of the presented arguments?

Yes

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Biology; image-based phenotypic profiling; microscopist; data generator; core facility management; FAIR metadata

Responses (1)

Author Response

22 Mar 2024

Gerard J Kleywegt, EMBL-EBI, Wellcome Genome Campus, Hinxton, CB10 1SD, UK

Response to Reviewer #3 (in italics)

Specific comments:

The guideline presented by Korir and colleagues, who are recognized experts in data structure, data organization, and FAIRification, marks an important step toward fostering a comprehensive discussion on the management of bioimaging data. The authors, primarily developers and bioinformaticians dealing with intricate datasets, have assembled a set of recommendations that, while valuable, may be perceived as overly abstract, potentially posing challenges for experimentalists who serve as the primary data producers.

One critical aspect that emerges is the need for greater clarity regarding the intended audience for this guideline. Currently, it appears somewhat ambiguous, leading to potential misalignment with the individuals it should be primarily targeting. It is recommended that the authors explicitly define the community they aim to address at the outset of the manuscript. If, indeed, the target audience is the data producers, particularly experimentalists, then a comprehensive revision of the recommendations may be necessary. Consideration should be given to conveying the guidelines in a more accessible language, ensuring that the practical implications for experimentalists are clearly delineated. Additionally, the authors might explore the possibility of tailoring specific sets of guidelines for distinct roles, such as data managers and data producers, to enhance relevance and utility.

Below are listed some specific points of attention:

Abstract:

The abstract lacks clarity on the intended audience of these recommendations. It is essential to specify whether the guidelines primarily target core facility managers, data managers/stewards, bioinformaticians, or experimentalists.

We welcome the suggestion and have updated the abstract to clarify the intended audience.

If the guidelines are intended for experimentalists, the current manuscript may not align with the needs of this non-expert audience. The language and content may need to be adapted to cater to individuals with limited knowledge in data management.

We aim to address a wide and varied audience, so the language and terminology need to strike a balance for the content to be accessible and digestible by different groups. We hope we have managed a reasonable balance, especially following the many constructive suggestions of all the reviewers. If this reviewer has specific comments on sections in the revised manuscript that could be improved further in this respect, we would be happy to attempt to do so.

The phrase "make future data depositions more useful" needs clarification. Who benefits from this increased usefulness, and in what way? Is the goal to enhance practicality, efficiency, or accessibility? A more specific explanation would enhance the abstract's clarity.

We accept the correction and have spelled out in more precise terms what ‘more useful’ means and to whom this applies.

The term "bioimaging community" is used in the abstract, but its specific meaning in the context of this manuscript is unclear. Defining this community will provide readers with a better understanding of the scope and relevance of the guidelines.

We accept the correction and have amended the text to reflect this.

The abstract mentions that Bandbox is designed "to facilitate the process of analyzing data organization." It would be beneficial to elaborate on how the analyzing functionality of Bandbox directly benefits the bioimaging community. Specific examples or scenarios demonstrating its advantages would enhance the abstract's informativeness.

We accept the correction and have included, in the text, some examples of what bandbox is capable of doing.

Introduction

What does data ‘archiving’ mean exactly in this specific manuscript?

We have included a definition of ‘archiving’ in the opening paragraph of the article.

Objective of the guideline: Harmonising how to organise datasets for maximum usefulness with archival in mind?

Yes.

Where in the data life cycle should this organisation occur: before, during or after generation? And where should it take place, i.e. in which physical storage space?

The earlier the better. Recommendation #1 (Design before data collection) highlights the impact of data planning before collection commences. The remaining recommendations outline various suggestions on how to improve the usability of the data. The organisation typically would happen on the storage device but can be done either through consoles or the appropriate graphical user interfaces.

Organisation = order of the data. Organisation or order of the data should be implicitly connected to the related metadata and even contained somewhere in the metadata.

We have edited the text for clarity.

‘Good organisation (order) of data improves its usefulness and is the responsibility of the data depositors.’ Do you mean here the data generator or specifically the data depositor? Based on the description it seems like the data depositor is implicitly the data generator.

‘Data depositors’ here refers to the individual(s) responsible for making the submission to the archive (previously defined), who may or may not be the generator of the data. In many cases, the depositor is familiar with the data because they performed the analyses.

‘Users can immediately distinguish the various experimental categories’: should you not refer to (p)ISA to clarify what is meant by ‘experimental categories’ (https://doi.org/10.1038/s41597-022-01805-5)?

We are grateful to the reviewer for pointing out this reference, which is now referred to in the text.

‘Facet refers to the various attributes germane to the experiment which may be included in the folder and file names’. Should ‘facet’ not be called ‘key’? If not then explain the differences between both terms.

We used the term ‘facet’ in the same sense as in multifaceted, implying that a dataset may be viewed from various perspectives to discern distinct properties much in the same way as a gem. The reviewer’s proposal of ‘key’ does not fit this sense.

Recommendations:

The potential users for these recommendations lack clear definition, and depending on the proposed users, the guide should be tailored for optimal understanding. Data depositors and generators often have different levels of familiarity compared to program developers or data stewards, employing distinct languages. Addressing these differences is crucial for ensuring accessibility and effectiveness.

We have addressed the specificity of the audience in the amendments to the abstract (above).

Open-source command-line interfaces can be intimidating, particularly for experimentalists who serve as the primary data generators and often act as data depositors. As a cell biologist and experimentalist, I find the proposed CLI tool, while impressive and useful, potentially challenging to navigate comfortably. Enhancements in user-friendliness or alternative interfaces might significantly benefit experimentalists who are integral to both data generation and deposition.

We accept this comment and are only constrained by our capacity to extend the CLI tool to achieve the desired usability.

Given the recommendation for data producers to pre-define structures before data collection, it becomes apparent that the target audience of this guide is experimentalists with limited knowledge of data management and programming. Including guidelines or tips on naming conventions would be particularly valuable for such users, enhancing the practicality and applicability of the recommendations.

Recommendations 5, 6 and 7 go into considerable detail about what names to choose, which symbols to use in names and matters relating to identity. We are willing to revise any of the provided recommendations which remain unclear.

The suggestion regarding folder contents description appears somewhat vague and may not be universally suitable for various experiment types. A more nuanced approach that considers the diversity of experiments would enhance the guide's usability.

We appreciate that the authorship of this article does not represent the universe of experimental methods in imaging. We do point out various facets that may be relevant but leave it up to depositors (generators), who are in the best position to judge which to use when structuring/naming folders. We also point out in the abstract that we offer these recommendations to start discussions in various data-rich communities.

The concept of "meaningful names" for folders raises questions about subjectivity and human sensitivity, which may not align with the precision required for effective data management structures. Establishing a clear naming convention that is objectively applicable across various contexts would contribute to the robustness and reliability of the guide.

As stated above, we do not think it necessary to specify exactly how data should be organised given the vast variety of experiments that can be carried out. We do state in the article (Motivation, paragraph 8) that “...our guide is intended to lead towards best practices rather than serve as a framework. …this guide does not aim to achieve standardisation. We believe it is more practical to have a set of best practices and leave it up to the data authors to decide how best to apply them.”

Competing Interests

No competing interests were disclosed.

Reviewer Report

08 Nov 2023 | for Version 1

Kenneth H. L. Ho, Advanced Light Microscopy, The Francis Crick Institute, London, England, UK

Approved With Reservations

Is the topic of the opinion article discussed accurately in the context of the current literature?

Yes
Are all factual statements correct and adequately supported by citations?

Yes
Are arguments sufficiently supported by evidence from the published literature?

Partly
Are the conclusions drawn balanced and justified on the basis of the presented arguments?

Yes

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Bioimage informatics

Responses (1)

Author Response

22 Mar 2024

Gerard J Kleywegt, EMBL-EBI, Wellcome Genome Campus, Hinxton, CB10 1SD, UK

Response to Reviewer #2 (in italics)

Specific comments:

The article is timely as we are facing a deluge of bioimaging data with higher resolutions and automation. It is therefore an area that needs more discussion and sharing of good practices.

I mostly agree with all the recommendations given, although I feel that the authors may need to make a good argument for some recommendations. Some choices seem arbitrary and I would like to see the rationale behind them.

After reading the paper, I am a bit confused by the article’s intended target audience.
Are the article's recommendations aimed at most of the biologists who archive their bioimaging data mainly for the purpose of peer review and references? Or are the article and recommendations aimed at database curators and producers of bioimaging databases, e.g. IDR (https://idr.openmicroscopy.org/), SSBD (https://ssbd.riken.jp/database/), GDC (https://portal.gdc.cancer.gov/), etc.?

On page 4 under the heading ‘Motivation’, “We believe that recommendations outlined here maybe of value to two principal groups of users: 1) data depositors, who need to design and prepare their data to improve its usability to the community”.
Does it include most biologists? I believe that most biologists archive their data to provide a record of their studies. Do the data depositors in the article, and its intended audience, refer to bioimaging database curators/producers instead of bench biologists?

I believe that the ten recommendations apply equally to most biologists, even though their aim is to provide a record of their studies; the recommendations would also help database curators to organise their bioimage data in more meaningful ways.

I would like that part to make its intended audience clearer.

We accept the correction and have expanded the introductory paragraphs to outline specific audiences as well as clarified the type of user that ‘user’ refers to.

With regard to the recommendations, on page 8, ‘Naming’ (5) Meaningful names (b) “Consider avoiding ambiguous attributes such as dates and times.”
The argument that they have “subtle variations” is not obvious to me. Is it because of variations of date formats used in different countries? Would it be solved if ISO 8601 (https://en.wikipedia.org/wiki/ISO_8601) date format is used? Would that be a better recommendation? If not, would the authors care to expand their argument for that as dates are used frequently in filenames?

We accept the correction and have edited the text to better reflect the intended meaning.
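The reviewer's question about ISO 8601 can be illustrated with a short sketch. It shows that a slash-separated date is ambiguous between day-first and month-first conventions, whereas ISO 8601 dates (YYYY-MM-DD) are unambiguous and sort chronologically as plain text. The dates and the `_grid` folder suffix are hypothetical; this illustrates the reviewer's suggestion and is not a statement of what the revised article recommends.

```python
from datetime import date, datetime

# The same string read under two regional conventions gives two different days.
ambiguous = "03/04/2023"
as_day_first = datetime.strptime(ambiguous, "%d/%m/%Y").date()
as_month_first = datetime.strptime(ambiguous, "%m/%d/%Y").date()
print(as_day_first.isoformat(), as_month_first.isoformat())

# ISO 8601 names: alphabetical order is also chronological order.
collection_days = [date(2023, 11, 2), date(2023, 2, 11), date(2022, 12, 31)]
names = [f"{day.isoformat()}_grid" for day in collection_days]
print(sorted(names))
```

As the response to point (5)b elsewhere in these reports notes, a date shared by every file may be better placed once, in a parent folder name or in the metadata, than repeated in each file name.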

On page 9, (6) Naming symbols, (a) consider confining to lowercase letters.
It seems to be rather arbitrary to confine names to lowercase, why would it not work for all uppercase letters instead?

We accept the correction and include arguments why we think it is preferable for file and folder names to be defined using lowercase letters.

Similarly, in (b) avoid non-ASCII characters. Shouldn’t we be more inclusive of other languages that are non-ascii, e.g., European characters, or double byte Japanese, Korean and Chinese characters?

From a computer coding point of view, I intuitively understand the rationale for choosing ASCII but the article doesn’t seem to provide a valid argument for it. May I suggest that the authors use the international POSIX (Portable Operating System Interface) standard (IEEE 1003, ISO/IEC 9945) (ref: https://en.wikipedia.org/wiki/POSIX; https://www.ibm.com/docs/en/zos/2.2.0?topic=locales-posix-portable-file-name-character-set) instead. Choosing to use an international standard makes more sense than creating another separate standard specifically for bioimaging data. If the authors would like to keep their recommendations, I would like to see more justification for doing so.

We accept the correction and now refer to POSIX as the standard to adhere to as well as provide reasons to do so.
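For readers unfamiliar with the standard, the POSIX portable filename character set consists of the letters A-Z and a-z, the digits 0-9, and the period, underscore and hyphen, with the hyphen not allowed as the first character. A minimal sketch of a check against this set follows; the function names and example file names are hypothetical, and this is not how bandbox implements its checks.

```python
import re

# POSIX portable filename character set: A-Z a-z 0-9 . _ -
PORTABLE = re.compile(r"[A-Za-z0-9._-]+")
NOT_PORTABLE = re.compile(r"[^A-Za-z0-9._-]")


def is_portable(name):
    """True if the name uses only portable characters and does not start with a hyphen."""
    return PORTABLE.fullmatch(name) is not None and not name.startswith("-")


def non_portable_chars(name):
    """Sorted list of the distinct characters that fall outside the portable set."""
    return sorted(set(NOT_PORTABLE.findall(name)))


# Hypothetical file names; "\u00f6" is the letter o with a diaeresis.
print(is_portable("tomogram_001.mrc"))
print(is_portable("Zellk\u00f6rper 01.tif"))
print(non_portable_chars("Zellk\u00f6rper 01.tif"))
```

Note that this checks a single file or folder name, not a full path, so the path separator is deliberately outside the allowed set.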

On (d) upper limit on the length of file and folder names. The authors proposed a working upper limit of 50 characters. Again, it seems to be arbitrary, why not 80 characters, i.e. one line length on the old CRT terminal? The browser limit is a good reason, but I would like to see a more robust argument that 50 characters length is a good compromise.

The reviewer’s comment does raise a valid point. However, it is important to bear in mind that file and folder names add to one another: a name length of 80 means that a depth of three folders will admit paths of up to 240 characters. It is hard to determine precisely what would be reasonable: 20-30 characters may be too short in many cases. One option would be to examine the distribution of file name lengths in current archives, but if the objective is to follow good rather than current practice this may not be sound.

The authors propose the above limits to start a conversation with the community on what would be a sensible value or range.
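The arithmetic in this response can be written out as a small sketch. It multiplies a per-name limit by a nesting depth (ignoring path separators, as the response does) and lists the components of a path that exceed a chosen limit. The function names are hypothetical, and the default of 50 is simply the working value under discussion, not a settled recommendation.

```python
from pathlib import PurePosixPath


def worst_case_path_length(name_limit, depth):
    """Summed length of `depth` nested names, each at the limit (separators excluded)."""
    return name_limit * depth


def long_names(path, name_limit=50):
    """Components of a path (folders and file) that are longer than the limit."""
    return [part for part in PurePosixPath(path).parts if len(part) > name_limit]


print(worst_case_path_length(80, 3))  # the 240 characters mentioned in the response
print(worst_case_path_length(50, 3))  # the same depth with the proposed working limit

# Hypothetical path with one over-long folder name.
example = "study/" + "x" * 51 + "/image_001.tif"
print(long_names(example))
```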

The authors used an example of a file path limit of 320 characters in the same paragraph; I believe readers may confuse this with the filename length limit, which for most computer systems is only 255 characters (ref: https://en.wikipedia.org/wiki/Filename). Since the authors also provide recommendation (3) on folder depth and give an example of path length problems on page 7 (“Very long names of files/folders”), maybe they can discuss and recommend these together under one section, “filename length, path length and folder depth”. It may be easier for the reader to appreciate the choice that the authors make.

We accept the correction and have restructured the article as suggested.

On recommendation (8) Friendly file formats. Maybe “Widely used file formats” is more applicable? I would prefer “Openly accessible file formats”, i.e., formats that are readable by open-source tools. I guess widely used file formats would fit that description too and reflect more closely what the authors want to convey. Proprietary software tools for accessing proprietary file formats may cause problems in the long run as companies often change hands, e.g., Olympus is now Evident, LaVision is now under Bruker. It is difficult to ensure that companies will keep supporting certain formats in their software tools in the future, while funding bodies (in the UK) require archiving data for 10 to 20 years.

We have revised the section title to simply ‘File formats’. We appreciate that there are file formats that are unavoidable but proprietary (e.g., from microscopes) but our emphasis is on the openness of the formats because this enables the prevalence of tools which can reliably read the data. We have updated 8(b) to reflect this point.

Competing Interests

No competing interests were disclosed.

Reviewer Report

01 Nov 2023 | for Version 1

Sjors Scheres, Medical Research Council Laboratory of Molecular Biology, Cambridge, England, UK

Approved With Reservations

Is the topic of the opinion article discussed accurately in the context of the current literature?

Yes
Are all factual statements correct and adequately supported by citations?

Partly
Are arguments sufficiently supported by evidence from the published literature?

Yes
Are the conclusions drawn balanced and justified on the basis of the presented arguments?

Partly

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Structural biologist; software developer

Responses (1)

Author Response

22 Mar 2024

Gerard J Kleywegt, EMBL-EBI, Wellcome Genome Campus, Hinxton, CB10 1SD, UK

Response to Reviewer #1 (in italics)

Specific comments:

Abstract:
p1: The first sentence does not make sense: is 'organised data' an elusive goal?

Organised data is not in itself an elusive goal. However, when the volume and variety of data increase by orders of magnitude, maintaining organisation and coherence in the data is difficult to achieve, which by extension makes the data difficult to use. Therefore, organised data - in the context of large heterogeneous datasets - is an elusive goal.

Motivation:
p3: Would you not consider non-scientists looking at these images?

In the article we use the term ‘scientist’ for anyone who aims to use data for some end. The claim is not that only scientists look at data; rather, anyone (formal scientist or not) who uses the data is referred to as a scientist. The order of terminology is important.

p3: "in the use of 'ways and means' of effecting the organisation"
-> I have no clue what this means.

p4: "To achieve this ... and so on"
-> These vague statements need rephrasing (e.g. 'we define [..] to the *various* attributes'). Also, what is 'generally available equipment'?

We have rephrased vague statements in line with this remark.

p5: it is not entirely clear to me from reading the paper what the bandbox program does. The paper states that it is based on the 10 recommendations that follow, but as explained below the re-organisation in Figure 4 still violates several recommendations... Perhaps some pseudo-code may be useful? Also, wouldn't it make more sense to first describe the recommendations and then introduce this program?

We have moved the description of what bandbox does to the Software Availability section.

Recommendations:

Except for recommendation (7), all recommendations start with the word 'consider'. Given these are recommendations, that is superfluous. It may be clearer to use an imperative to directly state the recommendation (like done in 7).

p6: How is a "raw" TIFF file defined?

We have replaced this with the phrase ‘uncompressed TIFF files’.

(3a) "the fewer the better" means a depth of 1 is best. This is probably not what the authors intended.

We accept the correction and have clarified the argument based on the ISA (investigation, study, assay) framework.

(3b) Having a subfolder called 'tiff' is often a good idea, e.g. when there is also a file with metadata describing those tiff images (which is typically the case). In fact, the recommended Figure 4 has a 'raw' folder, which has exactly the same meaning, thus contradicting this recommendation.

We accept the correction and have revised the text for clarity that we are referring to intermediate folders where none are required.

(5a) What are "any references that are tied to the instrument" and why should they be excluded? If these are references to the microscope used, they may be relevant to the user?

We accept the correction and have revised the phrase.

(5b) Why would dates in filenames be ambiguous, and why should they be avoided? Many data acquisition software packages write files with dates and times in their names. Renaming these would, as the authors themselves point out, be complicated and possibly lead to errors.

The emphasis in the article is on dates in folder names, not file names. In (1b) we mention dates in file names as a possibility. Nevertheless, we caution that date-time stamps in file names can vary subtly (for example, in the seconds field), so that numerous related files become non-trivial to work with.

(7a) I have no clue what this means: "similar folders at different depths have the same names"

We have revised the recommendation and included an example with reference to Figure 3.

(7b) What are "personal identifiers"?

We have included a parenthetical remark with examples to illustrate what personal identifiers are.

(7c) The name 'data' is actually used in the line below and in the recommended Figure 4. Also, I don't see why 'images' won't be an excellent name for a folder that contains images?

These examples are purely illustrative but are inspired by the actual structure used in EMPIAR, in which the ‘data’ directory sits beside an XML file (e.g. https://ftp.ebi.ac.uk/empiar/world_availability/10002/), which we have omitted here. They were generated from the examples provided in the git repository.

We believe it is better to have descriptive folder names as opposed to generic names, which provide no meaningful information. The name ‘images’ does not convey any meaningful information. Better would be something like ‘tomograms’ or ‘particles’. Nevertheless, this is configurable in bandbox using the bandbox/obvious_files option in the configuration file.

(7e) This recommendation may not be limited to slices of 3D data, which seems an arbitrarily narrow example for such broad recommendations. I personally thought of zero-padding images when I first read this (apparently not careful enough!). Perhaps using a term like "leading zeros" may be less ambiguous? Albeit perhaps useful to some of the readers, this is the only recommendation that has an explicit explanation of how to do this on two specific computer systems. Wouldn't this be something that could only be implemented in the bandbox program, so it could be used on any computer?

We accept the correction and have rewritten the recommendation to clarify the context. We agree that implementing this in bandbox would enable a cross-platform solution and will plan this for a future release.
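
A platform-independent way of adding leading zeros can be sketched in Python as follows. This is a minimal sketch and not part of bandbox: the function name `zero_pad` and the example file names are hypothetical, and the code assumes that the last run of digits in each file stem is the sequence number. It only computes the new names and leaves the actual renaming to the user.

```python
import re
from pathlib import PurePosixPath

# Matches the last run of digits in a file stem (e.g. the "10" in "slice_10").
LAST_NUMBER = re.compile(r"(\d+)(?=\D*$)")


def zero_pad(names):
    """Return {old_name: new_name} with the last number padded to equal width."""
    matches = {}
    for name in names:
        path = PurePosixPath(name)
        match = LAST_NUMBER.search(path.stem)
        if match:
            matches[name] = (path, match)
    # Pad to the width of the longest number so names sort in numeric order.
    width = max((len(m.group(1)) for _, m in matches.values()), default=0)
    renamed = {}
    for name, (path, match) in matches.items():
        stem = path.stem
        padded = stem[:match.start()] + match.group(1).zfill(width) + stem[match.end():]
        renamed[name] = str(path.with_name(padded + path.suffix))
    return renamed


# Hypothetical slice names used only for illustration.
slices = ["slice_1.tif", "slice_2.tif", "slice_10.tif"]
mapping = zero_pad(slices)
for old, new in mapping.items():
    print(old, "->", new)
```

Returning a mapping rather than renaming in place lets the proposed names be reviewed before any file is touched, for example by passing each pair to `os.rename` once the list has been checked.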

(8) What are "friendly file formats?". Also, the term "widely used file formats" is not well defined.

This has been revised to simply ‘File formats’.

(9a) The example in Figure 4 does not have a README file…

A README file has been added in the updated figure.

(9b) "This can be achieved ... data presents" -> This sounds superfluous and condescending.

This sentence has been deleted in the article.

p11: The proposed path "data/brief_description/treatment3..." violates at least recommendations 3 (unused subfolder 'brief_description') and 7 (use of word 'data')

As mentioned above, the example used here is purely illustrative and omits other content that, if shown, would mean the recommendations are not violated.


Competing Interests

No competing interests were disclosed.


[1] Brazma A, Hingamp P, Quackenbush J, et al.: Minimum information about a microarray experiment (MIAME)—toward standards for microarray data. Nat. Genet. 2001; 29(4): 365–371.

[2] Lianhua C, Xingquan Z: Hashing Techniques. ACM Computing Surveys (CSUR). 2017.

[3] Datta S, Lakdawala R, Sarkar S: Understanding the Inter-Domain Presence of Research Topics in the Computing Discipline. IEEE Trans. Emerg. Top. Comput. 2021; 9(1): 366–378.

[4] Deissenboeck F, Pizka M: Concise and consistent naming. Softw. Qual. J. 2006; 14(3): 261–282.

[5] Ellenberg J, Swedlow JR, Barlow M, et al.: A call for public archives for biological image data. Nat. Methods. 2018; 15(11): 849–854.

[6] Hartley M, Kleywegt GJ, Patwardhan A, et al.: The BioImage Archive - Building a Home for Life-Sciences Microscopy Data. J. Mol. Biol. 2022; 434: 167505.

[7] Iudin A, Korir PK, Salavert-Torres J, et al.: EMPIAR: a public archive for raw electron microscopy image data. Nat. Methods. 2016; 13(5): 387–388.

[8] Iudin A, Korir PK, Somasundharam S, et al.: EMPIAR: the Electron Microscopy Public Image Archive. Nucleic Acids Res. 2023; 51: D1503–D1511.

[9] Katz WT, Plaza SM: DVID: Distributed Versioned Image-Oriented Dataservice. Front. Neural Circuits. 2019; 13.

[10] Korir PK, Iudin A, Somasundharam S, et al.: bandbox (v0.2.1). Zenodo. 2022.

[11] Li X, Mooney P, Zheng S, et al.: Electron counting and beam-induced motion correction enable near-atomic-resolution single-particle cryo-EM. Nat. Methods. 2013; 10(6): 584–590.

[12] Mastronarde D: Tomographic Reconstruction with the IMOD Software Package. Microsc. Microanal. 2006; 12(S02): 178–179.

[13] Moore J, Allan C, Besson S, et al.: OME-NGFF: a next-generation file format for expanding bioimaging data-access strategies. Nat. Methods. 2021; 18(12): 1496–1498.

[14] Pietzsch T, Saalfeld S, Preibisch S, et al.: BigDataViewer: visualization and processing for large image data sets. Nat. Methods. 2015; 12(6): 481–483.

[15] Punjani A, Rubinstein JL, Fleet DJ, et al.: cryoSPARC: algorithms for rapid unsupervised cryo-EM structure determination. Nat. Methods. 2017; 14(3): 290–296.

[16] Sarkans U, Chiu W, Collinson L, et al.: REMBI: Recommended Metadata for Biological Images—enabling reuse of microscopy data in biology. Nat. Methods. 2021; 18(12): 1418–1422.

[17] Sarkans U, Gostev M, Athar A, et al.: The BioStudies database-one stop shop for all data supporting a life sciences study. Nucleic Acids Res. 2018; 46(D1): D1266–D1270.

[18] Scheres SHW: A Bayesian View on Cryo-EM Structure Determination. J. Mol. Biol. 2012; 415(2): 406–418.

[19] Tang G, Peng L, Baldwin PR, et al.: EMAN2: An extensible image processing suite for electron microscopy. J. Struct. Biol. 2007; 157(1): 38–46.

[20] Wilkinson MD, Dumontier M, Aalbersberg IJJ, et al.: The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data. 2016; 3: 160018.

[21] Zhang K: Gctf: Real-time CTF determination and correction. J. Struct. Biol. 2016; 193(1): 1–12.

Ten recommendations for organising bioimaging data for archival

Abstract

Keywords

Introduction

Motivation

Figure 1. Example of a dataset with organisational red flags identified by bandbox such as use of spaces or non-ASCII characters, redundant directories and so on with an indication of the number of such entities found.

Figure 2. Example of a dataset with no red flags as inferred by bandbox.

Recommendations

Figure 3. Illustration of some of the ways in which subtle features of data organisation impact its usability.

Planning

Structure

Naming

Miscellaneous

Figure 4. Tree representation of the data from Figure 3 reorganised by applying some of the 10 recommendations proposed.

Conclusion

Data availability

Software availability

Acknowledgements

References

