GeNePy3D: a quantitative geometry python toolbox for bioimaging

The advent of large-scale fluorescence and electron microscopy techniques, along with maturing image analysis, is giving life sciences a deluge of geometrical objects in 2D/3D(+t) to deal with. These objects take the form of large-scale, localised, precise, single-cell, quantitative data such as cells’ positions, shapes, trajectories or lineages, axon traces in whole-brain atlases or varied intracellular protein localisations, often in multiple experimental conditions. The data mining of those geometrical objects requires a variety of mathematical and computational tools of diverse accessibility and complexity. Here we present a new Python library for quantitative 3D geometry called GeNePy3D, which helps handle and mine information and knowledge from geometric data, providing a unified application programming interface (API) to methods from several domains including computational geometry, scale space methods or spatial statistics. By framing this library as generically as possible, and by linking it to as many state-of-the-art reference algorithms and projects as needed, we help render those often specialist methods accessible to a larger community. We exemplify the usefulness of the GeNePy3D toolbox by re-analysing a recently published whole-brain zebrafish neuronal atlas, with other applications and examples available online. Along with open-source, documented code and examples, we release reusable containers to allow for convenient and wide usability and increased reproducibility.


Introduction
Bioimage informatics aims at bringing microscopy into quantitative biology, associating higher-level information with pixels to answer complex biological questions. In particular, machine learning-based techniques 1 are easing the image analysis step, extracting geometrical objects from multidimensional images. But the next step, transforming that geometrical information into biological knowledge, involves a very diverse set of algorithmic tools from distinct communities, from spatial statistics 2,3 to computational geometry 4,5 or neuroinformatics 6. Similarly, the software ecosystem around geometrical data analysis is diverse and heterogeneous: reference algorithm implementations are spread across languages (Spatstat 7 for spatial statistics in R, CGAL 8 for computational geometry in C++) or across modules in Python (scipy 9 for generic algorithms, anytree 10 for trees, trimesh 11 for meshes, etc.), there is no generic exchange format for geometric data, and standard graphical tools like Fiji 12 and Icy 13 offer limited flexibility in the analyses that are easily available. To address this problem, we propose GeNePy3D 14,15, a Python library meant as a 'middleware' library to facilitate building data analysis workflows for geometrical objects by providing one convenient API for geometrical data I/O, conversion and interaction between geometrical objects, and access to many common and less common algorithms. Below we introduce the architecture of the library and show one example workflow, re-analysing a published dataset of zebrafish brain neuronal traces by combining traces and brain regions to extract quantitative metrics per region. GeNePy3D 14,15 was designed with any computationally minded life scientist as target user, to provide a simple and homogeneous API.
GeNePy3D consists of four main classes (Figure 1) corresponding to four basic geometrical objects of interest: Points (cell or intracellular object positions...), Curve (particle tracks, neurite branches, microtubules...), Tree (neuronal traces, dividing cell tracks) and Surface (cell surface or other tissue-level structure...). Each of them has its own attributes, functions and I/O. We provide ways to transform between them (decomposing a Tree into sequences of Curve, or converting Points into the Surface that encloses them). Interactions between objects of the same or different classes are also available (optimal transport-based distance between two Points, intersection between Curve and Surface, etc.). Altogether, GeNePy3D offers a unified and seamless way to analyse complex geometrical biological data.
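
To make the notion of an optimal transport-based distance between two sets of points concrete, the following is a minimal sketch using only the Python standard library. It is not GeNePy3D's API: the function names `euclidean` and `ot_distance` are hypothetical, and the brute-force search over permutations is a stand-in, practical only for a handful of points, for a dedicated solver. It assumes two point sets of equal size with uniform weights, in which case an optimal transport plan can be taken to be a one-to-one matching.

```python
from itertools import permutations
from math import sqrt


def euclidean(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def ot_distance(points_a, points_b):
    """Mean displacement under the best one-to-one matching of two point sets.

    Illustrative only (hypothetical helper, not part of GeNePy3D): assumes
    equal-sized sets with uniform weights and tries every matching, so the
    cost grows factorially with the number of points.
    """
    if len(points_a) != len(points_b):
        raise ValueError("this sketch requires point sets of equal size")
    n = len(points_a)
    if n == 0:
        return 0.0
    best = min(
        sum(euclidean(points_a[i], points_b[j]) for i, j in enumerate(perm))
        for perm in permutations(range(n))
    )
    return best / n


# Two toy sets of 'cell positions': the second is the first shifted along z.
cells_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
cells_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 2.0, 1.0)]
print(ot_distance(cells_a, cells_b))
```

Because the distance depends only on the best matching, reordering the points within either set leaves the result unchanged, which is the property that makes this kind of distance suitable for comparing unordered point clouds.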

Implementation
GeNePy3D is implemented in Python, taking advantage of a high-level programming language with simple syntax and many open-source packages. We reused algorithms and functions available from various recognised packages when possible, and developed our own implementations when needed, within a single interface. Most of the packages we link to are available from the Python Package Index (PyPI) and can be easily installed via the Python package manager (pip). Figure 1 lists some functions, with colors denoting the package used. Beyond standard ones, more specific packages include AnyTree for tree manipulation, TriMesh for surface manipulation and ScikitLearn for machine learning tasks. Other features are listed as optional, as they come from harder-to-install or less recognized sources, including the C++ library CGAL, only partially available in Python, for generic object interaction in 3D, or the optimal transport method implemented in PyEMD. Original developments available in GeNePy3D include an algorithm to compute local 3D scale that we recently published 16. Many common input/output formats are supported, including SWC for Tree, CSV and XYZ for Points/Curve, and STL and OFF for Surface. We release the library in two packages for licensing reasons (see licenses below).

Figure 1. GeNePy3D architecture. The library is structured around four main classes for four principal geometrical objects, and proposes various functions acting on them or converting between them, either implemented anew or linking to recognized libraries.

Amendments from Version 1
Small updates to answer reviewer comments include removal of reference to 'large scale', a clarification of the reasoning behind licensing choices and removal of mention of alignment algorithms. We also updated the title, to remove mention of large scale, and updated the affiliation of one of the authors.
Any further responses from the reviewers can be found at the end of the article.

Operation
GeNePy3D works with Python 3.6. Details of the specific software requirements, documentation including installation instructions, and example Python notebooks can be accessed via https://genepy3d.gitlab.io. Example pipelines using GeNePy3D are run using Jupyter notebooks. To ease the use and deployment of GeNePy3D, we provide ready-to-use Docker containers at https://gitlab.com/genepy3d/genepy3d_dockers.

Use case
To exemplify the use of GeNePy3D 14,15, we reanalyzed a recently published dataset containing up to 2000 traced neurons across the whole brain of larval zebrafish 17. The authors annotated 36 symmetric regions and established a connectivity atlas for the neurons within these regions. Figure 2A illustrates a possible workflow using GeNePy3D for reanalyzing the dataset. The inputs consist of neuronal traces in SWC format and a 3D volume in NRRD format containing annotated labels for the 36 brain regions. The traces are imported into GeNePy3D as Tree objects, while the regions are reconstructed into Surface objects using the marching cubes algorithm. Figure 2B top illustrates the outline of the Tectum along with all neuronal traces arriving at this brain region. We then extracted branching point positions from the neuronal traces (Tree→Points), decomposed the traces into sections (Tree→Curves) and checked whether the branching points or curve sections lie within or outside each region (interaction with Surface). Examples of decomposing the traces and computing sections inside and outside the Tectum region are shown in Figure 2B bottom. Finally, we measured within the brain regions neuronal lengths, numbers of branching points, tortuosities (ratio of curve length to the distance between the two end points of the curve), and local 3D scales 16 (scale at which the curve transforms to 3D).
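
The Tree-level steps of this workflow can be sketched with the Python standard library alone. The code below is not GeNePy3D's implementation or API; all function names and the toy SWC content are hypothetical. It assumes the common seven-column SWC layout (id, type, x, y, z, radius, parent id, with -1 marking the root and '#' starting comment lines) and that every parent id refers to a node present in the file. It extracts branching points, splits the tree into sections running between root, branching points and tips, and computes the tortuosity of a section as its path length divided by the distance between its end points. The inside/outside test against a Surface is left out.

```python
import math

# Toy neuron in SWC format (hypothetical data): one root, one bifurcation.
SWC_TEXT = """\
# id type x y z radius parent
1 1 0 0 0 1 -1
2 3 1 1 0 1 1
3 3 2 0 0 1 2
4 3 3 1 0 1 3
5 3 3 -1 0 1 3
"""


def parse_swc(text):
    """Parse SWC text into (coords, parent) dictionaries keyed by node id."""
    coords, parent = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        node_id = int(fields[0])
        coords[node_id] = tuple(float(v) for v in fields[2:5])
        parent[node_id] = int(fields[6])
    return coords, parent


def child_lists(parent):
    """Map each node id to the list of its children."""
    children = {node_id: [] for node_id in parent}
    for node_id, parent_id in parent.items():
        if parent_id != -1:
            children[parent_id].append(node_id)
    return children


def branching_points(parent):
    """Node ids with at least two children (Tree -> Points)."""
    children = child_lists(parent)
    return sorted(n for n, kids in children.items() if len(kids) >= 2)


def sections(parent):
    """Split the tree into paths between root, branching points and tips
    (Tree -> Curves). Each path is a list of node ids, proximal end first."""
    children = child_lists(parent)

    def is_key(node_id):
        # Root, branching point (>= 2 children) or tip (no child).
        return parent[node_id] == -1 or len(children[node_id]) != 1

    curves = []
    for node_id in parent:
        if parent[node_id] == -1 or not is_key(node_id):
            continue
        # Walk towards the root until the previous key node is reached.
        path = [node_id]
        current = parent[node_id]
        while True:
            path.append(current)
            if is_key(current):
                break
            current = parent[current]
        curves.append(path[::-1])
    return curves


def _dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def path_length(coords, path):
    """Sum of segment lengths along a path of node ids."""
    return sum(_dist(coords[a], coords[b]) for a, b in zip(path, path[1:]))


def tortuosity(coords, path):
    """Path length divided by end-to-end distance (inf for a closed path)."""
    chord = _dist(coords[path[0]], coords[path[-1]])
    return path_length(coords, path) / chord if chord > 0 else math.inf


coords, parent = parse_swc(SWC_TEXT)
print("branching points:", branching_points(parent))
for path in sections(parent):
    print(path, round(path_length(coords, path), 3), round(tortuosity(coords, path), 3))
```

Restricting such measurements to a brain region would additionally require deciding, for each node or section, whether it lies inside the corresponding closed surface, which is the Curve/Surface interaction step described above.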
Part of the resulting quantification is shown in Figure 2C. The top graph shows a longer neuronal length on average for groups of neurons arriving at and originating from the regions compared to those passing through. Figure 2C bottom shows a map of the average neuronal length of arriving neurons for each brain region, showing that neurons coming from the fore- and midbrain are much longer than those from the hindbrain. Details of all processing steps and additional quantified results can be found at https://gitlab.com/genepy3d/genepy3d_examples/-/tree/master/zebrafish_atlas.

Conclusions
The advent of machine learning and developments in biological imaging are leading to numerous geometrical datasets, and GeNePy3D 14,15 aims at enabling complex analysis workflows based on those objects. But as in other aspects of bioimage informatics, the key will be for the community to work together and define common formats and structures for regions of interest and geometric objects, to ease the interactions between the various visualisation, data management or analysis tools, and to convert raw images to biological knowledge. GeNePy3D is ready to become a component of that ecosystem.

Archived source code at time of publication:
• GeNePy3D: https://doi.org/10.5281/zenodo.4269466 14.

We wanted to release GeNePy3D under a BSD license but could not avoid the use of some GPL-licensed software, which forced us to adopt the two-package solution. Practical consequences should be minimal in most circumstances thanks to modern Python package management.

Open Peer Review

spatial statistic, and other fields into a unified API. The usefulness of the package is demonstrated by an example of re-analyzing the zebrafish brain region and neuron traces.
The tool should be very useful in gaining insights from the microscopy imaging data. The package is adequately documented, and the Docker file makes it easier to use for many users.

Is the rationale for developing the new software tool clearly explained? Yes
Is the description of the software tool technically sound? Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Yes

dataset of Kunst et al. is well-chosen and convincing. Overall, GeNePy3D is an absolutely relevant software that is likely to be impactful in the bioimage analysis community.

Major comments:
○ The authors mention Fiji and Icy, which are indeed widely used to quantify geometry from bioimages. They however do not spell out clearly how they envision GeNePy3D to interact with these GUI-based (and Java-based!) alternatives. More details on this aspect should be provided.

○ The GitHub repo should be better documented, in particular when it comes to describing the methods used in functions such as curve or surface alignment (hence the "sufficient details of the code" point above flagged as "partly").

○ The article's title emphasizes the "large scale" aspect of GeNePy3D. From the implementation's description, I am under the impression that the scaling ability of this package comes "for free" from the fact that numpy, scipy, pandas, etc. all scale extremely well. If there is more, and specific efforts have been put into developing methods in a specific manner so as to allow processing of large datasets, this aspect should be discussed in more detail in the implementation section. If not, and if this really is simply a consequence of using well-developed Python libraries, I would suggest downplaying a bit the "large scale" aspect of the toolbox.

Minor comments:
○ The package name, GeNePy3D, is poorly chosen for two reasons: 1) the meaning of the acronym is unclear, 2) there exist already at least 3 different Python packages called genepy, all containing entirely unrelated algorithms. I would strongly suggest coming up with a more self-descriptive and less overused name.

○ I am no expert in software licensing, but I worry that the two-licenses solution adopted here may be unnecessarily confusing for the end-user. Wouldn't there be a way to package the entirety of the library under a single BSD3/GPL license, or license it all under the most restrictive of the two if applicable? If having two repos under two licenses really is the only possible solution, some clear explanation of why this is so should be provided in the article.

Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others? Partly
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Yes

packages: genepy3d.readthedocs.io and genepy3d_gpl.readthedocs.io. Specifically for the alignment functions: that aspect is still essentially lacking in the library as it is, mainly because we had no occasion to implement it. There is one 'align' function for curves, which is quite specific and is now better documented. There is a huge bibliography on those topics and many available implementations; it is something that would be very useful to have in the library and that we do plan to tackle in the future. We removed explicit mention of alignment algorithms in the text, to avoid misleading the reader.
○ The article's title emphasizes the "large scale" aspect of GeNePy3D. From the implementation's description, I am under the impression that the scaling ability of this package comes "for free" from the fact that numpy, scipy, pandas, etc. all scale extremely well. If there is more, and specific efforts have been put into developing methods in a specific manner so as to allow processing of large datasets, this aspect should be discussed in more detail in the implementation section. If not, and if this really is simply a consequence of using well-developed Python libraries, I would suggest downplaying a bit the "large scale" aspect of the toolbox.

The original applications were on large scale images, hence the 'large scale' in the title, but it is true that the genepy3d library itself does not provide any specific development for large scale processing. We propose to drop 'large scale' from the title to reflect that point, if that is possible at this stage of the publication process.

Minor comments:
○ The package name, GeNePy3D, is poorly chosen for two reasons: 1) the meaning of the acronym is unclear, 2) there exist already at least 3 different Python packages called genepy, all containing entirely unrelated algorithms. I would strongly suggest coming up with a more self-descriptive and less overused name.

Finding a name for a project is complicated and we agree that, in retrospect, better choices most likely exist. It originally stood for Geometry of Neuron in Python in 3D. It might be clearer when heard (something like 'geeneepaï'). We may indeed change it in the future, if it gains traction and the user and developer base expands; when going out of alpha for example.

○ I am no expert in software licensing, but I worry that the two-licenses solution adopted here may be unnecessarily confusing for the end-user. Wouldn't there be a way to package the entirety of the library under a single BSD3/GPL license, or license it all under the most restrictive of the two if applicable? If having two repos under two licenses really is the only possible solution, some clear explanation of why this is so should be provided in the article.

We would have welcomed another solution but we do not know of a clean, legal way to mix, in a project under BSD, both GPL and BSD code. Since we want the bulk of the library to be under a BSD license for compatibility with the rest of the Python ecosystem (see also this argument for BSD in scientific code: https://www.astrobetter.com/blog/2014/03/10/the-whys-and-hows-of-licensing-scientific-code/), this means corralling out GPL bits. We will try to make them as small as possible, and the added complexity is mitigated by modern Python package management, which makes it trivial to install additional packages. We added a sentence to explain this reasoning in the text.
Competing Interests: No competing interests were disclosed.