SCUBA implements a storage format-agnostic API for single-cell data access in R

William M. Showers; Jairav Desai; Krysta L. Engel; Clayton Smith; Craig T. Jordan; Austin E. Gillen

doi:10.12688/f1000research.154675.1

Software Tool Article


[version 1; peer review: 1 approved, 1 approved with reservations, 1 not approved]

William M. Showers^1,2, Jairav Desai², Krysta L. Engel^1,2, Clayton Smith^1,2, Craig T. Jordan¹, Austin E. Gillen^1,3


PUBLISHED 21 Oct 2024


¹ Division of Hematology, University of Colorado Anschutz Medical Campus School of Medicine, Aurora, Colorado, USA
² RefinedScience, Aurora, Colorado, USA
³ Rocky Mountain Regional VA Medical Center, Aurora, Colorado, USA

William M. Showers
Roles: Conceptualization, Data Curation, Software, Writing – Original Draft Preparation, Writing – Review & Editing

Jairav Desai
Roles: Software, Writing – Review & Editing

Krysta L. Engel
Roles: Writing – Review & Editing

Clayton Smith
Roles: Funding Acquisition

Craig T. Jordan
Roles: Funding Acquisition

Austin E. Gillen
Roles: Conceptualization, Funding Acquisition, Software, Writing – Review & Editing


This article is included in the RPackage gateway.

Abstract

While robust tools exist for the analysis of single-cell datasets in both Python and R, interoperability is limited, and analysis tools generally only accept one object class. Considerable programming expertise is required to integrate tools across package ecosystems into a comprehensive analysis, due to their differing languages and internal data structures. This complicates validation of results and leads to inconsistent visualizations between analysis suites. Conversion between object formats is the most common solution, but this is difficult and error-prone due to the rapid pace of development of the analysis suites and their underlying data structures. To address this, we created SCUBA (Single-Cell Unified Backend API), an R package that implements a unified data access API for all common R and Python single-cell object formats. SCUBA extends the data access approach from the widely used Seurat package to SingleCellExperiment and anndata objects. SCUBA also implements new data-specific access functions for all supported object types. Performance scales well across all SCUBA-supported formats. In addition to performance, SCUBA offers several advantages over object conversion for the visualization and further analysis of pre-processed single-cell data. First, SCUBA extracts only data required for the operation at hand, leaving the original object unmodified. This process is simpler, less error-prone, and less memory-intensive than object conversion, which operates on the entire dataset. Second, code written with SCUBA can use any supported object class as input, with simple and consistent syntax across object formats. This allows a single analysis script or package (like our interactive single-cell browser, scExploreR) to work seamlessly with multiple object types, reducing the complexity of the code and improving both readability and reproducibility. Adoption of SCUBA will ultimately improve collaboration and reproducible research in single-cell analysis by lowering the barriers between package ecosystems.

Keywords

single-cell sequencing, multimodal, software tools, R package, Python, visualization

Corresponding author: Austin E. Gillen

Competing interests: No competing interests were disclosed.

Grant information: This work received support from US VA IK2BX004952-01A1 to AEG and US NIH R35CA242376 to CTJ.

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2024 Showers WM et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Showers WM, Desai J, Engel KL et al. SCUBA implements a storage format-agnostic API for single-cell data access in R [version 1; peer review: 1 approved, 1 approved with reservations, 1 not approved]. F1000Research 2024, 13:1256 (https://doi.org/10.12688/f1000research.154675.1) First published: 21 Oct 2024, 13:1256 (https://doi.org/10.12688/f1000research.154675.1) Latest published: 02 Jun 2025, 13:1256 (https://doi.org/10.12688/f1000research.154675.2)

Introduction

The rapidly evolving landscape of single-cell sequencing methods has led to the production of increasingly large and diverse single-cell datasets, greatly improving our knowledge of both inter- and intra-patient heterogeneity in a wide range of diseases and normal tissues.^1,2 While there are many excellent tools available for analyzing single-cell datasets in both the Python and R ecosystems, interoperability is hindered by the use of incompatible object classes (Figure 1A). Analysis tools generally only accept one object class, forcing users to commit to a suite of packages at the beginning of an analysis. This restriction limits access to tools outside that suite, creating “walled gardens” that pose several challenges for single-cell analysis. Single-cell analysis can be inaccessible to bench scientists due to the programming experience required, and different object formats make it even more difficult for biologists to analyze data. Because popular object formats are implemented in both Python and R, users must be fluent in both programming languages, and it is difficult and time-consuming to learn both without formal education in the data structures and syntax of each language. The widespread use of multiple object formats also makes it difficult to validate results produced by one analysis suite with another, and visualizations produced with different analysis suites are not consistent. Additionally, the practical analysis of objects with large numbers of cells requires a way to interact with on-disk matrices rather than loading all data in memory. On-disk matrix implementations are analysis suite-specific, which introduces additional barriers to effective analysis. For example, anndata objects are natively stored in the memory-efficient HDF5 format, but anndata objects are not compatible with Bioconductor’s single-cell tools or Seurat. If a user converts an anndata object to Seurat or SingleCellExperiment format, they must use a different on-disk matrix implementation specific to that format, which further restricts the analyses that can be performed. If single-cell object formats were interoperable, it would be easy for researchers to analyze data from any single-cell dataset, regardless of the object format used when the dataset was generated, but unfortunately this is not the case.

Figure 1. SCUBA addresses challenges posed by multiple object formats in single-cell sequencing data.

A) Raw single-cell sequencing data is stored in defined object classes, and processed downstream by packages that only accept one object class. This creates “walled garden” analysis suites of incompatible packages that complicate single-cell analysis. When a specific downstream package is desired, the user will need to convert between object formats prior to use. This is possible, but the process is difficult and error-prone. B) SCUBA returns feature expression data, metadata, and reduction coordinates from Seurat, SingleCellExperiment, and anndata objects in a consistent output format. An overview of each object structure is shown, with rectangles indicating data matrices stored in each object. Dimensions of the matrices are labeled with “cells” or “genes” (features), and matrices placed adjacent to one another indicate requirements that matrices have the same number of values in the dimension indicated (i.e. for Seurat objects, the “reduction coordinates” and “gene expression” matrices must have the same number of cells, but may have varying numbers of genes or, in the case of reduction coordinates, dimensions). Next to the description of each matrix, object-specific code to retrieve the matrix is given. If additional modalities are supported by an object type, the structure of matrices specific to those modalities is shown, along with code to retrieve data on alternate modalities. The output format for SCUBA is shown at the bottom of the panel. The output is a single R data.frame with values for each variable requested for each cell. The S3 methods added by SCUBA to yield the output format are shown in blue, and are based on the existing FetchData generic and method from Seurat.

Currently, the most effective solution is to manually convert between object classes. All major single-cell analysis packages implement conversion functions, and third-party packages such as sceasy³ are specifically designed for these conversions. However, inconsistencies in approaches to object structure across implementations often result in data loss upon conversion, which is difficult to overcome. Additionally, the rapid development of these packages means that conversion functions are difficult to maintain and frequently break due to changes in other packages. This is especially true when converting to and from the anndata format, since this format is implemented in the Python programming language, and the Seurat and SingleCellExperiment objects are implemented in the R programming language. Even if conversion is successfully achieved without loss of data quality, it has recently been demonstrated that results from Seurat differ from those of Scanpy,⁴ despite the fact the two packages implement ostensibly identical processing steps. Addressing interoperability and consistency issues between analysis suites is crucial to ensuring the fidelity and reproducibility of single-cell analysis results, making consistent visualizations across suites essential.

Rather than conversion between objects, we propose a more sustainable approach to the visualization and further analysis of pre-processed single-cell data by implementing a unified API for all common single-cell object formats. Here, we present Single-Cell Unified Backend API (SCUBA), an R package based on the data accession function in the widely used Seurat⁵ package that returns data from Seurat, SingleCellExperiment, and anndata objects in a common format for downstream visualization and analysis (Figure 1B). SCUBA also implements new data-specific access functions for all supported object types. Data is returned in a single R data.frame, with requested variables as columns, and cells as rows. The functions in this package allow users to plot data in a consistent manner from these object types in R, without requiring conversion. SCUBA can also be used in functional programming applications as the basis for single-cell plotting packages, or in the development of Shiny apps. For objects with very large numbers of cells, it is now possible to choose the object class based on on-disk storage performance and produce visually consistent plots without having to downsample the object. Packages and scripts created with SCUBA are flexible with regard to input type, greatly improving the consistency of results between objects and increasing accessibility of these analyses for non-programmers.

Methods

Implementation

SCUBA provides a unified framework for data access by leveraging R’s S3 object-oriented programming.⁶ The workflow for data access in SCUBA is based on Seurat’s FetchData method. We implemented a new generic that uses the Seurat FetchData method for Seurat objects, and novel methods for SingleCellExperiment and anndata objects. The generic was chosen as a basis due to its ease of use, and its implementation in Seurat’s plotting functions, which are widely used.

S3 methods were created in SCUBA to extend the existing FetchData generic to SingleCellExperiment and anndata objects. Access to anndata objects is accomplished using reticulate⁷, performing as many operations in Python as possible before returning data to R. The workflow from the existing method for Seurat objects is largely unchanged upon re-implementation for these object formats. We used code from the Seurat package under the terms of the package’s MIT license.

In addition to replicating the behavior of FetchData in SingleCellExperiment and anndata objects, SCUBA includes S3 generics and methods specific to the retrieval of metadata and reduction coordinates from each object format. These methods offer improvements in performance relative to retrieving the same data via FetchData for large objects.
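The S3 dispatch pattern described above can be illustrated with a minimal base-R sketch. It does not use Seurat, SingleCellExperiment, anndata, or SCUBA itself: the classes `toy_seurat` and `toy_sce`, the generic `fetch_data`, and the helper `assemble` are hypothetical stand-ins chosen only to show how one generic can return the same data.frame from containers with different internal layouts.

```r
# Toy stand-ins for single-cell containers (hypothetical, NOT the real classes).
# Each stores the same data under different slot names.
new_toy_seurat <- function(expr, meta) {
  structure(list(assay = expr, meta.data = meta), class = "toy_seurat")
}
new_toy_sce <- function(expr, meta) {
  structure(list(counts = expr, colData = meta), class = "toy_sce")
}

# Generic: dispatches on the class of `object`.
fetch_data <- function(object, vars, ...) {
  UseMethod("fetch_data")
}

# Shared helper: build a cells-by-variables data.frame.
# `expr` is a features-by-cells matrix, `meta` a per-cell data.frame.
assemble <- function(expr, meta, vars) {
  cells <- colnames(expr)
  cols <- lapply(vars, function(v) {
    if (v %in% colnames(meta)) {
      meta[cells, v]
    } else if (v %in% rownames(expr)) {
      unname(expr[v, cells])
    } else {
      stop("Variable not found: ", v)
    }
  })
  names(cols) <- vars
  data.frame(cols, row.names = cells, check.names = FALSE)
}

# One method per class; each knows where its own class keeps the data.
fetch_data.toy_seurat <- function(object, vars, ...) {
  assemble(object$assay, object$meta.data, vars)
}
fetch_data.toy_sce <- function(object, vars, ...) {
  assemble(object$counts, object$colData, vars)
}

# Example: the same call works on both toy classes.
expr <- matrix(
  c(1, 0, 3, 2, 5, 0), nrow = 2,
  dimnames = list(c("GeneA", "GeneB"), c("cell1", "cell2", "cell3"))
)
meta <- data.frame(cell_type = c("T", "B", "T"), row.names = colnames(expr))

res_seurat <- fetch_data(new_toy_seurat(expr, meta), vars = c("cell_type", "GeneA"))
res_sce <- fetch_data(new_toy_sce(expr, meta), vars = c("cell_type", "GeneA"))
```

Because the format-specific logic lives in the methods, downstream code only ever sees the common data.frame, which is the property the text attributes to SCUBA's design.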

Operation

SCUBA can be installed as an R package via GitHub using the devtools⁸ R package, and can be used on all common operating systems. To ensure compatibility across operating systems, SCUBA is maintained using Continuous Integration (CI), with GitHub Actions and the testthat⁹ R package. The GitHub Actions workflow performs 100+ tests on recent Linux R and Python versions whenever a pull request is created. Tests are additionally run on macOS and Windows for releases. The dataset used for testing is a downsampled version of the acute myeloid leukemia reference dataset¹⁰ from Triana et al. 2021.¹¹

If using SCUBA with Seurat or SingleCellExperiment objects, no further installation is necessary beyond the R dependencies. For anndata objects, the reticulate⁷ R package and a Python installation are required. The following Python packages must be manually installed: pandas,¹² numpy,¹³ scipy,¹⁴ and anndata.¹⁵ We recommend installing these packages in an anaconda¹⁶ environment and loading the environment in R with reticulate::use_condaenv(), but this is not required. Detailed installation instructions are available on the SCUBA GitHub page.
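The installation steps above might look roughly like the following sketch. The repository path is the one listed under "Data and software availability"; the environment name `scuba_env` and the two wrapper functions are hypothetical, and the SCUBA GitHub page remains the authoritative source for installation instructions. The steps are wrapped in functions so that nothing is installed until they are called.

```r
# Install SCUBA from GitHub with devtools.
# "amc-heme/scuba" is the repository listed in the availability section.
install_scuba <- function() {
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  devtools::install_github("amc-heme/scuba")
}

# Only needed for anndata objects: point reticulate at a conda environment
# that already contains pandas, numpy, scipy, and anndata.
# "scuba_env" is a hypothetical environment name.
use_scuba_python <- function(env = "scuba_env") {
  reticulate::use_condaenv(env, required = TRUE)
}
```

A session working only with Seurat or SingleCellExperiment objects would call `install_scuba()` once and skip `use_scuba_python()` entirely.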

Use cases

The features of SCUBA fall broadly into three categories: data access, data visualization, and data exploration. The functions provided in these categories can be used independently or in a stepwise pipeline. Generally speaking, SCUBA works best for objects that have been filtered and clustered, though SCUBA can work on objects in any state as long as the data being requested exists. Here we highlight independent use cases using a downsampled version of the acute myeloid leukemia reference dataset¹⁰ generated by Triana et al.¹¹ Additional vignettes are provided on the SCUBA GitHub page.

FetchData Methods for SingleCellExperiment and anndata Objects

Example usage of SCUBA’s FetchData methods is given in Figure 2. The existing Seurat method (first column) is compared to the methods added by SCUBA. There are only minor variations in input syntax across the three supported object types, and the required parameters are few, making the methods easy to use. All data requested is specified using the vars parameter. The methods infer whether the data requested is metadata, reduction coordinates, or feature expression by parsing the character vector passed to this parameter. To retrieve feature expression or reduction coordinates, the user prefixes the variable with a “key” followed by an underscore; the key gives the name of the reduction, or the modality to pull feature expression data from. If using an object with only one modality, the modality key is not needed, and the key is also not needed to retrieve metadata. Minor variations in the key exist between object types, due to object-specific conventions for naming modalities (which are called “assays” in Seurat objects, and “Experiments” in SingleCellExperiment objects). Variations in the layer parameter are based on variations in conventions for naming layers (in SingleCellExperiment objects, “assays”, and in Seurat v4 and earlier, “slots”). The consistency in parameters between the three object types, and the presence of only minor differences in inputs to each parameter, facilitates the writing of scripts for any object type.

Figure 2. The methods added by SCUBA simplify the retrieval of data from supported object classes.

The existing Seurat method (first column) is compared to the methods added by SCUBA for SingleCellExperiment and anndata objects (second and third columns). The methods use consistent syntax across object classes, and involve the use of only a few parameters. Pseudocode is used in the examples. object represents a single-cell object. features represents one or more features, from any modality in the object. metadata represents one or more metadata variables, for example, cell type classifications. reduction_dims represents a set of dimensions in a reduction included with the object, with the number of the dimension separated from the reduction with an underscore. For example, to fetch the first and second dimensions of the UMAP projection, reduction_dims would be c("UMAP_1", "UMAP_2").

The output of FetchData is identical across the three object classes. The output is an R data.frame with values for each requested feature in vars per cell. Columns represent each feature, and rows represent cells.
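To make the key convention concrete, the sketch below classifies entries of a vars vector the way the text describes: names matching metadata columns are treated as metadata, a `key_` prefix matching a reduction or modality selects that source, and anything else falls back to the default modality. This is a base-R illustration only, not SCUBA's actual parsing code; the function `classify_vars` and the example keys (`rna`, `protein`) are hypothetical.

```r
# Classify each requested variable by its key prefix (illustrative only).
classify_vars <- function(vars, metadata_names, reduction_keys, modality_keys) {
  vapply(vars, function(v) {
    # Metadata is matched by name first; no key is needed.
    if (v %in% metadata_names) {
      return("metadata")
    }
    has_key <- grepl("_", v, fixed = TRUE)
    key <- sub("_.*$", "", v)  # text before the first underscore
    if (has_key && key %in% reduction_keys) {
      return("reduction")
    }
    if (has_key && key %in% modality_keys) {
      return("feature")
    }
    # No recognised key: assume a feature from the default modality.
    "feature_default"
  }, character(1))
}

kinds <- classify_vars(
  vars = c("cell_type", "UMAP_1", "rna_GAPDH", "CD34"),
  metadata_names = c("cell_type", "sample"),
  reduction_keys = c("UMAP", "PCA"),
  modality_keys = c("rna", "protein")
)
```

Note that `cell_type` contains an underscore but is still resolved as metadata, because metadata names are checked before any key is parsed; the order of these checks is a design assumption of this sketch.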

Metadata, reduction-specific accession methods

SCUBA also includes S3 generics and methods specific to the retrieval of metadata and reduction coordinates, which are faster than retrieving the same data via FetchData for R object types. An overview of the fetch_metadata and fetch_reduction functions is given in Figure 3A. As with FetchData, the output of fetch_metadata and fetch_reduction is an R data.frame with data for the requested metadata variables or reduction coordinates, respectively, as columns, and rows for each cell. Figure 3B compares the usage of fetch_reduction and fetch_metadata between supported object types. We implement these methods in anndata objects for consistency in syntax, but their performance is roughly equivalent to FetchData. The functions are easy to use, and the inputs to each function do not vary based on object type. To set defaults for the reduction and cells parameters of fetch_reduction, SCUBA provides several utility methods. default_reduction will search for UMAP, t-SNE, and PCA reductions, and will return them in that order if they exist. get_all_cells will return the IDs of all cells in the object.
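The default-selection behaviour described for default_reduction can be sketched in base R as a priority search. The functions `pick_default_reduction` and `toy_fetch_reduction`, and the use of a plain named list of coordinate matrices in place of a single-cell object, are assumptions made for illustration; the reduction names stored in real objects may differ.

```r
# Return the first available reduction in UMAP > t-SNE > PCA order.
# `available` is a character vector of reduction names in the object.
pick_default_reduction <- function(available,
                                   priority = c("umap", "tsne", "pca")) {
  hits <- priority[priority %in% tolower(available)]
  if (length(hits) == 0) {
    stop("No UMAP, t-SNE, or PCA reduction found")
  }
  # Return the name exactly as stored in the object.
  available[match(hits[1], tolower(available))]
}

# Toy equivalent of a reduction fetch: cells as rows, dimensions as columns.
toy_fetch_reduction <- function(reductions,
                                reduction = pick_default_reduction(names(reductions)),
                                cells = rownames(reductions[[reduction]]),
                                dims = c(1, 2)) {
  coords <- reductions[[reduction]][cells, dims, drop = FALSE]
  as.data.frame(coords)
}

cell_ids <- c("c1", "c2", "c3")
reductions <- list(
  pca = matrix(1:12, nrow = 3,
               dimnames = list(cell_ids, paste0("PC_", 1:4))),
  umap = matrix(c(0.1, 0.2, 0.3, 1.1, 1.2, 1.3), nrow = 3,
                dimnames = list(cell_ids, c("UMAP_1", "UMAP_2")))
)

# UMAP is chosen by default even though PCA is listed first.
umap_subset <- toy_fetch_reduction(reductions, cells = c("c1", "c3"))
```

Supplying defaults for `reduction` and `cells` through helper functions keeps the common call short while still allowing both to be overridden.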

Figure 3. SCUBA methods specific to the retrieval of metadata and reduction coordinates.

A) Overview of outputs of fetch_metadata, for metadata variables, and fetch_reduction, for reduction coordinates. The output is an R data.frame with the metadata or reduction coordinates as columns, and the cells as rows. B) Comparison of the usage of fetch_metadata and fetch_reduction across each object type. For fetch_metadata, the metadata variable or variables to retrieve (which are represented as metadata in this pseudocode example) are specified via a character vector input to vars. For fetch_reduction, the dimensions to return from the reduction coordinate matrix are passed to dims, and the reduction to pull from is specified via reduction. The cells parameter allows the user to specify which cells to fetch reduction coordinates for. For ease of use and flexibility, there is no difference in inputs between object types; only the object itself varies.

Figure 4A-B compares the performance of fetch_metadata and fetch_reduction with the performance of FetchData to pull one metadata variable, and the first and second dimensions of UMAP coordinates, respectively. Run time was tested for each function on random subsets of varying numbers of cells, with five subsets created for each size. The fetch_metadata and fetch_reduction methods were faster than FetchData in Seurat and SingleCellExperiment objects for all subsets tested. In anndata objects, the run time of these functions was comparable to that of FetchData. Performance testing for the FetchData methods added by SCUBA was also performed (Figure 4C). The method for anndata objects was faster than the existing Seurat method in most cases, and the SingleCellExperiment method was faster than the Seurat method for the largest subset tested (500k cells).

Figure 4. Performance testing of SCUBA functions and methods.

Five random subsets of the indicated numbers of cells were created from the Human Brain Cell Atlas object downloaded from CELLxGENE.²¹ The subsets were saved in the following object formats: Seurat, via saveRDS(), SingleCellExperiment, via HDF5Array::saveHDF5SummarizedExperiment(), and anndata, via write_h5ad. For all tests, the indicated operations were run on each of the five subsets, and the run time was measured using the tictoc²² package. A) Comparison of FetchData methods vs. fetch_metadata for the retrieval of data for a single metadata variable. In most cases, using fetch_metadata to pull metadata was faster than using FetchData. B) Comparison of FetchData methods vs. fetch_reduction for the retrieval of data for a pair of reduction coordinates. fetch_reduction was faster than FetchData for the retrieval of reduction coordinates in most cases. C) Performance of the FetchData methods developed for SingleCellExperiment and anndata objects, compared to the existing FetchData method for Seurat objects. A single feature was pulled via FetchData for each of the random subsets for the indicated object and number of cells.
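The benchmarking design described above (five random subsets per size, one timed operation per subset) can be outlined as follows. This is a hedged sketch rather than the authors' benchmarking script, which is available in the manuscript repository: it uses base R's `system.time()` in place of tictoc, a plain data.frame in place of a single-cell object, and the hypothetical helper `time_on_subsets`.

```r
set.seed(1)

# Time `fun` on `n_rep` random row subsets for each requested size.
# Only the call to `fun` is timed, not the creation of the subset.
time_on_subsets <- function(data, sizes, fun, n_rep = 5) {
  results <- expand.grid(rep = seq_len(n_rep), n_cells = sizes)
  results$seconds <- NA_real_
  for (i in seq_len(nrow(results))) {
    idx <- sample(nrow(data), results$n_cells[i])
    subset_i <- data[idx, , drop = FALSE]
    results$seconds[i] <- system.time(fun(subset_i))[["elapsed"]]
  }
  results
}

# Toy "object": one metadata column for 5000 simulated cells.
toy <- data.frame(
  cell_type = sample(c("T", "B", "NK"), 5000, replace = TRUE)
)

timings <- time_on_subsets(
  toy,
  sizes = c(100, 1000, 5000),
  fun = function(d) unique(d$cell_type)
)

# Median run time per subset size.
summary_by_size <- aggregate(seconds ~ n_cells, data = timings, FUN = median)
```

Repeating each size on independent random subsets, as in the figure, separates the effect of dataset size from run-to-run variation.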

Example scripts created with SCUBA

Figure 5 gives an example usage of SCUBA methods to create plots with consistent visuals across object types. Figure 5A shows the scripts to create a density plot showing expression by cell type from each of the three supported object types, highlighting the regions of the script that vary between object types and the regions that are conserved. Figure 5B shows the output of the example script. The script demonstrates the ease with which expression data can be visualized from each object format, and the ease with which plot visuals can be harmonized across object formats.

Figure 5. SCUBA enables flexible plotting scripts harmonized across object types.

A) Example script for visualizing expression of a gene by cluster in a density plot for each of the three supported object types. The three boxes for FetchData indicate slight variations in the script for each object type. All downstream code is the same across object formats. B) Output of the plotting script in (A). Output does not vary by object type.

Any plot visualizing a combination of expression data, metadata, and reduction coordinates can be created by generating a table from FetchData, fetch_metadata, or fetch_reduction, and passing the output table to downstream plotting code. Plotting is performed via ggplot2¹⁷ in this example, but any other plotting package that accepts a data.frame or a tibble as input may be used. If desired, it is also possible to convert to a pandas¹² dataframe via reticulate,⁷ and perform plotting operations in Python. The flexibility of SCUBA’s data access methods facilitates the creation of a broad variety of plots from single-cell data.
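The idea that plotting code only needs the output table, not the original object, can be shown with a small sketch. It assumes a data.frame shaped like the output described above (cells as rows, one expression column and one metadata column) and uses base R graphics instead of ggplot2 so that it has no dependencies; the function `plot_expression_density` and the simulated data are hypothetical.

```r
# Draw one expression density curve per group from a cells-by-variables table.
plot_expression_density <- function(df, feature, group) {
  groups <- split(df[[feature]], df[[group]])
  dens <- lapply(groups, density)
  xlim <- range(unlist(lapply(dens, function(d) d$x)))
  ylim <- c(0, max(unlist(lapply(dens, function(d) d$y))))
  # Empty canvas, then one line per group.
  plot(0, 0, type = "n", xlim = xlim, ylim = ylim,
       xlab = feature, ylab = "Density",
       main = paste(feature, "by", group))
  for (i in seq_along(dens)) {
    lines(dens[[i]], col = i)
  }
  legend("topright", legend = names(dens), col = seq_along(dens), lty = 1)
  invisible(dens)
}

# Simulated table standing in for the data access output.
set.seed(42)
toy <- data.frame(
  GeneA = c(rnorm(50, mean = 1), rnorm(50, mean = 3)),
  cell_type = rep(c("B", "T"), each = 50)
)

# Off-screen device so the sketch also runs in non-interactive sessions.
grDevices::pdf(NULL)
dens <- plot_expression_density(toy, feature = "GeneA", group = "cell_type")
invisible(grDevices::dev.off())
```

Since the function never inspects the source object, the same plotting code applies to a table retrieved from any supported format.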

Figure 6 shows an example script that simplifies the printing of unique values of a metadata variable represented in an object, which is a commonly used basic operation in analysis. With SCUBA, this operation can simply be performed by calling fetch_metadata on the object and piping the results to unique(). The language used is the same for all supported object classes, which negates the need to memorize and use the most efficient function calls for each respective object type.

Figure 6. SCUBA simplifies common object exploration operations.

This figure compares the usage of SCUBA with the most efficient equivalent operations for viewing the unique values of a metadata variable represented in an object. The operation with SCUBA is shown in the first column, and the most efficient equivalents are shown in the second column. SCUBA simplifies this operation, allowing for the development of scripts that are generalized for multiple object types.

Conclusions

SCUBA addresses issues with interoperability between single-cell object formats by providing a flexible backend that returns data in a consistent format, via a consistent interface. The consistent output format of SCUBA facilitates downstream use in functional programming applications (plotting scripts, packages, etc.) and allows for consistent visualizations across object types. Packages and scripts using SCUBA will not require object conversion prior to use, conferring several advantages for end users. Users will not have to risk data loss upon object conversion, and analysis will be more straightforward without conversion, requiring less programming experience. Packages made with SCUBA will also allow users to choose object classes based on storage and performance characteristics that are best for the specific dataset, rather than being constrained to a class based on downstream packages. SCUBA does not allow users to use any analysis package with any object format, however. The aforementioned benefits only apply to packages and scripts created using SCUBA. SCUBA also only performs data access operations, and is not for object assembly, clustering, or filtering. Because of this, SCUBA is not a replacement for analysis packages such as Seurat and Scanpy. Instead, SCUBA allows users to visualize objects that have been prepared with these analysis packages in the same manner, regardless of object class.

Support for MuData^18,19 will be added in the future, as this Python object class is especially useful for storing data from multimodal single-cell sequencing experiments. SCUBA is particularly well suited for interactive use, such as in Shiny apps, where multiple object formats may be used as inputs. We developed a single-cell browser, scExploreR,²⁰ that allows users to create consistent Seurat-style visualizations from Seurat, SingleCellExperiment, or anndata objects. SCUBA can also be used to create a plotting package that produces visuals from any supported object class for reports and Shiny apps. A QC package reporting the results of preprocessing steps such as filtering, clustering, and batch correction could also be created using SCUBA. The flexibility of SCUBA is envisioned to facilitate analysis and visualization of preprocessed data, unifying disparate object-based package ecosystems.

Ethics and consent

Ethical approval and consent were not required.

Data and software availability

SCUBA uses two third-party datasets for performance benchmarking, testing, and demonstration in the manuscript. The datasets are described below.

Figshare: Expression of 197 surface markers and 462 mRNAs in 15281 cells from blood and bone marrow from a young healthy donor. https://doi.org/10.6084/m9.figshare.13398065.v4.¹⁰

This project contains the following underlying data:

• 200AB_projected.rds. (Seurat object with 15281 cells, showing the expression of 197 surface markers and 462 mRNAs in blood and bone marrow from a young healthy donor).

The dataset is available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

CELLxGENE: Human Brain Cell Atlas v1.0. https://cellxgene.cziscience.com/collections/283d65eb-dd53-496d-adb7-7570c7caa443.

This project contains the following underlying data:

• cc9bfb86-96ed-4ecd-bcc9-464120fc8628.rds. (Seurat object with 800k non-neuronal cells used for performance benchmarking in the manuscript. The file is accessed by selecting “All non-neuronal cells” and then the .rds radio button).

The dataset is available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

The Velten et al. dataset¹⁰ was processed to yield a format suitable for testing and demonstration of SCUBA, downsampled, and stored in the inst/extdata/ and data/ directories of the SCUBA repo. Scripts used in these operations and performance benchmarking are available at the manuscript GitHub repo: https://github.com/amc-heme/SCUBA_Manuscript. Working examples of code shown in Figures 2, 3, 5, and 6 are also stored in this repo.

Software, up to date source code, and tutorials are available from: https://github.com/amc-heme/scuba

Archived source code at time of publication: https://zenodo.org/doi/10.5281/zenodo.13776167

License: MIT

Acknowledgements

The authors would like to acknowledge Monica Ransom, Sarah E. Staggs, Stephanie R. Gipson, Abbigayl Burtis, and Devin Burke for their thoughtful comments and suggestions during the development of this package and the writing of this manuscript.

References

1. Schäfer PSL, Dimitrov D, Villablanca EJ, et al.: Integrating single-cell multi-omics and prior biological knowledge for a functional characterization of the immune system. Nat. Immunol. 2024; 25: 405–417.
2. Zeng AGX, et al.: A cellular hierarchy framework for understanding heterogeneity and predicting drug response in acute myeloid leukemia. Nat. Med. 2022; 28: 1212–1223.
3. Kiselev V, Huang N: sceasy. 2022.
4. Wolf FA, Angerer P, Theis FJ: SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018; 19: 15.
5. Hao Y, et al.: Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat. Biotechnol. 2024; 42: 293–304.
6. Wickham H: S3. Advanced R. Chapman and Hall/CRC; 2019.
7. Ushey K, Allaire J, Tang Y: reticulate: Interface to ‘Python’. 2023.
8. Wickham H, Hester J, Chang W, et al.: devtools: Tools to Make Developing R Packages Easier. 2022.
9. Wickham H: testthat: Get Started with Testing. The R Journal. 2011; 3: 5.
10. Velten L, Triana S, Haas S, et al.: Expression of 197 surface markers and 462 mRNAs in 15281 cells from blood and bone marrow from a young healthy donor. [Dataset]. Figshare. 2021.
11. Triana S, et al.: Single-cell proteo-genomic reference maps of the hematopoietic system enable the purification and massive profiling of precisely defined cell states. Nat. Immunol. 2021; 22: 1577–1589.
12. The pandas development team: Pandas. 2023.
13. Harris CR, et al.: Array programming with NumPy. Nature. 2020; 585: 357–362.
14. Virtanen P, et al.: Author Correction: SciPy 1.0: fundamental algorithms for scientific computing in Python (Nature Methods, (2020), 10.1038/s41592-019-0686-2). Nat. Methods. 2020; 17: 352–352.
15. Virshup I, Rybakov S, Theis FJ, et al.: anndata: Annotated data. 2021. 2021.12.16.473007.
16. Conda contributors: Conda: A system-level, binary package and environment manager running on all major operating systems and platforms. 2024.
17. Wickham H: Ggplot2: Elegant Graphics for Data Analysis. Switzerland: Springer; 2016.
18. Bredikhin D, Kats I, Stegle O: MUON: multimodal omics analysis framework. Genome Biol. 2022; 23: 42.
19. Virshup I, et al.: The scverse project provides a computational ecosystem for single-cell omics data analysis. Nat. Biotechnol. 2023; 41: 604–606.
20. Showers W, Desai J, Gipson S, et al.: scExploreR: a Flexible Shiny App for Democratized Analysis of Multimodal single-cell RNA-seq Data. 2024.
21. Siletti K, et al.: Human Brain Cell Atlas v1.0. [Dataset]. CELLxGENE. 2023.
22. Izrailev S: tictoc: Functions for Timing R Scripts, as Well as Implementations of ‘Stack’ and ‘StackList’ Structures. 2023.

Competing interests

No competing interests were disclosed.

Grant information

This work received support from US VA IK2BX004952-01A1 to AEG and US NIH R35CA242376 to CTJ.

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Article Versions (2)

version 2

Revised

Published: 02 Jun 2025, 13:1256

https://doi.org/10.12688/f1000research.154675.2

version 1

Published: 21 Oct 2024, 13:1256

https://doi.org/10.12688/f1000research.154675.1

© 2024 Showers WM et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

how to cite this article

Showers WM, Desai J, Engel KL et al. SCUBA implements a storage format-agnostic API for single-cell data access in R [version 1; peer review: 1 approved, 1 approved with reservations, 1 not approved]. F1000Research 2024, 13:1256 (https://doi.org/10.12688/f1000research.154675.1)

NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Key to Reviewer Statuses

Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.

Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.

Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.

Version 1

Reviewer Report 29 Jan 2025

Damian Panas, International Centre for Translational Eye Research (ICTER), Institute of Physical Chemistry, Polish Academy of Sciences, Warsaw, Mazowieckie, Poland

Marcin Tabaka, International Centre for Translational Eye Research (ICTER), Institute of Physical Chemistry, Polish Academy of Sciences, Warsaw, Poland

Not Approved

https://doi.org/10.5256/f1000research.169727.r350233

In this study, William M. Showers and colleagues developed SCUBA, an R package designed to facilitate access to single-cell data stored in various formats commonly utilized by single-cell data analysis software in R (Seurat, Scran) or Python (Scanpy). These single-cell data storage formats include SeuratObject (Seurat), SingleCellExperiment (Scran, Bioconductor’s single-cell tools), and Anndata (Scanpy). Interoperability between R-based and Python-based tools and data types is a real problem in bioinformatics. Several tools are typically used in this case, such as zellkonverter, anndataR, sceasy, or SeuratDisk. However, these tools rely on data type conversion, which can be computationally expensive. SCUBA implements a unified API for single-cell object formats to efficiently access data for exploratory analysis and visualization. Using FetchData methods, it extracts from storage objects a table with requested features, metadata, and reduction coordinates for each cell. The manuscript is clearly written, with a logical structure that effectively conveys the study's objective, and is organized in the following structure: 1) Introduction; 2) Methods including subsection Implementation and Operation; 3) Use cases; and 4) Conclusions. The “Use cases” section involves examples of SCUBA usage: 1) application of the FetchData function to retrieve features, metadata, and dimensionality reduction coordinates; 2) S3 generics and methods for similar tasks as in (1), including the functions fetch_metadata and fetch_reduction; 3) speed tests of the functions for SeuratObjects stored in RDS format, and SingleCellExperiment and Anndata objects in h5 formats; 4) presentation of example scripts for single-cell data exploration and visualization. While the issue of single-cell data interoperability is a well-recognized and pervasive challenge in the field, the manuscript in its current form does not appear to present a significant advancement or novel solution to address this problem comprehensively.
Rather, it represents a set of functions that allow accessing data from various single-cell data objects needed for visualization purposes, and it should be part of the scExploreR package rather than a standalone package.

Other comments:

1) The authors justify the need for SCUBA development by the fact that conversion between different data storage formats results in data loss or that this process is error-prone. Additionally, the rapid development of single-cell packages means that conversion functions are difficult to maintain and frequently break due to changes in other packages. I don’t understand why data is lost during conversion or why it is error-prone, and how their approach overcomes these limitations. For me, SCUBA has exactly the same problems. In the case of the mentioned updates in third-party single-cell software and storage formats, the authors’ SCUBA code will require updates in the same way.

2) Authors should explain how different visualization software like CellxGene and many others retrieve features or cell metadata and benchmark them against SCUBA.

3) One of the most common features of exploratory analysis is retrieval of feature values such as gene expression or chromatin accessibility from raw count or processed feature matrices. While this function is presented in Figure 2, it is omitted in the performance testing in Figure 4. Why? Can SCUBA extract, for example, gene expression values for a specified gene from huge matrices (500k cells) from SeuratObjects, SingleCellExperiments, and Anndata objects in an efficient manner? The authors show only retrieval of cell coordinates and cell metadata, which are stored in smaller data structures.

4) It is not clear whether the authors also developed a new, faster version of FetchData for SeuratObject.

5) The speed tests in Figure 4 are confusing. It’s not clear when Authors use native “FetchData” functions from Seurat/Scanpy and when from SCUBA. For example, Figure 4: fetch_metadata is slower than “FetchData” from Scanpy for anndata?

6) All the differences in run times in A-C are of the order of 0.5-2 seconds, so negligible for users analyzing the single-cell data.

7) Authors state: “Analysis tools generally only accept one object class, forcing users to commit to a suite of packages at the beginning of an analysis. This restriction limits access to tools outside that suite, creating “walled gardens” that pose several challenges for single-cell analysis.” How does SCUBA overcome this problem?

8) The package lacks robust and well-organized documentation, such as that provided by the aforementioned zellkonverter, anndataR, or SeuratDisk. The use of the FetchData function is also unclear. The name exceptionally follows the Pascal case naming convention, unlike all other functions which follow the Camel case. If this is done deliberately, e.g., to highlight the use of the Seurat or SeuratObject packages, please consider including a namespace so that the user is aware of the external packages being employed. Alternatively, consider adding the fetch_data function to make the SCUBA package appear more like a closed, independent toolkit.

Is the rationale for developing the new software tool clearly explained?
Partly

Is the description of the software tool technically sound?
Partly

Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Partly

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes

Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
No

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Single-cell Genomics, Bioinformatics

We confirm that we have read this submission and believe that we have an appropriate level of expertise to state that we do not consider it to be of an acceptable scientific standard, for reasons outlined above.

Author Response 16 Jun 2025

William Showers, Division of Hematology, University of Colorado Anschutz Medical Campus, Aurora, USA

Thank you for the time and effort invested in your review. We have modified the manuscript and package, and we feel it is stronger thanks to your feedback. Please see below for point-by-point responses.

In this study, William M. Showers and colleagues developed SCUBA, an R package designed to facilitate access to single-cell data stored in various formats commonly utilized by single-cell data analysis software in R (Seurat, Scran) or Python (Scanpy). These single-cell data storage formats include SeuratObject (Seurat), SingleCellExperiment (Scran, Bioconductor’s single-cell tools), and Anndata (Scanpy). Interoperability between R-based and Python-based tools and data types is a real problem in bioinformatics. Several tools are typically used in this case, such as zellkonverter, anndataR, sceasy, or SeuratDisk. However, these tools rely on data type conversion, which can be computationally expensive. SCUBA implements a unified API for single-cell object formats to efficiently access data for exploratory analysis and visualization. Using FetchData methods, it extracts from storage objects a table with requested features, metadata, and reduction coordinates for each cell. The manuscript is clearly written, with a logical structure that effectively conveys the study's objective, and is organized in the following structure: 1) Introduction; 2) Methods including subsection Implementation and Operation; 3) Use cases; and 4) Conclusions. The “Use cases” section involves examples of SCUBA usage: 1) application of the FetchData function to retrieve features, metadata, and dimensionality reduction coordinates; 2) S3 generics and methods for similar tasks as in (1), including the functions fetch_metadata and fetch_reduction; 3) speed tests of the functions for SeuratObjects stored in RDS format, and SingleCellExperiment and Anndata objects in h5 formats; 4) presentation of example scripts for single-cell data exploration and visualization. While the issue of single-cell data interoperability is a well-recognized and pervasive challenge in the field, the manuscript in its current form does not appear to present a significant advancement or novel solution to address this problem comprehensively.
Rather, it represents a set of functions that allow accessing data from various single-cell data objects needed for visualization purposes, and it should be part of the scExploreR package rather than a standalone package.

Thank you for this assessment. The observation that SCUBA is a set of functions for accessing data from single-cell objects is correct, but we disagree that SCUBA should not be a standalone package. The functions provided by SCUBA are useful both in fetching data for interactive report generation (the usage context for scExploreR), and for static report generation by bioinformaticians. The consistent usage of SCUBA across object formats facilitates, for example, re-running a specific visualization script on data from a group that uses a different object class. SCUBA is a tool for developers to work seamlessly across different object classes, and we envision it as the basis of an analysis suite of static visualization packages, as well as the basis of scExploreR. We improved our documentation with examples of how users would integrate SCUBA into their analysis scripts.

Other comments:
1) The authors justify the need for SCUBA development by the fact that conversion between different data storage formats results in data loss or that this process is error-prone. Additionally, the rapid development of single-cell packages means that conversion functions are difficult to maintain and frequently break due to changes in other packages. I don’t understand why data is lost during conversion or why it is error-prone, and how their approach overcomes these limitations. For me, SCUBA has exactly the same problems. In the case of the mentioned updates in third-party single-cell software and storage formats, the authors’ SCUBA code will require updates in the same way.

While it is true that SCUBA is vulnerable to changes in underlying object class implementation and must be maintained, we see this vulnerability as being less severe than with conversion packages. Conversion methods such as those mentioned aim to convert the entirety of each object class to another class, while SCUBA functions each extract small subsets of data. While still requiring active maintenance to track upstream changes, we believe this atomic approach is substantially less vulnerable to breaking changes that affect the entire package.

2) Authors should explain how different visualization software like CellxGene and many others retrieve features or cell metadata and benchmark them against SCUBA.

Thank you for the suggestion. We don’t feel this is in scope, as the API for data access in CELLxGENE and other visualization software is not exposed to end users in the way that Seurat’s FetchData is.

3) One of the most common features of exploratory analysis is retrieval of feature values such as gene expression or chromatin accessibility from raw count or processed feature matrices. While this function is presented in Figure 2, it is omitted in the performance testing in Figure 4. Why? Can SCUBA extract, for example, gene expression values for a specified gene from huge matrices (500k cells) from SeuratObjects, SingleCellExperiments, and Anndata objects in an efficient manner? The authors show only retrieval of cell coordinates and cell metadata, which are stored in smaller data structures.

Thank you for pointing this out. The function mentioned for retrieval of gene expression data was benchmarked in figure 4A. The captions for figure 4C and figure 4A were switched, which incorrectly suggested that figure 4A was benchmarking fetch_reduction. We regret the confusion caused by this error and have corrected it in the revised version. Figure 4A has also been changed to more clearly reflect that the performance of the original Seurat FetchData is being compared to the methods added in SCUBA.

4) It is not clear whether the authors also developed a new, faster version of FetchData for SeuratObject.

We did not. Figure 4A has been updated to make this clear. In addition, we created a snake case fetch_data generic in response to comment 8, below. `fetch_data.Seurat` is simply a wrapper for Seurat’s FetchData, while `fetch_data.SingleCellExperiment` and `fetch_data.AnnDataR6` are added by SCUBA. This has been made clear in the manuscript and the updated documentation.
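
The dispatch pattern described in this response can be illustrated with a minimal base R sketch. It shows how a single S3 generic can route to a thin wrapper method for one class and to a separately written method for another, while returning the same cells-by-features table. The classes `toy_seurat` and `toy_sce`, the helper `toy_native_fetch`, the `vars` argument, and the data are all hypothetical stand-ins chosen for illustration; this is not SCUBA's actual implementation.

```r
# Toy S3 generic: dispatches on the class of `object`.
fetch_data <- function(object, vars, ...) {
  UseMethod("fetch_data")
}

# Hypothetical stand-in for an accessor that a package already provides.
toy_native_fetch <- function(object, vars) {
  object$data[, vars, drop = FALSE]
}

# Wrapper method: simply delegates to the existing accessor.
fetch_data.toy_seurat <- function(object, vars, ...) {
  toy_native_fetch(object, vars)
}

# Newly written method: this toy class stores features in rows,
# so the selected rows are transposed into a cells-by-features table.
fetch_data.toy_sce <- function(object, vars, ...) {
  as.data.frame(t(object$assay[vars, , drop = FALSE]))
}

cells <- c("cell1", "cell2", "cell3")

# Toy object holding a cells-by-features data frame.
seurat_like <- structure(
  list(data = data.frame(
    GeneA = c(1, 0, 2),
    GeneB = c(0, 3, 1),
    row.names = cells
  )),
  class = "toy_seurat"
)

# Toy object holding a features-by-cells matrix with the same values.
sce_like <- structure(
  list(assay = matrix(
    c(1, 0, 2, 0, 3, 1),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("GeneA", "GeneB"), cells)
  )),
  class = "toy_sce"
)

# The same call works for both classes.
res_seurat <- fetch_data(seurat_like, vars = c("GeneA", "GeneB"))
res_sce <- fetch_data(sce_like, vars = c("GeneA", "GeneB"))
```

In this sketch the calling code never inspects the object class; only the methods know about each storage layout, which is the property that lets one script run on several object types.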

5) The speed tests in Figure 4 are confusing. It’s not clear when Authors use native “FetchData” functions from Seurat/Scanpy and when from SCUBA. For example, Figure 4: fetch_metadata is slower than “FetchData” from Scanpy for anndata?

As mentioned above, we feel that the new generic fetch_data makes this clear. Figure 4B and figure 4C compare the `fetch_data` method in SCUBA to `fetch_metadata` and `fetch_reduction`, respectively.

6) All the differences in run times in A-C are of the order of 0.5-2 seconds, so negligible for users analyzing the single-cell data.

Thanks for this observation. We agree in the context of interactive analysis in an R session, but disagree in the context of interactive web apps that plot data from single-cell objects with many cells. Regardless, while this is a valid point, we feel the main benefit of SCUBA is the consistency in syntax across object formats.

7) Authors state: “Analysis tools generally only accept one object class, forcing users to commit to a suite of packages at the beginning of an analysis. This restriction limits access to tools outside that suite, creating “walled gardens” that pose several challenges for single-cell analysis.” How does SCUBA overcome this problem?

While SCUBA does not fully overcome this problem, we see SCUBA as being a dependency of packages that would do so. We envision an analysis suite with packages for analyses such as differential gene expression and gene set enrichment analysis, as well as static and interactive visualizations. Packages built on SCUBA access functions will be flexible by default to input object class, and should be easier for end users to use compared to running a function from a conversion package, and then running an existing function from an analysis package. We also expect packages based on SCUBA to be easier to develop, since the access functions are easier to use than running class-specific access code, and have consistent syntax.

8) The package lacks robust and well-organized documentation, such as that provided by the aforementioned zellkonverter, anndataR, or SeuratDisk. The use of the FetchData function is also unclear. The name exceptionally follows the Pascal case naming convention, unlike all other functions which follow the Camel case. If this is done deliberately, e.g., to highlight the use of the Seurat or SeuratObject packages, please consider including a namespace so that the user is aware of the external packages being employed. Alternatively, consider adding the fetch_data function to make the SCUBA package appear more like a closed, independent toolkit.

We appreciate these observations. In addition to overhauling the function documentation for clarity, we added a pkgdown site at https://amc-heme.github.io/SCUBA/. On the site, we have added the “User Guide” vignette to the “Articles” tab, which contains a walkthrough of key functions in the SCUBA package for the access, exploration, and visualization of single-cell datasets.

As suggested, we created a new generic fetch_data, and moved the Pascal case FetchData methods added by SCUBA to this generic. The fetch_data method for Seurat is a wrapper for Seurat’s FetchData method, and the fetch_data methods for SingleCellExperiment and anndata are added by SCUBA. In addition to improving clarity, the fetch_data generic removes the need for a dependency on Seurat.
Competing Interests: No competing interests were disclosed. Close
Report a concern
Respond or Comment

COMMENTS ON THIS REPORT

Author Response 16 Jun 2025

William Showers, Division of Hematology, University of Colorado Anschutz Medical Campus, Aurora, USA

16 Jun 2025

Author Response

Thank you for the time and effort invested in your review. We have modified the manuscript and package, and we feel it is stronger thanks to your feedback. Please see ... Continue reading Thank you for the time and effort invested in your review. We have modified the manuscript and package, and we feel it is stronger thanks to your feedback. Please see below for point-by-point responses.

In this study, William M. Showers and colleagues developed SCUBA, an R package designed to facilitate the access to single-cell data stored in various formats commonly utilized by single-cell data analysis software in R (Seurat, Scran) or Python (Scanpy). These single-cell data storage formats include SeuratObject (Seurat), SingleCellExperiment (Scran, Bioconductor’s single-cell tools), Anndata (Scanpy). Interoperability between R-based and Python-based tools and data types is a real problem in bioinformatics. Several tools are typically used in this case, such as zellkonverter, anndataR, sceasy, or SeuratDisk. However, these tools rely on data type conversion, which can be computationally expensive. SCUBA implements a unified API for single-cell object formats to efficiently access data for exploratory analysis and visualization. It extracts from storage objects using FetchData methods a table with requested features, metadata, and reduction coordinates for each cell. The manuscript is clearly written, with a logical structure that effectively conveys the study's objective and organized in the following structure: 1) Introduction; 2) Methods including subsection Implementation and Operation; 3) Use cases; and 4) Conclusions. The “Use cases” section involve examples of Scuba usage: 1) application of FetchData function to retrieve features, metadata, and dimensionality reduction coordinates; 2) S3 generics and methods for similar tasks as in (1) including functions fetch_metadata and fetch_reduction; 3) speed tests of the functions for SeuratObjects stored in RDS format, SingleCellExperiment and Anndata in h5 formats; 3) presentation of example scripts for single-cell data exploration and visualization. While the issue of single-cell data interoperability is a well-recognized and pervasive challenge in the field, the manuscript in the current form does not appear to present a significant advancement or novel solution to address this problem comprehensively. 
It represents rather a set of functions that allows accessing data from various single-cell data objects needed for visualization purposes and should be a part of their scExploreR package rather than the standalone package.

Thank you for this assessment. The observation that SCUBA is a set of functions for accessing data from single-cell objects is correct, but we disagree that SCUBA should not be a standalone package. The functions provided by SCUBA are useful both in fetching data for interactive report generation (the usage context for scExploreR), and for static report generation by bioinformaticians. The consistent usage of SCUBA across object formats facilitates, for example, re-running a specific visualization script on data from a group that uses a different object class. SCUBA is a tool for developers to work seamlessly across different object classes, and we envision it as the basis of an analysis suite of static visualization packages, as well as the basis of scExploreR. We improved our documentation with examples of how users would integrate SCUBA into their analysis scripts.

Other comments:
1) The authors justify the need for SCUBA by stating that conversion between data storage formats results in data loss or is error-prone. Additionally, the rapid development of single-cell packages means that conversion functions are difficult to maintain and frequently break due to changes in other packages. I don’t understand why data is lost during conversion or why it is error-prone, or how their approach overcomes these limitations. To me, SCUBA has exactly the same problems. When third-party single-cell software and storage formats are updated as described, the authors’ SCUBA code will require updates in the same way.

While it is true that SCUBA is vulnerable to changes in underlying object class implementation and must be maintained, we see this vulnerability as being less severe than with conversion packages. Conversion methods such as those mentioned aim to convert the entirety of each object class to another class, while SCUBA functions each extract small subsets of data. While still requiring active maintenance to track upstream changes, we believe this atomic approach is substantially less vulnerable to breaking changes that affect the entire package.

2) Authors should explain how different visualization software like CellxGene and many others retrieve features or cell metadata and benchmark them against SCUBA.

Thank you for the suggestion. We don’t feel this is in scope, as the API for data access in CELLxGENE and other visualization software is not exposed to end users in the way that Seurat’s FetchData is.

3) One of the most common features of exploratory analysis is retrieval of feature values such as gene expression or chromatin accessibility from raw count or processed feature matrices. While this function is presented in Figure 2, it is omitted from the performance testing in Figure 4. Why? Can SCUBA efficiently extract, for example, gene expression values for a specified gene from huge matrices (500k cells) in SeuratObjects, SingleCellExperiments, and AnnData objects? The authors show only retrieval of cell coordinates and cell metadata, which are stored in smaller data structures.

Thank you for pointing this out. The function mentioned for retrieval of gene expression data was benchmarked in figure 4A. The captions for figure 4C and figure 4A were switched, which incorrectly suggested that figure 4A was benchmarking fetch_reduction. We regret the confusion caused by this error and have corrected it in the revised version. Figure 4A has also been changed to more clearly reflect that the performance of the original Seurat FetchData is being compared to the methods added in SCUBA.

4) It is not clear whether the authors also developed a new, faster version of FetchData for SeuratObject.

We did not. Figure 4A has been updated to make this clear. In addition, we created a snake case fetch_data generic in response to comment 8, below. `fetch_data.Seurat` is simply a wrapper for Seurat’s FetchData, while `fetch_data.SingleCellExperiment` and `fetch_data.AnnDataR6` are added by SCUBA. This has been made clear in the manuscript and the updated documentation.
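
The generic-plus-methods design described in this response can be illustrated with a minimal base R sketch. It is not taken from the SCUBA source: the classes `toy_seurat` and `toy_sce`, the `native_fetch` field, and the method bodies are hypothetical stand-ins that only show how one S3 generic can wrap an existing accessor for one class and supply its own extraction for another.

```r
# Toy S3 generic mirroring the fetch_data design described above.
# "toy_seurat" and "toy_sce" are hypothetical stand-ins, not the real
# Seurat or SingleCellExperiment classes.
fetch_data <- function(object, vars, ...) {
  UseMethod("fetch_data")
}

# Class whose own package already ships an accessor: the method only
# forwards the request to that accessor (here, a stored function).
fetch_data.toy_seurat <- function(object, vars, ...) {
  object$native_fetch(vars)
}

# Class without a native accessor: the method pulls the requested rows
# from a feature-by-cell matrix and returns a cell-by-feature table.
fetch_data.toy_sce <- function(object, vars, ...) {
  as.data.frame(t(object$assay[vars, , drop = FALSE]))
}

# Small feature-by-cell expression matrix used by both toy objects
expr <- matrix(
  c(0, 2, 5, 1, 0, 3),
  nrow = 2,
  dimnames = list(c("GeneA", "GeneB"), c("cell1", "cell2", "cell3"))
)

seurat_like <- structure(
  list(
    native_fetch = function(vars) {
      as.data.frame(t(expr[vars, , drop = FALSE]))
    }
  ),
  class = "toy_seurat"
)
sce_like <- structure(list(assay = expr), class = "toy_sce")

# The same call works for both classes
res_seurat <- fetch_data(seurat_like, vars = "GeneA")
res_sce <- fetch_data(sce_like, vars = "GeneA")
```

Because dispatch happens on the class of `object`, calling code never needs to branch on the object type, which is the property the response attributes to the new generic.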

5) The speed tests in Figure 4 are confusing. It’s not clear when Authors use native “FetchData” functions from Seurat/Scanpy and when from SCUBA. For example, Figure 4: fetch_metadata is slower than “FetchData” from Scanpy for anndata?

As mentioned above, we feel that the new generic fetch_data makes this clear. Figure 4B and figure 4C compare the `fetch_data` method in SCUBA to `fetch_metadata` and `fetch_reduction`, respectively.

6) All the differences in run times in A-C are of the order of 0.5-2 seconds, so negligible for users analyzing the single-cell data.

Thanks for this observation. We agree in the context of interactive analysis in an R session, but disagree in the context of interactive web apps that plot data from single-cell objects with many cells. Regardless, while this is a valid point, we feel the main benefit of SCUBA is the consistency in syntax across object formats.

7) Authors state: “Analysis tools generally only accept one object class, forcing users to commit to a suite of packages at the beginning of an analysis. This restriction limits access to tools outside that suite, creating “walled gardens” that pose several challenges for single-cell analysis.” How does SCUBA overcome this problem?

While SCUBA does not fully overcome this problem, we see SCUBA as being a dependency of packages that would do so. We envision an analysis suite with packages for analyses such as differential gene expression and gene set enrichment analysis, as well as static and interactive visualizations. Packages built on SCUBA access functions will be flexible by default to input object class, and should be easier for end users to use compared to running a function from a conversion package, and then running an existing function from an analysis package. We also expect packages based on SCUBA to be easier to develop, since the access functions are easier to use than running class-specific access code, and have consistent syntax.

8) The package lacks robust and well-organized documentation, such as that provided by the aforementioned zellkonverter, anndataR, or SeuratDisk. The use of the FetchData function is also unclear. The name exceptionally follows the Pascal case naming convention, unlike all other functions which follow the Camel case. If this is done deliberately, e.g., to highlight the use of the Seurat or SeuratObject packages, please consider including a namespace so that the user is aware of the external packages being employed. Alternatively, consider adding the fetch_data function to make the SCUBA package appear more like a closed, independent toolkit.

We appreciate these observations. In addition to overhauling the function documentation for clarity, we added a pkgdown site at https://amc-heme.github.io/SCUBA/. In the site, we have added the “User Guide” vignette to the “Articles” tab, which contains a walkthrough of key functions in the SCUBA package for the access, exploration, and visualization of single-cell datasets.
As suggested, we created a new generic, fetch_data, and moved the Pascal case FetchData methods added by SCUBA to this generic. The fetch_data method for Seurat is a wrapper for Seurat’s FetchData method, and the fetch_data methods for SingleCellExperiment and anndata are added by SCUBA. In addition to improving clarity, the fetch_data generic removes the need for a dependency on Seurat.
Competing Interests: No competing interests were disclosed.

Reviewer Report 15 Jan 2025

Kristian Ullrich, Scientific IT group, Max Planck Institute for Evolutionary Biology, Plön, Germany

Approved with Reservations

https://doi.org/10.5256/f1000research.169727.r334007

In this article the authors provide a valuable R package that implements a unified data access API for single-cell object formats like `anndata` objects (Python) and `Seurat/SingleCellExperiment` objects (R). The SCUBA API (implemented in R) keeps the original single-cell objects unmodified and uses three data access methods, namely `FetchData`, `fetch_metadata` and `fetch_reduction`. To more easily compare pre-processed data stored in common single-cell data formats, like `AnnData`, `SingleCellExperiment` (SCE) and `Seurat`, the authors fetch metadata from cells and genes from the original object and create a `data.frame` R object with the requested features. Using the `data.frame` object, downstream plotting functions are provided to create plots in the `ggplot2` R package grammar.

Code review
Given the dependency issues related to the base R version and Python version, I would encourage the authors to submit their package to either the CRAN or Bioconductor R repository so that R package community standards are tested in a continuous integration setup. The authors should try to alter their code to pass at least the R-CMD-check without warnings and errors.

In the "Use cases" section, the authors claim that "Additional vignettes are provided on the SCUBA GitHub page", which is not true at the time of writing this review. Please add the mentioned vignettes to the GitHub page.

The provided R functions should contain runnable examples. Building a vignette would help present the basic functions of the R package to the community and provide more background information, if needed. I suggest that the authors include a GitHub workflow that builds the R package documentation and vignettes and converts them into a website, e.g. with the `pkgdown` package or a similar workflow.

Software

I was able to install the software and run the examples given on the main github page, despite the need to pre-load the `Seurat` R package. As indicated in the Minor comments part, the three functions `reduction_names()`, `assay_names()` and `features_in_assay()` have been removed from the repository and the given example code block should be also removed from the github page.
The Rmarkdown files on the github page https://github.com/amc-heme/SCUBA_Manuscript have been inspected and partially run. I have not re-created the subset data sets, however the code base seems to be fine.

Major comments

If many genes are requested by the `FetchData` function, creating a single `data.frame` has the downside of not benefiting from sparse data structures, leading to a high memory footprint. Please implement some checks to prevent a user from requesting too many gene features given the local resources.
Please remove "manually" from the sentence: "Currently, the most effective solution is to manually convert between object classes." Here, "manually" implies a high programming effort, which in my opinion is not true, given existing conversion tools like `SeuratWrapper`, `zellkonverter`, `SCEasy Converter`, and `scDIOR`. Please add the corresponding literature for the mentioned tools.
In your R DESCRIPTION file you set the `reticulate` and `anndata` R packages as Suggests; however, for your API to be fully functional these packages need to be imported. Please change accordingly.

Minor comments

Please change the DESCRIPTION and add the corresponding author/creators in the github repository.
Please alter the examples from your initial github page so that they work as expected. E.g. the `FetchData` function for Seurat objects is not pre-loaded; one needs to first import Seurat with `library(Seurat)`.
Please remove the following example part from your github pages, since the functions have been removed and are not working anymore: `reduction_names()`, `assay_names()` and `features_in_assay()`.
Please update the corresponding `fetch_...` function main description and param field about what object types are supported, since the `AnnDataR6` class in most cases is supported but not mentioned.
Please rephrase "Single-cell analysis can be inaccessible to bench scientists due to programming experience required, ...", in my opinion this sentence is formulated too harshly, as it is possible e.g. to convert and analyze single-cell objects with GUI-based solutions such as usegalaxy.eu and the SCEasy Converter plugin.
Please provide a working example showing how a default user without admin privileges can set up and use `reticulate` in R for the use case described here (accessing an `AnnData` object), and how to install the prerequisite Python packages. The authors refer to the main `reticulate` documentation, however, showing a working code snippet example would be beneficial for scientists not fluent in both programming languages.
Please remove "for all three object types" from Figure 5A "Common plotting script for all three object types", since the plotting function applies on the constructed intermediate `data.frame` object.
Figure 6, please change the code example to use the R default pipe operator `|>` instead of the `magrittr` pipe operator `%>%` so that one would not need to pre-load the corresponding libraries.

Is the rationale for developing the new software tool clearly explained?
Yes

Is the description of the software tool technically sound?
Partly

Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes

Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Partly

References

1. Cakir B, Prete M, Huang N, van Dongen S, et al.: Comparison of visualization tools for single-cell RNAseq data. NAR Genom Bioinform. 2020; 2 (3): lqaa052
2. Feng H, Lin L, Chen J: scDIOR: single cell RNA-seq data IO software. BMC Bioinformatics. 2022; 23 (1): 16

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Comparative Genomics, Bioinformatics, R Programming

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Author Response 16 Jun 2025

William Showers, Division of Hematology, University of Colorado Anschutz Medical Campus, Aurora, USA

Thank you for the time and effort invested in your review. We have modified the manuscript and package, and we feel it is stronger thanks to your feedback. Please see below for point-by-point responses.

In this article the authors provide a valuable R package that implements a unified data access API for single-cell object formats like `anndata` objects (Python) and `Seurat/SingleCellExperiment` objects (R). The SCUBA API (implemented in R) keeps the original single-cell objects unmodified and uses three data access methods, namely `FetchData`, `fetch_metadata` and `fetch_reduction`. To more easily compare pre-processed data stored in common single-cell data formats, like `AnnData`, `SingleCellExperiment` (SCE) and `Seurat`, the authors fetch metadata from cells and genes from the original object and create a `data.frame` R object with the requested features. Using the `data.frame` object, downstream plotting functions are provided to create plots in the `ggplot2` R package grammar.

Code review
Given the dependency issues related to the base R version and Python version, I would encourage the authors to submit their package to either the CRAN or Bioconductor R repository so that R package community standards are tested in a continuous integration setup. The authors should try to alter their code to pass at least the R-CMD-check without warnings and errors.

Thank you for this valuable feedback. While we do not intend to submit this package to CRAN or Bioconductor, we recognize the value of continuous integration and have ensured that the code now passes the R-CMD-check without warnings or errors.

In the "Use cases" section, the authors claim that "Additional vignettes are provided on the SCUBA GitHub page", which is not true at the time of writing this review. Please add the mentioned vignettes to the GitHub page.

We have added a vignette with examples of all major SCUBA functions (see the comment below).

The provided R functions should contain runnable examples. Building a vignette would help present the basic functions of the R package to the community and provide more background information, if needed. I suggest that the authors include a GitHub workflow that builds the R package documentation and vignettes and converts them into a website, e.g. with the `pkgdown` package or a similar workflow.

We appreciate this feedback, and have improved our documentation to address these important observations. Specifically, we have added a pkgdown site at https://amc-heme.github.io/SCUBA/. Here, we have added the “User Guide” vignette to the “Articles” tab, which contains a walkthrough of key functions in the SCUBA package for the access, exploration, and visualization of single-cell datasets. The vignette shows usage examples for all three object types currently supported by SCUBA. In addition to adding the vignette, we updated the function documentation to improve clarity and provide usage examples.

Software

I was able to install the software and run the examples given on the main github page, despite the need to pre-load the `Seurat` R package. As indicated in the Minor comments part, the three functions `reduction_names()`, `assay_names()` and `features_in_assay()` have been removed from the repository and the given example code block should be also removed from the github page.

Thanks for letting us know about this. We have removed `reduction_names()` and `assay_names()` from the documentation to avoid this confusion. `features_in_assay()` was added back to the package, and documentation on its usage is now on our pkgdown site, in the “User Guide” article.

The Rmarkdown files on the github page https://github.com/amc-heme/SCUBA_Manuscript have been inspected and partially run. I have not re-created the subset data sets, however the code base seems to be fine.

We have double-checked the code in amc-heme/SCUBA_Manuscript internally for consistent outputs. Please let us know if you notice further issues with the code in the manuscript repo.

Major comments

If many genes are requested by the `FetchData` function, creating a single `data.frame` has the downside of not benefiting from sparse data structures, leading to a high memory footprint. Please implement some checks to prevent a user from requesting too many gene features given the local resources.

Thank you for this observation. We don’t feel that it is necessary to prevent users from doing this, but we have added a warning to users (shown below) in this scenario. The warning appears when 1000 or more genes are requested.

"A very large number of features was requested (<> features). fetch_data is not intended to be used with feature queries of this length. Data is returned in a dense format, so the memory usage of the output may be very large. Also, this query may take a while to complete."

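A size check of the kind described in this response could look like the base R sketch below. The helper name `warn_if_many_features`, its `threshold` argument, and the shortened message are hypothetical and are not SCUBA's actual code; only the threshold of 1000 features and the decision to warn rather than stop come from the response.

```r
# Hypothetical helper: warn (but do not stop) when a feature query is
# large enough that a dense result could use a lot of memory.
# The default threshold of 1000 follows the response above; the function
# name and message wording are illustrative only.
warn_if_many_features <- function(features, threshold = 1000) {
  n <- length(features)
  if (n >= threshold) {
    warning(
      "A very large number of features was requested (", n, " features). ",
      "Data is returned in a dense format, so memory usage may be very large.",
      call. = FALSE
    )
  }
  # Return the query size invisibly so the helper can sit inside a pipeline
  invisible(n)
}

small_query <- paste0("Gene", 1:10)
large_query <- paste0("Gene", 1:1000)

# Below the threshold: no warning is raised
small_n <- warn_if_many_features(small_query)

# At the threshold: capture the warning text instead of printing it
large_msg <- tryCatch(
  warn_if_many_features(large_query),
  warning = function(w) conditionMessage(w)
)
```

Using `warning()` instead of `stop()` leaves the decision with the user, which matches the authors' stated preference not to block large queries.
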
Please remove "manually" from the sentence: "Currently, the most effective solution is to manually convert between object classes." Here, "manually" implies a high programming effort, which in my opinion is not true, given existing conversion tools like `SeuratWrapper`, `zellkonverter`, `SCEasy Converter`, and `scDIOR`. Please add the corresponding literature for the mentioned tools.

Thank you for pointing this out. The sentence was changed as follows in the updated manuscript:

"All major single-cell analysis packages implement conversion functions, and third-party packages such as sceasy, zellkonverter, and scDIOR are specifically designed for these conversions. Additionally, SeuratWrapper implements conversion functions that allow several otherwise-incompatible single-cell analysis packages to be used on Seurat objects."

Citations for the corresponding literature have also been added to the revised manuscript.

In your R DESCRIPTION file you set the `reticulate` and `anndata` R packages as Suggests; however, for your API to be fully functional these packages need to be imported. Please change accordingly.

We appreciate this observation. While we prefer not to require users to install anndata and reticulate if they are not using anndata objects, we have added a conditional statement in all exported AnnDataR6 methods to check if these packages are installed, and throw an error directing users to install these packages if they are not.
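
One common way to implement such a conditional check for optional packages is base R's `requireNamespace()`. The sketch below is an assumption about how a guard of this kind might be written, not SCUBA's actual code; the helper name `check_optional_packages` and the error wording are hypothetical.

```r
# Hypothetical guard for methods that need optional ("Suggests") packages.
# requireNamespace() returns FALSE instead of erroring when a package is
# missing, so the check itself never fails.
check_optional_packages <- function(pkgs) {
  installed <- vapply(
    pkgs,
    function(pkg) requireNamespace(pkg, quietly = TRUE),
    logical(1)
  )
  missing_pkgs <- pkgs[!installed]
  if (length(missing_pkgs) > 0) {
    stop(
      "This method requires the following package(s): ",
      paste(missing_pkgs, collapse = ", "),
      ". Please install them, for example with install.packages().",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# A method for anndata objects could begin with a call such as:
# check_optional_packages(c("anndata", "reticulate"))

# Demonstration with a package that ships with R ...
ok <- check_optional_packages("stats")

# ... and with a placeholder name that is assumed not to be installed
err <- tryCatch(
  check_optional_packages("notARealPackage12345"),
  error = function(e) conditionMessage(e)
)
```

Keeping the packages under Suggests and checking at call time means users who only work with Seurat or SingleCellExperiment objects never need a Python installation.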

Minor comments

Please change the DESCRIPTION and add the corresponding author/creators in the github repository.

This change was made.

Please alter the examples from your initial github page so that they work as expected. E.g. the `FetchData` function for Seurat objects is not pre-loaded; one needs to first import Seurat with `library(Seurat)`.

To keep users from having to import Seurat to use our FetchData methods, we moved them to a new generic defined within our package (fetch_data).

Please remove the following example part from your github pages, since the functions have been removed and are not working anymore `reduction_names()`, `assay_names()` and `features_in_assay()`.

All three functions have been removed from the README on our GitHub page. We added `features_in_assay()` back to SCUBA, and have added documentation for the function in our user guide vignette.

Please update the corresponding `fetch_...` function main description and param field about what object types are supported, since the `AnnDataR6` class in most cases is supported but not mentioned.

This has been corrected. All fetch_* functions and all functions that take an object as a parameter now mention all supported object types. The description text for the `object` parameter is now also consistent across all functions.

Please rephrase "Single-cell analysis can be inaccessible to bench scientists due to programming experience required, ...". In my opinion, this sentence is formulated too harshly, as it is possible, for example, to convert and analyze single-cell objects with GUI-based solutions such as usegalaxy.eu and the SCEasy Converter plugin.

We have modified this sentence to the following: “Single-cell analysis requires programming experience for full customization of analysis and visualization, and different object formats make it even more difficult for biologists to analyze data”.

Please provide a working example showing how a default user without admin privileges can set up and use `reticulate` in R for the use case described here (accessing an `AnnData` object), including how to install the prerequisite Python packages. The authors refer to the main `reticulate` documentation; however, a working code snippet would be beneficial for scientists not fluent in both programming languages.

Thanks for this suggestion. We have added this to the README, under the section “Additional Installation for anndata Objects”.
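For readers without the README at hand, one possible user-level setup is sketched below. It is an assumption-laden illustration rather than the README's instructions: the function name `setup_anndata_env()` and the environment name `scuba-anndata` are hypothetical, and the sketch assumes a Miniconda-based route, which installs into the user's home directory and therefore does not need admin rights.

```r
# Hypothetical helper (not from SCUBA): create a user-level conda
# environment containing the Python anndata package and bind reticulate to it.
# Call it in a fresh R session, before Python has been initialized.
setup_anndata_env <- function(envname = "scuba-anndata") {
  if (!requireNamespace("reticulate", quietly = TRUE)) {
    stop("Install reticulate first: install.packages('reticulate')", call. = FALSE)
  }
  # Install Miniconda into the user's home directory if it is not there yet
  if (!dir.exists(reticulate::miniconda_path())) {
    reticulate::install_miniconda()
  }
  # Create the environment only if it does not already exist
  if (!envname %in% reticulate::conda_list()$name) {
    reticulate::conda_create(envname)
  }
  # Install the Python anndata package with pip inside that environment
  reticulate::conda_install(envname, packages = "anndata", pip = TRUE)
  # Tell reticulate to use this environment for the current session
  reticulate::use_condaenv(envname, required = TRUE)
  invisible(reticulate::py_module_available("anndata"))
}
```

After a call such as `setup_anndata_env()`, an h5ad file could then be loaded with `anndata::read_h5ad()` from the anndata R package. The README section named in the response remains the authoritative reference.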

Please remove "for all three object types" from the Figure 5A caption "Common plotting script for all three object types", since the plotting function operates on the constructed intermediate `data.frame` object.

This has been changed. The sentence now reads “Example script for visualizing expression of a gene by cluster in a density plot.”

In Figure 6, please change the code example to use the R default pipe operator `|>` instead of the `magrittr` pipe operator `%>%`, so that users do not need to pre-load the corresponding libraries.

This change has been made. Please see figure 6 in the revised submission.
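The point of this request can be illustrated with a short base R sketch. The data frame below is made up for the example and is not the code from Figure 6; the native pipe `|>` requires R 4.1 or later and needs no `library()` call.

```r
# Toy expression table: one row per cell (illustrative values only)
expr <- data.frame(
  cluster = c("A", "A", "B", "B"),
  GeneA = c(0.0, 2.0, 1.0, 3.0)
)

# Count cells with GeneA expression above 0.5 using only the native pipe
high_cells <- expr |>
  subset(GeneA > 0.5) |>
  nrow()
```

The same chain written with `%>%` would fail in a fresh session unless `magrittr` (or a package that re-exports its pipe) had been attached first.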
Competing Interests: No competing interests were disclosed.

COMMENTS ON THIS REPORT

Author Response 16 Jun 2025

William Showers, Division of Hematology, University of Colorado Anschutz Medical Campus, Aurora, USA

Competing Interests: No competing interests were disclosed.

Reviewer Report 17 Dec 2024

Huamei Li, Nanjing University, Nanjing, Jiangsu, China

Approved

https://doi.org/10.5256/f1000research.169727.r346283

Showers et al. developed the SCUBA R package as a unified API for single-cell data analysis. In single-cell and spatial transcriptomics studies, data format conversion between Seurat, SingleCellExperiment, and AnnData often presents challenges, leading to obstacles in downstream analyses. While several tools have been developed to address data format conversion issues, SCUBA provides a unified and user-friendly interface for handling multiple formats and simplifies single-cell data processing workflows, making it a promising tool for practical applications. However, several concerns regarding SCUBA require further clarification:

1) The README file available at https://github.com/amc-heme/SCUBA/tree/main does not include examples demonstrating how SCUBA reads Seurat, SingleCellExperiment, and AnnData formats from local files. Additionally, it is unclear whether SCUBA addresses compatibility issues arising from different versions of these formats, such as Seurat 4.0+ versus Seurat 5.0+. Can SCUBA facilitate seamless data format conversion and overcome challenges posed by version discrepancies?

2) While SCUBA provides application examples for single-cell datasets, it remains unclear whether the tool can be effectively extended to spatial transcriptomics datasets generated by different platforms. Clarification on its applicability to spatial datasets would be beneficial.

3) To enhance visualization capabilities, it is recommended that SCUBA include optional smoothing methods for displaying the expression distribution of feature genes. This would allow for better identification of regions with concentrated expression of specific features.

4) The SCUBA documentation could be further improved by providing more detailed explanations of the functions, their usage, and intended purposes. Additionally, including a comprehensive analysis example that integrates both single-cell and spatial transcriptomics data would help illustrate the full workflow and practical utility of SCUBA.

Is the rationale for developing the new software tool clearly explained?

Yes
Is the description of the software tool technically sound?

Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Partly
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Bioinformatics, Immunogenetics, Single-cell and spatial technologies

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Author Response 09 Aug 2025

William Showers, Division of Hematology, University of Colorado Anschutz Medical Campus, Aurora, USA

Thank you for the time and effort invested in your review. We have modified the manuscript and package, and we feel it is stronger thanks to your feedback. Please see below for point-by-point responses.

Showers et al. developed the SCUBA R package as a unified API for single-cell data analysis. In single-cell and spatial transcriptomics studies, data format conversion between Seurat, SingleCellExperiment, and AnnData often presents challenges, leading to obstacles in downstream analyses. While several tools have been developed to address data format conversion issues, SCUBA provides a unified and user-friendly interface for handling multiple formats and simplifies single-cell data processing workflows, making it a promising tool for practical applications. However, several concerns regarding SCUBA require further clarification:

1) The README file available at https://github.com/amc-heme/SCUBA/tree/main does not include examples demonstrating how SCUBA reads Seurat, SingleCellExperiment, and AnnData formats from local files. Additionally, it is unclear whether SCUBA addresses compatibility issues arising from different versions of these formats, such as Seurat 4.0+ versus Seurat 5.0+. Can SCUBA facilitate seamless data format conversion and overcome challenges posed by version discrepancies?

Thank you for this observation. We have added new documentation via a pkgdown website (https://amc-heme.github.io/SCUBA/). The Article “User Guide” contains a note on how each object class is loaded. SCUBA is seamlessly compatible with Seurat v4 and v5 objects. We have added a note on this to the README.

2) While SCUBA provides application examples for single-cell datasets, it remains unclear whether the tool can be effectively extended to spatial transcriptomics datasets generated by different platforms. Clarification on its applicability to spatial datasets would be beneficial.

SCUBA partially supports spatial datasets. Aspects of spatial datasets that can be expressed as a counts matrix can be loaded into SCUBA, but we do not yet support the loading of image files. We have added a note on this to the README.

3) To enhance visualization capabilities, it is recommended that SCUBA include optional smoothing methods for displaying the expression distribution of feature genes. This would allow for better identification of regions with concentrated expression of specific features.

Thank you for the suggestion. While we agree that this is a valuable approach, we feel this is out of scope for the current tool, and welcome users to apply their own smoothing methods in plotting scripts based on SCUBA.

4) The SCUBA documentation could be further improved by providing more detailed explanations of the functions, their usage, and intended purposes. Additionally, including a comprehensive analysis example that integrates both single-cell and spatial transcriptomics data would help illustrate the full workflow and practical utility of SCUBA.

Thank you for the suggestion. We have added the suggested documentation to the README and the user guide vignette on the pkgdown website. We have also updated function documentation in SCUBA v.1.1.0.

Competing Interests: We disclose no competing interests.

Version 2

VERSION 2 PUBLISHED 21 Oct 2024

Open Peer Review

Reviewer Status

Reviewer Reports

Invited Reviewers: 1, 2, 3, 4
Version 2 (revision), 02 Jun 25: reports from reviewers 3 and 4
Version 1, 21 Oct 24: reports from reviewers 1, 2, and 3

Huamei Li, Nanjing University, Nanjing, China
Kristian Ullrich, Max Planck Institute for Evolutionary Biology, Plon, Germany
Damian Panas, Institute of Physical Chemistry, Polish Academy of Sciences, Warsaw, Poland

Marcin Tabaka, Institute of Physical Chemistry, Polish Academy of Sciences, Warsaw, Poland
Benedikt Obermayer, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Berlin, Germany

Reviewer Report

19 Aug 2025 | for Version 2

Benedikt Obermayer, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Berlin, Germany

Approved

This manuscript presents an R package for unified data access to single-cell genomics objects from different commonly used formats (Seurat, SingleCellExperiment, and anndata). In version 2, most issues raised by previous reviewers were satisfactorily addressed, and the paper has reached a sufficiently sound stage.
I have one more major issue and a few small corrections that could be incorporated.

Major issue:
I agree with Damian Panas and Marcin Tabaka that this package does not really present a significant advancement or novel solution to comprehensively address the problem of interoperability. People who manage to successfully set up reticulate and conda environments for use within their R installation would be expected to be able to get the necessary data out of objects in different formats. People with less expertise, for whom a visualization / data exploration tool allowing for different input formats might be most useful, will probably not be able to set this up in a reasonable time frame. In that case, I'd prefer a web-based or maybe Docker-based explorer that accepts R as well as Python objects as input.

Minor issues:
- I don't really understand the advantage of SCUBA over the anndata R package for reading h5ad files and accessing their contents, apart from a unified syntax. (However, anndata-specific data structures such as the ad$raw slot don't seem to be accessible using SCUBA.) SCUBA appears to be somewhat faster than anndata in my hands, but that is probably of little concern to most users. This could be explained better.
- I noticed two typos (Suerat instead of Seurat in Fig. 1 caption, and SueratWrapper instead of SeuratWrapper on p4 bottom).
- When installing the package, my existing conda env with anndata did not have a sufficiently recent version (including anndata.abc, which I think was introduced in 0.10). The requirements don't specify this.
- The plot functions still use the deprecated FetchData method. Why is the density plot not part of the package if it's used in the User Guide?

Is the rationale for developing the new software tool clearly explained?

Yes
Is the description of the software tool technically sound?

Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Partly

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Bioinformatics, single-cell genomics, computational biology

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer Report

13 Aug 2025 | for Version 2

Marcin Tabaka, International Centre for Translational Eye Research, Institute of Physical Chemistry, Polish Academy of Sciences, Warsaw, Poland

Approved

The revised article is a significant improvement on the initial submission. The comprehensive documentation, created using the pkgdown framework, is an essential addition. The documentation is now clearly structured, detailed, and supplemented by numerous examples. The authors have reformatted the manuscript to improve clarity and expanded the benchmarking section to provide a more informative performance comparison. Another significant addition is the new fetch_data() function, which resolves one of the most important issues raised in the initial submission. While a few minor problems presented earlier remain, the most critical concerns have been addressed and resolved. Overall, the current version of SCUBA is a robust, high-quality tool that demonstrates thoughtful improvements in its usability.

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Single-cell Genomics, Bioinformatics

We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer Report

29 Jan 2025 | for Version 1

Marcin Tabaka, International Centre for Translational Eye Research, Institute of Physical Chemistry, Polish Academy of Sciences, Warsaw, Poland

Not Approved

Is the rationale for developing the new software tool clearly explained?

Partly
Is the description of the software tool technically sound?

Partly
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Partly
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

No

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Single-cell Genomics, Bioinformatics

Author Response

16 Jun 2025

William Showers, Division of Hematology, University of Colorado Anschutz Medical Campus, Aurora, USA

Thank you for the time and effort invested in your review. We have modified the manuscript and package, and we feel it is stronger thanks to your feedback. Please see below for point-by-point responses.

In this study, William M. Showers and colleagues developed SCUBA, an R package designed to facilitate access to single-cell data stored in various formats commonly utilized by single-cell data analysis software in R (Seurat, Scran) or Python (Scanpy). These single-cell data storage formats include SeuratObject (Seurat), SingleCellExperiment (Scran, Bioconductor’s single-cell tools), and AnnData (Scanpy). Interoperability between R-based and Python-based tools and data types is a real problem in bioinformatics. Several tools are typically used in this case, such as zellkonverter, anndataR, sceasy, or SeuratDisk. However, these tools rely on data type conversion, which can be computationally expensive. SCUBA implements a unified API for single-cell object formats to efficiently access data for exploratory analysis and visualization. Using FetchData methods, it extracts from storage objects a table with requested features, metadata, and reduction coordinates for each cell. The manuscript is clearly written, with a logical structure that effectively conveys the study's objective, and is organized as follows: 1) Introduction; 2) Methods, including the subsections Implementation and Operation; 3) Use cases; and 4) Conclusions. The “Use cases” section involves examples of SCUBA usage: 1) application of the FetchData function to retrieve features, metadata, and dimensionality reduction coordinates; 2) S3 generics and methods for similar tasks as in (1), including the functions fetch_metadata and fetch_reduction; 3) speed tests of the functions for SeuratObjects stored in RDS format, and SingleCellExperiment and AnnData objects in h5 formats; 4) presentation of example scripts for single-cell data exploration and visualization. While the issue of single-cell data interoperability is a well-recognized and pervasive challenge in the field, the manuscript in its current form does not appear to present a significant advancement or novel solution to address this problem comprehensively. Rather, it represents a set of functions that allow access to data from various single-cell data objects for visualization purposes, and it should be part of the authors' scExploreR package rather than a standalone package.

Thank you for this assessment. The observation that SCUBA is a set of functions for accessing data from single-cell objects is correct, but we disagree that SCUBA should not be a standalone package. The functions provided by SCUBA are useful both in fetching data for interactive report generation (the usage context for scExploreR), and for static report generation by bioinformaticians. The consistent usage of SCUBA across object formats facilitates, for example, re-running a specific visualization script on data from a group that uses a different object class. SCUBA is a tool for developers to work seamlessly across different object classes, and we envision it as the basis of an analysis suite of static visualization packages, as well as the basis of scExploreR. We improved our documentation with examples of how users would integrate SCUBA into their analysis scripts.

Other comments:
1) The authors justify the need for SCUBA development by the fact that conversion between different data storage formats results in data loss or that this process is error-prone. Additionally, the rapid development of single-cell packages means that conversion functions are difficult to maintain and frequently break due to changes in other packages. I don’t understand why data is lost during conversion, why conversion is error-prone, or how their approach overcomes these limitations. To me, SCUBA has exactly the same problems. In the case of the mentioned updates in third-party single-cell software and storage formats, the authors’ SCUBA code will require updates in the same way.

While it is true that SCUBA is vulnerable to changes in underlying object class implementation and must be maintained, we see this vulnerability as being less severe than with conversion packages. Conversion methods such as those mentioned aim to convert the entirety of each object class to another class, while SCUBA functions each extract small subsets of data. While still requiring active maintenance to track upstream changes, we believe this atomic approach is substantially less vulnerable to breaking changes that affect the entire package.

2) Authors should explain how different visualization software like CellxGene and many others retrieve features or cell metadata and benchmark them against SCUBA.

Thank you for the suggestion. We don’t feel this is in scope, as the API for data access in CELLxGENE and other visualization software is not exposed to end users in the way that Seurat’s FetchData is.

3) One of the most common features of exploratory analysis is retrieval of feature values such as gene expression or chromatin accessibility from raw count or processed feature matrices. While this function is presented in Figure 2, it is omitted in the performance testing in Figure 4. Why? Can SCUBA extract, for example, gene expression values for a specified gene from huge matrices (500k cells) from SeuratObjects, SingleCellExperiments, Anndata objects in efficient manner? Authors show only retrieval of cell coordinates and cell metadata which are stored in smaller data structures.

Thank you for pointing this out. The function mentioned for retrieval of gene expression data was benchmarked in figure 4A. The captions for figure 4C and figure 4A were switched, which incorrectly suggested that figure 4A was benchmarking fetch_reduction. We regret the confusion caused by this error and have corrected it in the revised version. Figure 4A has also been changed to more clearly reflect that the performance of the original Seurat FetchData is being compared to the methods added in SCUBA.

4) It is not clear if Authors developed also a new faster version of FetchData for SeuratObject.

We did not. Figure 4A has been updated to make this clear. In addition, we created a snake case fetch_data generic in response to comment 8, below. `fetch_data.Seurat` is simply a wrapper for Seurat’s FetchData, while `fetch_data.SingleCellExperiment` and `fetch_data.AnnDataR6` are added by SCUBA. This has been made clear in the manuscript and the updated documentation.
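
The wrapper-versus-new-method distinction described in this response can be illustrated with a toy S3 generic. The sketch below is hypothetical and is not SCUBA's implementation: the generic `fetch_data_sketch`, the helper `toy_native_fetch`, and the classes `toy_wrapped` and `toy_features_by_cells` are invented for illustration and only mimic the idea that one method delegates to an existing accessor while another reshapes a features-by-cells matrix.

```r
# Toy stand-ins for real object classes. These are NOT Seurat,
# SingleCellExperiment, or anndata objects.
make_toy <- function(class_name, expr) {
  structure(list(expr = expr), class = class_name)
}

# Generic: one call signature regardless of the class of `object`.
fetch_data_sketch <- function(object, vars, ...) {
  UseMethod("fetch_data_sketch")
}

# Hypothetical "native" accessor that a class might already provide.
toy_native_fetch <- function(object, vars) {
  as.data.frame(object$expr[, vars, drop = FALSE])
}

# Method 1: a thin wrapper that only delegates to the existing accessor.
fetch_data_sketch.toy_wrapped <- function(object, vars, ...) {
  toy_native_fetch(object, vars)
}

# Method 2: new logic for a class that stores features in rows,
# so the selected rows are transposed to a cells-by-features table.
fetch_data_sketch.toy_features_by_cells <- function(object, vars, ...) {
  as.data.frame(t(object$expr[vars, , drop = FALSE]))
}

# Fallback with an informative error for unsupported classes.
fetch_data_sketch.default <- function(object, vars, ...) {
  stop("Unsupported object class: ", paste(class(object), collapse = ", "))
}

# The same call works on both toy classes and returns the same table.
expr <- matrix(1:6, nrow = 3,
               dimnames = list(paste0("cell", 1:3), c("GeneA", "GeneB")))
obj_a <- make_toy("toy_wrapped", expr)
obj_b <- make_toy("toy_features_by_cells", t(expr))
res_a <- fetch_data_sketch(obj_a, "GeneB")
res_b <- fetch_data_sketch(obj_b, "GeneB")
```

Calling code only ever sees `fetch_data_sketch(object, vars)`, which is the property the response attributes to the new generic.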

5) The speed tests in Figure 4 are confusing. It’s not clear when Authors use native “FetchData” functions from Seurat/Scanpy and when from SCUBA. For example, Figure 4: fetch_metadata is slower than “FetchData” from Scanpy for anndata?

As mentioned above, we feel that the new generic fetch_data makes this clear. Figure 4B and figure 4C compare the `fetch_data` method in SCUBA to `fetch_metadata` and `fetch_reduction`, respectively.

6) All the differences in run times in A-C are of the order of 0.5-2 seconds, so negligible for users analyzing the single-cell data.

Thanks for this observation. We agree in the context of interactive analysis in an R session, but disagree in the context of interactive web apps that plot data from single-cell objects with many cells. Regardless, while this is a valid point, we feel the main benefit of SCUBA is the consistency in syntax across object formats.

7) Authors state: “Analysis tools generally only accept one object class, forcing users to commit to a suite of packages at the beginning of an analysis. This restriction limits access to tools outside that suite, creating “walled gardens” that pose several challenges for single-cell analysis.” How does SCUBA overcome this problem?

While SCUBA does not fully overcome this problem, we see SCUBA as being a dependency of packages that would do so. We envision an analysis suite with packages for analyses such as differential gene expression and gene set enrichment analysis, as well as static and interactive visualizations. Packages built on SCUBA access functions will be flexible by default to input object class, and should be easier for end users to use compared to running a function from a conversion package, and then running an existing function from an analysis package. We also expect packages based on SCUBA to be easier to develop, since the access functions are easier to use than running class-specific access code, and have consistent syntax.

8) The package lacks robust and well-organized documentation, such as that provided by the aforementioned zellkonverter, anndataR, or SeuratDisk. The use of the FetchData function is also unclear. The name exceptionally follows the Pascal case naming convention, unlike all other functions which follow the Camel case. If this is done deliberately, e.g., to highlight the use of the Seurat or SeuratObject packages, please consider including a namespace so that the user is aware of the external packages being employed. Alternatively, consider adding the fetch_data function to make the SCUBA package appear more like a closed, independent toolkit.

We appreciate these observations. In addition to overhauling the function documentation for clarity, we added a pkgdown site at https://amc-heme.github.io/SCUBA/. In the site, we have added the “User Guide” vignette to the “Articles” tab, which contains a walkthrough of key functions in the SCUBA package for the access, exploration, and visualization of single-cell datasets.
As suggested, we created a new generic, fetch_data, and moved the Pascal case FetchData methods added by SCUBA to this generic. The fetch_data method for Seurat is a wrapper for Seurat’s FetchData method, and the fetch_data methods for SingleCellExperiment and anndata are added by SCUBA. In addition to improving clarity, the fetch_data generic removes the need for a dependency on Seurat.

Competing Interests

No competing interests were disclosed.

Reviewer Report

15 Jan 2025 | for Version 1

Kristian Ullrich, Scientific IT group, Max Planck Institute for Evolutionary Biology, Plon, Germany

Approved With Reservations

I was able to install the software and run the examples given on the main GitHub page, despite the need to pre-load the `Seurat` R package. As indicated in the Minor comments part, the three functions `reduction_names()`, `assay_names()` and `features_in_assay()` have been removed from the repository, and the corresponding example code block should also be removed from the GitHub page.
The Rmarkdown files on the GitHub page https://github.com/amc-heme/SCUBA_Manuscript have been inspected and partially run. I have not re-created the subset data sets; however, the code base seems to be fine.

Major comments

If many genes are requested by the `FetchData` function, creating a single `data.frame` has the downside of not benefiting from sparse data structures, leading to a high memory footprint. Please implement some checks to prevent a user from requesting more gene features than the local resources can handle.
Please remove "manually" from the sentence: "Currently, the most effective solution is to manually convert between object classes." Here, "manually" implies a high programming effort, which in my opinion is not the case, given existing conversion tools such as `SeuratWrapper`, `zellkonverter`, `SCEasy Converter`, and `scDIOR`. Please add the corresponding literature for the mentioned tools.
In your R DESCRIPTION file, you list the `reticulate` and `anndata` R packages under Suggests; however, for your API to be fully functional, these packages need to be imported. Please change accordingly.

Minor comments

Please change the DESCRIPTION and add the corresponding author/creators in the github repository.
Please alter the examples on your initial GitHub page so that they work as expected. For example, the `FetchData` function for Seurat objects is not pre-loaded; one first needs to load Seurat with `library(Seurat)`.
Please remove the following example part from your GitHub page, since the functions `reduction_names()`, `assay_names()` and `features_in_assay()` have been removed and no longer work.
Please update the corresponding `fetch_...` function main description and param field about what object types are supported, since the `AnnDataR6` class in most cases is supported but not mentioned.
Please rephrase "Single-cell analysis can be inaccessible to bench scientists due to programming experience required, ...". In my opinion, this sentence is formulated too harshly, as it is possible, e.g., to convert and analyze single-cell objects with GUI-based solutions such as usegalaxy.eu and the SCEasy Converter plugin.
Please provide a working example showing how a default user without admin privileges can set up and use `reticulate` in R for the use case described here (accessing an `AnnData` object), including how to install the prerequisite Python packages. The authors refer to the main `reticulate` documentation; however, a working code snippet would be beneficial for scientists not fluent in both programming languages.
Please remove "for all three object types" from Figure 5A "Common plotting script for all three object types", since the plotting function is applied to the constructed intermediate `data.frame` object.
Figure 6, please change the code example to use the R default pipe operator `|>` instead of the `magrittr` pipe operator `%>%` so that one would not need to pre-load the corresponding libraries.

Is the rationale for developing the new software tool clearly explained?

Yes
Is the description of the software tool technically sound?

Partly
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Partly


Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Comparative Genomics, Bioinformatics, R Programming


Responses (1)

Author Response

16 Jun 2025

William Showers, Division of Hematology, University of Colorado Anschutz Medical Campus, Aurora, USA

Thank you for the time and effort invested in your review. We have modified the manuscript and package, and we feel it is stronger thanks to your feedback. Please see below for point-by-point responses.

In this article, the authors provide a valuable R package that implements a unified data access API for single-cell object formats such as `anndata` objects (Python) and `Seurat`/`SingleCellExperiment` objects (R). The SCUBA API (implemented in R) keeps the original single-cell objects unmodified and uses three data accession methods, namely `FetchData`, `fetch_metadata` and `fetch_reduction`. To more easily compare pre-processed data stored in common single-cell data formats, such as `AnnData`, `SingleCellExperiment` (SCE) and `Seurat`, the authors fetch metadata from cells and genes from the original object and create a `data.frame` R object with the requested features. Using the `data.frame` object, downstream plotting functions are provided to create plots in the `ggplot2` R package grammar.

Code review
Given the dependency issues related to the base R version and the Python version, I would encourage the authors to submit their package to either the CRAN or the Bioconductor repository so that R package community standards are tested in a continuous integration setup. The authors should try to alter their code to pass at least the R-CMD-check without warnings and errors.

Thank you for this valuable feedback. While we do not intend to submit this package to CRAN or Bioconductor, we recognize the value of continuous integration and have ensured that the code now passes the R-CMD-check without warnings or errors.

In the "Use cases" section, the authors claim that "Additional vignettes are provided on the SCUBA GitHub page", which is not true at the time of writing this review. Please add the mentioned vignettes to the GitHub page.

We have added a vignette with examples of all major SCUBA functions (see the comment below).

The provided R functions should contain runnable examples. Building a vignette would help present the basic functions of the R package to the community and provide more background information, if needed. I would suggest that the authors include a GitHub workflow to build the R package documentation and vignettes and convert them into a website, e.g., with the `pkgdown` package or a similar workflow.

We appreciate this feedback, and have improved our documentation to address these important observations. Specifically, we have added a pkgdown site at https://amc-heme.github.io/SCUBA/. Here, we have added the “User Guide” vignette to the “Articles” tab, which contains a walkthrough of key functions in the SCUBA package for the access, exploration, and visualization of single-cell datasets. The vignette shows usage examples for all three object types currently supported by SCUBA. In addition to adding the vignette, we updated the function documentation to improve clarity and provide usage examples.

Software

I was able to install the software and run the examples given on the main GitHub page, despite the need to pre-load the `Seurat` R package. As indicated in the Minor comments section, the three functions `reduction_names()`, `assay_names()` and `features_in_assay()` have been removed from the repository, and the corresponding example code block should also be removed from the GitHub page.

Thanks for letting us know about this. We have removed reduction_names() and assay_names() from the documentation to avoid this confusion. We added features_in_assay() back to the package, and documentation on the usage of this function is now on our pkgdown site, in the “User Guide” article.

The Rmarkdown files on the GitHub page https://github.com/amc-heme/SCUBA_Manuscript have been inspected and partially run. I have not re-created the subset data sets; however, the code base seems to be fine.

We have double-checked the code in amc-heme/SCUBA_Manuscript internally for consistent outputs. Please let us know if you notice further issues with the code in the manuscript repo.

Major comments

If many genes are requested by the `FetchData` function, creating a single `data.frame` has the downside of not benefiting from sparse data structures, leading to a high memory footprint. Please implement some checks to prevent a user from requesting more gene features than the local resources can handle.

Thank you for this observation. We don’t feel that it is necessary to prevent users from doing this, but we have added a warning to users (shown below) in this scenario. The warning appears when 1000 or more genes are requested.

"A very large number of features was requested (<> features). fetch_data is not intended to be used with feature queries of this length. Data is returned in a dense format, so the memory usage of the output may be very large. Also, this query may take a while to complete."
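
For illustration, a threshold check of this kind could be sketched as follows. The helper name `warn_if_large_query`, its arguments, and the size estimate are assumptions made for this example and are not taken from the SCUBA source; only the threshold of 1000 features comes from the response above. The estimate assumes dense double-precision storage at 8 bytes per value.

```r
# Hypothetical helper: warns when a feature query is large enough that a
# dense data.frame result may be costly in memory.
warn_if_large_query <- function(features, n_cells, threshold = 1000) {
  n_features <- length(features)
  # Dense double-precision storage: 8 bytes per value.
  dense_bytes <- as.numeric(n_features) * n_cells * 8
  if (n_features >= threshold) {
    warning(
      "A very large number of features was requested (", n_features,
      " features). Estimated dense size: ",
      format(dense_bytes / 1024^2, digits = 3), " MiB.",
      call. = FALSE
    )
  }
  invisible(dense_bytes)
}

# 2,000 features across 50,000 cells: 2000 * 50000 * 8 bytes = 8e8 bytes.
big_query <- paste0("gene", seq_len(2000))
size_bytes <- suppressWarnings(warn_if_large_query(big_query, n_cells = 50000))
```

A warning rather than an error leaves the decision with the user, which matches the position taken in the response.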

Please remove "manually" from the sentence: "Currently, the most effective solution is to manually convert between object classes." Here, "manually" implies a high programming effort, which in my opinion is not the case, given existing conversion tools such as `SeuratWrapper`, `zellkonverter`, `SCEasy Converter`, and `scDIOR`. Please add the corresponding literature for the mentioned tools.

Thank you for pointing this out. The sentence was changed as follows in the updated manuscript:

"All major single-cell analysis packages implement conversion functions, and third-party packages such as sceasy, zellkonverter, and scDIOR are specifically designed for these conversions. Additionally, SeuratWrappers implements conversion functions that allow several otherwise-incompatible single-cell analysis packages to be used on Seurat objects."

Citations for the corresponding literature have also been added to the revised manuscript.

In your R DESCRIPTION file, you list the `reticulate` and `anndata` R packages under Suggests; however, for your API to be fully functional, these packages need to be imported. Please change accordingly.

We appreciate this observation. While we prefer not to require users to install anndata and reticulate if they are not using anndata objects, we have added a conditional statement in all exported AnnDataR6 methods to check if these packages are installed, and throw an error directing users to install these packages if they are not.
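
A conditional check of this kind is commonly written with base R's `requireNamespace()`. The sketch below is an assumption about how such a guard might look, not SCUBA's actual code; the helper name `check_optional_packages` is hypothetical. In the scenario described above, the package names passed in would be `"anndata"` and `"reticulate"`.

```r
# Hypothetical guard for packages listed under Suggests: stops with an
# informative message if any of them is not installed.
check_optional_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing_pkgs <- pkgs[!installed]
  if (length(missing_pkgs) > 0) {
    stop(
      "The following packages are required for this object type ",
      "but are not installed: ",
      paste(missing_pkgs, collapse = ", "),
      ". Please install them, e.g. with install.packages().",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Demonstration with a package that ships with every R installation.
check_optional_packages("stats")
```

Calling the guard at the top of each class-specific method keeps the optional dependencies out of Imports while still failing early with a clear message.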

Minor comments
Please change the DESCRIPTION and add the corresponding author/creators in the github repository.

This change was made.

Please alter the examples on your initial GitHub page so that they work as expected. For example, the `FetchData` function for Seurat objects is not pre-loaded; one first needs to load Seurat with `library(Seurat)`.

To keep users from having to import Seurat to use our FetchData methods, we moved them to a new generic defined within our package (fetch_data).

Please remove the following example part from your GitHub page, since the functions `reduction_names()`, `assay_names()` and `features_in_assay()` have been removed and no longer work.

All three functions have been removed from the README on our GitHub page. We added `features_in_assay()` back to SCUBA, and have added documentation for the function in our user guide vignette.

Please update the corresponding `fetch_...` function main description and param field about what object types are supported, since the `AnnDataR6` class in most cases is supported but not mentioned.

This has been corrected. All fetch_* functions and all functions that take an object as a parameter now mention all supported object types. The description text for the `object` parameter is now also consistent across all functions.

Please rephrase "Single-cell analysis can be inaccessible to bench scientists due to programming experience required, ...". In my opinion, this sentence is formulated too harshly, as it is possible, e.g., to convert and analyze single-cell objects with GUI-based solutions such as usegalaxy.eu and the SCEasy Converter plugin.

We have modified this sentence to the following: “Single-cell analysis requires programming experience for full customization of analysis and visualization, and different object formats make it even more difficult for biologists to analyze data”.

Please provide a working example showing how a default user without admin privileges can set up and use `reticulate` in R for the use case described here (accessing an `AnnData` object), including how to install the prerequisite Python packages. The authors refer to the main `reticulate` documentation; however, a working code snippet would be beneficial for scientists not fluent in both programming languages.

Thanks for this suggestion. We have added this to the README, under the section “Additional Installation for anndata Objects”.
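
The README section itself is not reproduced here. As a rough sketch of what a user-level setup can look like with `reticulate`'s conda helpers, consider the function below. The function name `setup_anndata_env`, the environment name, and the file path in the usage note are placeholders chosen for this example, and the steps are an assumption rather than the README's instructions. Miniconda is installed into the user's home directory, so no admin rights should be needed.

```r
# Hypothetical one-time setup for reading anndata objects from R.
setup_anndata_env <- function(envname = "scuba-anndata") {
  if (!requireNamespace("reticulate", quietly = TRUE)) {
    stop("Install reticulate first: install.packages('reticulate')")
  }
  # Install Miniconda under the user's home directory if it is not present.
  if (!dir.exists(reticulate::miniconda_path())) {
    reticulate::install_miniconda()
  }
  # Create the environment once, then add the Python anndata package.
  if (!reticulate::condaenv_exists(envname)) {
    reticulate::conda_create(envname)
    reticulate::conda_install(envname, packages = "anndata")
  }
  # Point the current R session at this environment.
  reticulate::use_condaenv(envname, required = TRUE)
  invisible(envname)
}

# Usage (not run here; requires network access and the anndata R package):
# setup_anndata_env()
# adata <- anndata::read_h5ad("path/to/object.h5ad")  # placeholder path
```

Wrapping the steps in a function keeps the setup repeatable and skips the parts that have already been done.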

Please remove "for all three object types" from Figure 5A "Common plotting script for all three object types", since the plotting function is applied to the constructed intermediate `data.frame` object.

This has been changed. The sentence now reads “Example script for visualizing expression of a gene by cluster in a density plot.”

Figure 6, please change the code example to use the R default pipe operator `|>` instead of the `magrittr` pipe operator `%>%` so that one would not need to pre-load the corresponding libraries.

This change has been made. Please see Figure 6 in the revised submission.
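
To illustrate the reviewer's point (this is not the code from Figure 6), the native pipe works on a plain `data.frame` without any `library()` call in R 4.1 or later. The data frame and its columns below are toy values invented for this example.

```r
# Toy stand-in for a data.frame of per-cell values (hypothetical columns).
cells <- data.frame(
  cluster = c("A", "A", "B", "B", "B"),
  CD34 = c(0.0, 1.2, 2.5, 0.3, 3.1)
)

# Native pipe (R >= 4.1): the left-hand side becomes the first argument.
n_expressing <- cells |>
  subset(CD34 > 0) |>
  nrow()

# Mean expression per cluster among expressing cells.
mean_by_cluster <- cells |>
  subset(CD34 > 0) |>
  with(tapply(CD34, cluster, mean))
```

The same chain written with `%>%` would require loading `magrittr` or a package that re-exports it.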


Competing Interests

No competing interests were disclosed.


Reviewer Report


17 Dec 2024 | for Version 1

Huamei Li, Nanjing University, Nanjing, Jiangsu, China


Approved

Is the rationale for developing the new software tool clearly explained?

Yes
Is the description of the software tool technically sound?

Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Partly
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Yes

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Bioinformatics, Immunogenetics, Single-cell and spatial technologies

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.


Responses (1)

Author Response

09 Aug 2025

William Showers, Division of Hematology, University of Colorado Anschutz Medical Campus, Aurora, USA

Thank you for the time and effort invested in your review. We have modified the manuscript and package, and we feel it is stronger thanks to your feedback. Please see below for point-by-point responses.

Showers et al. developed the SCUBA R package as a unified API for single-cell data analysis. In single-cell and spatial transcriptomics studies, data format conversion between Seurat, SingleCellExperiment, and AnnData often presents challenges, leading to obstacles in downstream analyses. While several tools have been developed to address data format conversion issues, SCUBA provides a unified and user-friendly interface for handling multiple formats and simplifies single-cell data processing workflows, making it a promising tool for practical applications. However, several concerns regarding SCUBA require further clarification:

1) The README file available at https://github.com/amc-heme/SCUBA/tree/main does not include examples demonstrating how SCUBA reads Seurat, SingleCellExperiment, and AnnData formats from local files. Additionally, it is unclear whether SCUBA addresses compatibility issues arising from different versions of these formats, such as Seurat 4.0+ versus Seurat 5.0+. Can SCUBA facilitate seamless data format conversion and overcome challenges posed by version discrepancies?

Thank you for this observation. We have added new documentation via a pkgdown website (https://amc-heme.github.io/SCUBA/). The “User Guide” article contains a note on how each object class is loaded. SCUBA is compatible with both Seurat v4 and Seurat v5 objects. We have added a note on this to the README.

2) While SCUBA provides application examples for single-cell datasets, it remains unclear whether the tool can be effectively extended to spatial transcriptomics datasets generated by different platforms. Clarification on its applicability to spatial datasets would be beneficial.

SCUBA partially supports spatial datasets. Aspects of spatial datasets that can be expressed as a counts matrix can be loaded into SCUBA, but we do not yet support the loading of image files. We have added a note on this to the README.

3) To enhance visualization capabilities, it is recommended that SCUBA include optional smoothing methods for displaying the expression distribution of feature genes. This would allow for better identification of regions with concentrated expression of specific features.

Thank you for the suggestion. While we agree that this is a valuable approach, we feel this is out of scope for the current tool, and welcome users to apply their own smoothing methods in plotting scripts based on SCUBA.

4) The SCUBA documentation could be further improved by providing more detailed explanations of the functions, their usage, and intended purposes. Additionally, including a comprehensive analysis example that integrates both single-cell and spatial transcriptomics data would help illustrate the full workflow and practical utility of SCUBA.

Thank you for the suggestion. We have added the suggested documentation to the README and the user guide vignette on the pkgdown website. We have also updated function documentation in SCUBA v.1.1.0.


Competing Interests

We disclose no competing interests.

Alongside their report, reviewers assign a status to the article:

Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested

Approved with reservations - a number of small changes, or sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.

Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions

