Background

F1000Research

2046-1402

F1000 Research Limited

London, UK

10.12688/f1000research.173901.1

Method Article

Articles

From Quality Control to Visualization: scToolkit for Reproducible Single-Cell RNA-Seq

[version 1; peer review: 3 approved with reservations]

Krolla

Christopher

Conceptualization Data Curation Investigation Methodology Resources Software Visualization Writing – Original Draft Preparation Writing – Review & Editing 1 Das

Tonmoy

Conceptualization Data Curation Methodology Software Visualization Writing – Review & Editing https://orcid.org/0000-0001-7347-1573 1 2 Metzger

Ariane

Investigation Writing – Review & Editing 1 2 Boerries

Melanie

Funding Acquisition Resources Writing – Review & Editing https://orcid.org/0000-0002-8900-503X 1 3 Chakraborty

Sajib

Writing – Original Draft Preparation Writing – Review & Editing https://orcid.org/0000-0002-8900-503X a 1 Andrieux

Geoffroy

Funding Acquisition Supervision Writing – Review & Editing https://orcid.org/0000-0002-5389-9481 b 1 1Institut für Medizinische Bioinformatik und Systemmedizin, Albert-Ludwigs-Universitat Freiburg Medizinische Fakultat, Freiburg, Baden-Württemberg, Germany 2Albert-Ludwigs-Universitat Freiburg Fakultat fur Biologie, Freiburg, Baden-Württemberg, Germany 3German Cancer Consortium (DKTK), Partner site Freiburg, a partnership between DKFZ and Medical Center, Albert-Ludwigs-Universitat Freiburg, Freiburg, Baden-Württemberg, Germany

a sajib.chakraborty@uniklinik-freiburg.de b geoffroy.andrieux@uniklinik-freiburg.de

No competing interests were disclosed.

28 1 2026

2026

129

19 1 2026

2026

This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Single-cell RNA sequencing (scRNA-seq) provides high-resolution insights into cellular states, yet its analytical workflows are often fragmented and inconsistently documented. Challenges such as technical noise, dropout events, batch effects, and insufficient reporting of method parameters can hinder reproducibility and the interpretation of biological signals.

Methods

We developed scToolkit, a workflow built on the Scanpy framework to streamline core steps in scRNA-seq analysis. The toolkit implements quality control, preprocessing, clustering, and multiple downstream analytical modules within a unified Python interface. It also incorporates external resources, including curated cell-marker databases and pathway collections, and employs vectorization, just-in-time compilation, parallelization, and optional GPU acceleration to enhance performance. Comprehensive visualization tools support exploratory and comparative analyses.

Results

We applied scToolkit to a dataset of immune cells from blood and tonsil tissues to illustrate its functionality. The workflow produced interpretable clustering results, informative differential expression analyses, and clear visual comparisons across tissues and conditions. Enhanced plotting modules facilitated the examination of transcriptional differences across clusters and within individual clusters under matched experimental conditions.

Conclusions

scToolkit provides an accessible and scalable solution for scRNA-seq data analysis, helping users perform rigorous and reproducible workflows with reduced technical burden. Its integration of curated external resources and high-performance computation improves analytical consistency and usability. The workflow is openly available on GitHub, accompanied by documentation and tutorials to support broad adoption.

scRNA-seq Scanpy workflow Quality Control Normalization Clustering Database Cell-Type Annotation Visualization

Bundesministerium für Bildung und Forschung

PM4Onco–FKZ(01ZZ2322A)

EkoEstMed–FKZ(01ZZ2015)

Deutsche Forschungsgemeinschaft

CRC1160(256073931)

CRC/TRR167(259373024)

CRC1453(431984000)

CRC1479(441891347)

TRR359(491676693)

TRR353(471011418)

FOR5476UcarE(493802833)

Deutsche Forschungsgemeinschaft CRC1160 (256073931), CRC/TRR167 (259373024), CRC1453 (431984000), CRC1479 (441891347), TRR359 (491676693), TRR353 (471011418), FOR 5476 UcarE (493802833) Bundesministerium für Bildung und Forschung PM4Onco–FKZ (01ZZ2322A), EkoEstMed–FKZ (01ZZ2015)

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Introduction

Single-cell RNA sequencing (scRNA-seq) has significantly advanced our ability to profile and characterize cellular heterogeneity within biological systems. Its analytical landscape remains burdened by an excess of redundant computational tools that complicate reproducibility and hinder methodological clarity. Many workflows rely on extensive data reshaping through integration, imputation, or variance-stabilization procedures that lack validated noise models and are often applied to compensate for suboptimal hyperparameter tuning. Clustering strategies further disregard the hierarchical structure of cellular organization, obscuring lineage relationships and biologically meaningful subpopulations.

The field is characterized by an overwhelming expansion of computational tools, ¹ many of which duplicate existing functionality rather than advancing methodology, resulting in a fragmented ecosystem that hinders comparability and reproducibility. Integration across samples, batches, studies, or datasets should be attempted only after hyperparameters have been systematically evaluated to minimize technical variation across datasets. When strong batch effects persist, their underlying sources must be identified and quantified. Variation introduced by library preparation, sequencing chemistry, or cell handling should be addressed both experimentally and analytically rather than corrected algorithmically. Imputation and noise-reduction methods can suppress apparent noise but often distort the original signal. The statistical nature of scRNA-seq noise remains poorly defined ² and current approaches lack validated models to separate technical from biological variation. ³ Recent studies show that the perceived improvement in signal quality may instead reflect amplified noise. ^{3,
4} The limited understanding of the statistical properties of scRNA-seq data is also evident in simulation frameworks. Benchmarking studies typically measure similarity between real and simulated data, but not their differences, ⁵ leading to evaluations that favor methods reproducing feature-extracted, algorithmic reconstructions of the input, rather than generating biologically meaningful variation.

Many clustering strategies often ignore the hierarchical structure of cellular organization. Major lineages should first be separated at coarse resolution, followed by finer subdivision to resolve transcriptionally distinct subpopulations. Because Leiden, the standard method for single-cell RNA-seq clustering, operates at fixed granularity, ⁶ without sequential subclustering, cells from distinct lineages are compared against each other in one-versus-rest differential analyses, obscuring lineage relationships and confounding interpretation.

Inferring genomic alterations from RNA-based data is conceptually fragile, as transcriptional variation reflects regulation and noise rather than true genomic state. Reference-free methods such as CopyKAT ⁷ and SCEVAN ⁸ were published without evaluation on healthy tissue, inevitably leading to false-positive CNV calls. ⁹ Reference-based tools such as InferCNV ( https://github.com/broadinstitute/infercnv) and sciCNV ¹⁰ must include standardized, well-characterized reference panels validated across multiple tissues to ensure reproducibility and reduce context-dependent bias in CNV detection. Reliable downstream analyses require high-quality data ( https://kb.10xgenomics.com/hc/en-us/articles/ 115002022743-What-is-the-recommended-sequencing-depth-for-Single-Cell-3-and-5-Gene-Expression-libraries ) typically exceeding 50,000 reads per cell, with >80% sequencing saturation, a standard rarely met in routine single-cell experiments such as Human Tonsil Atlas atlas. ¹¹

Trajectory inference and RNA velocity analyses are particularly unstable. Preprocessing differences and quantification inconsistencies can produce divergent or even contradictory trajectories, and structure may emerge even from randomized data, ¹² turning technical noise into biological narratives and underscoring the need for standardized preprocessing and validation.

All user-defined settings, including filtering thresholds and model hyperparameters that directly affect interpretation, are treated as hyperparameters because they determine analytical outcomes and reproducibility. Hyperparameter tuning is frequently underestimated, and many studies fail to document how these values are selected or adjusted. ¹³ These choices have a direct impact on normalization, integration, and clustering, yet are often applied ad hoc without systematic evaluation. Transparent and reproducible parameter management is essential to ensure that results reflect biological reality rather than computational or technical artifacts. Improper configurations can generate false biomarkers and misguide clinical decisions with potentially harmful consequences.

To overcome these limitations, we introduce ScToolkit, a consolidated framework that expands the capabilities of Scanpy ¹⁴ without fragmenting the ecosystem further. It provides a coherent structure linking all major analysis steps, from quality control and normalization to clustering and downstream interpretation. All analytical choices are stored in a unified configuration within the AnnData object, ensuring transparent, comparable, and reproducible workflows. The framework integrates external biological reference resources and supports vectorization, JIT (Just-In-Time) compilation, and GPU acceleration for efficient large-scale analysis. Together, these features create a unified environment that strengthens analytical precision, preserves biological interpretability, and upholds good scientific practice.

Methods Implementation

Data initialization with setup_adata

In contrast to default Scanpy, where preprocessing parameters must be specified at each call and not centrally stored, ScToolkit introduces a standardized initialization step via the setup_adata function. This augments the AnnData object with: 1.

Configuration dictionary (adata.uns[“config”])

A unified registry covering quality control, normalization, clustering, plotting, and downstream analyses parameters. It encodes both Scanpy arguments with proper defaults and project-specific arguments, including curated marker-gene handling, organism-specific geneset handling, and file output policies. GPU acceleration is integrated into the config, enabling automatic switching between CPU and GPU backends for PCA, nearest-neighbor search, and clustering.

Statistics registry (adata.uns[“stats”])

A bookkeeping structure recording which preprocessing steps have been executed, preventing redundant or conflicting operations across the pipeline.

Existing configurations and statistics are preserved unless explicitly overwritten, facilitating reproducibility and iterative refinement of analyses. By consolidating the config, state-tracking, and GPU initialization into a single entry point, setup_adata transforms a raw AnnData object into a fully contextualized analysis container. This contrasts with unmodified Scanpy, where users must manually manage parameters, track preprocessing steps, and configure hardware acceleration. From now on if config is stated, it refers to adata.uns[‘config’].

Quality control preprocessing with all_qc_preprocessing

After initialization, ScToolkit provides a dedicated entry point for systematic quality control via the all_qc_preprocessing function. This standardizes QC annotation, reduces defective cell likelihood, and ensures reproducibility and interoparability across datasets.

Key operations include: •

Gene family annotation using curated organism-specific definitions for mitochondrial, ribosomal, hemoglobins, pseudogenes, and immunoglobulin/T-cell receptor loci. Gene families are projected into adata.var as boolean flags for downstream filtering and visualization. Databases bundled with the package (Ensembl-GTF-derived, ¹⁵ HGNC/MGI ^{16,
17}) curated categories, and housekeeping gene sets ¹⁸) extend Scanpy’s defaults by systematically tracking additional biologically relevant gene families.

•

QC metric calculation adapts Scanpy’s standard metrics by removing redundant keys, simplifying names, and casting percentages to reduced precision (float16) for memory efficiency. All QC keys are tracked in the config for consistent reuse across analyses and reproducibility.

•

Custom metrics provide additional sensitivity, including n_unique (spread of expression values per cell), max_count, and top-ratio scores (top_5_ratio, top_10_ratio) to detect overexpression bias. These capture low-quality transcriptomes that would pass default thresholds.

•

Filtering and tagging apply thresholds specified in the config (pp.filter_qc_var_obs, pp.filter_qc_var_var). Failures can be flagged or removed, with counts and percentages summarized in adata.uns[“stats”]. Plots include threshold overlays and clipping of extreme outliers, improving robustness compared to Scanpy defaults.

•

Config tracking ensures reproducibility by centralizing all QC-related keys, thresholds, and gene family definitions in adata.uns[“config”]. This provides transparency and reproducibility absent from default Scanpy, where QC keys are scattered and not formally registered.

Pre-clustering normalization with Run_all_prep_steps_clustering

Following QC, Run_all_prep_steps_clustering provides centralized preprocessing and normalization. This integrates normalization, log transformation, scaling, and gene selection into a reproducible pipeline controlled entirely by the config.

Extensions beyond Scanpy include: •

Mixture of Regex-based and gene-list gene exclusion and centralized control of preprocessing steps.

•

Modular storage of intermediate states in adata.layers, enabling reproducibility and GPU/CPU branching.

•

Explicit logging of executed preprocessors in adata.uns[“stats”][“preprocessors_ran”], ensuring idempotence and transparency.

By consolidating these steps, the function guarantees consistent inputs for clustering, avoiding redundant calls and user errors.

Clustering orchestration with cluster_SC_scanpy_like

cluster_SC_scanpy_like structures PCA, neighborhood graph construction, Leiden clustering, and UMAP embedding in a single highly configurable function, providing a failsafe and streamlined workflow beyond Scanpy’s separate steps. Each step can be perfomed on a CPU or GPU.

Enhancements include: •

PCA with variance-based selection of informative components, capped by data rank, to estimate the appropriate number of components, n_pcs.

•

Neighbor graph construction using the selected n_pcs, aligned with explained variance contribution, avoiding arbitrary choices such as manual elbow selection.

•

search_for_leiden_resolution and search_for_neighbor_params systematically vary resolution, n_pcs, and n_neighbors to identify parameter sets that yield stable and biologically meaningful cluster separation, returning the corresponding UMAPs with marker overlays, density plots, and separation metrics.

This ensures reproducible and biologically meaningful clustering without manual trial-and-error.

Downstream analysis and visualization

run_downstream integrates core post-clustering functionality: •

Differential expression (DEG) with parallelized Wilcoxon tests (standard z-score approximation and JIT-compiled permutation-based variants), and group-specific DEG handling.

•

Accurate log-fold change calculation per group. Unlike the simplified approach implemented in Scanpy, this method ensures formal correctness without compromising performance.

•

Export utilities generating per-cluster, per-direction CSVs with both filtered and unfiltered genes, annotated against MSigDB modules and user-defined gene sets.

•

Marker and gene set handling with bi-directional conversion between dictionaries and DataFrames, autosplitting of large sets for plotting, and prioritized ranking by highly variable gene (HVG) status.

•

Automated visualization conditioned on the config and saved into structured folder hierarchies, covering UMAP overlays in discrete and continuos, dotplots, violin plots, heatmaps, and ranked gene plots.

These downstream steps extend Scanpy’s minimal plotting and DEG support into a comprehensive, automated framework for reproducible structured interpretation.

Operation

ScToolkit is distributed as a Docker container bundling both Python and R environments to ensure reproducibility and straightforward deployment across platforms as far as possible for GPU usage. For users preferring local installation, the GitHub repository provides multiple conda environment definitions (.yml files) tailored to different use cases: •

Default environment: GPU-enabled with RAPIDS and CUDA libraries preconfigured.

•

No-GPU environment: Fallback for systems without CUDA support, such as recent macOS devices or machines limited to onboard GPUs.

•

Complete maintainer environment for development and continuous integration and documentation updating.

Documentation is generated with Sphinx and kept synchronized with the source code, ensuring that both API references and tutorials remain current. The repository includes extensive Jupyter notebooks that demonstrate full end-to-end analyses, from raw data initialization to clustering and downstream interpretation. Additional notebooks showcase advanced features such as curated gene family handling, GPU acceleration, and automated DEG export.

This dual strategy, providing both containerized execution and conda-based environments, ensures accessibility for beginners while maintaining flexibility for advanced users and developers. Together with reproducible notebooks, this structure guarantees that ScToolkit can be deployed consistently in diverse computational settings, from laptops to GPU clusters.

Use cases: Integration of immune cell scRNA-seq datasets from blood and tonsil reveals tissue-specific transcriptomic signatures Preprocessing and quality control

For use case analysis, we accessed publicly available single-cell RNA sequencing (scRNA-seq) resources. Specifically, we retrieved datasets from the DISCO database ( https://disco.bii.a-star.edu.sg/) ¹⁹ including scRNA-seq profiles from human tonsil (n = 92,114 cells) and peripheral blood (n = 169,686 cells). In addition, we obtained a tonsil scRNA-seq dataset from the Tonsil Atlas database ¹¹ encopassing 462,352 cells. The full use-case analysis described below is available in the Supplementary Data as a PDF export of the corresponding Jupyter Notebooks.

Next, we applied a rigorous quality control workflow to isolate high-quality single-cell transcriptomes and used violin plots to assess data quality across conditions and tissues ( Figure 1A). Cells were excluded if they expressed fewer than 500 genes, fewer than 700 total counts, or more than 18,000 counts. To minimize the inclusion of low-complexity libraries, cells with fewer than 10 unique genes were removed. Similarly cells with more than 5,000 genes were removed to prevent doublets. Cells with >10% mitochondrial content were excluded to reduce bias from dying or stressed cells. In addition, cells with <1.5% ribosomal transcripts or >1% HGNC-annotated pseudogene expression were filtered out ( Figure 1A). At the gene level, only features expressed in at least 2,000 cells, with a minimum of seven unique counts, a minimum of 2,000 total counts, and at least 10 maximum counts across cells, were retained. Beyond the standard parameters (n_genes and n_counts), we included n_unique as a measure of distributional spread, providing a more general indicator of transcript diversity than mean- or variance-based metrics that assume normality. The percentages of housekeeping (pct_HK_gtex), intergenic (pct_GTF_IG_genes), and pseudogene (pct_HGNC_pseudogene) expression offer complementary insights into sample composition and annotation fidelity. The displayed metrics represent a selected subset of all available quality control measures, shown here to demonstrate the extended functionality and interpretability of the workflow. In Figure 1A, only the top 10% quantile range of each metric is displayed, emphasizing the high end of the distributions where duplicate or outlier cells typically accumulate. This focused view enables refined inspection of potential overrepresented or low-complexity transcriptomes that define the upper QC thresholds. The shown metrics represent a subset of all available QC measures, selected to demonstrate the workflow’s extended functionality and interpretability.

Figure 1. Quality control assessment and database-driven refinement of Blood and Tonsil Cells.

(A) Violin plots generated with plot_qc_violins, showing standard and extended quality control metrics across groups for setting up upper QC thresholds. Each violin represents the distribution of n_genes and n_counts per cell, alongside n_unique as a measure of transcript count spread. Additional panels display pct_HK_gtex (housekeeping genes from GTEx), (immunoglobulin genes from the Ensembl GTF), and pct_HGNC_pseudogene (pseudogenes from HGNC). The dashed red line marks the applied filtering threshold, and the function displays only the defined top quantile range (here, the top 10%) to provide a focused and precise view for setting refined upper thresholds.

(B) Bar plot summarizing cross-database overlap of B-cell gene sets against non B-cell gene-sets. The first bar represents our consensus B-cell signature derived from high-recurrence markers across databases, while the remaining bars correspond to the individual B-cell gene sets used in its construction. Bar heights indicate the summed Jaccard index of each gene set against all non–B-cell gene sets in the reference databases, quantifying cross-category similarity and shown on the left legend. Red dots, connected by a line, denote the total number of genes in each respective gene set, with the corresponding scale shown on the right legend.

The results reveal that parameters such as blood of solid tumor patient (BOSTP), non-tumor disease tissue (NTDT), and Tonsil removed during surgery for oropharyngeal papillomatosis (TRDSFOP) show generally lower levels across n_genes, n_counts, and n_unique, lowering the expectation of doublets within these samples. In contrast, the consistent percent of house-keeping genes (pct_HK_gtex) values across all conditions indicate viable and transcriptionally active cells, suggesting comparable baseline activity among cell types. The pct_GTF_IG_genes distributions show a distinct multimodal pattern, where the upper mode indicates the presence of plasma cells. Finally, elevated pct_HGNC_pseudogene values in BONTDP, BOSTP, NTDT, and Disco_blood_normal flag the presence of subsets of low-quality or degraded cells. The remaining groups display lower pseudogene percentages consistent with higher-quality transcriptomic profiles.

To explore the initial annotation of cell clusters, we utilized a curated library of cell type -specific gene signatures from four complementary resources: PanglaoDB, ²⁰ CellMarker, ²¹ SCType, ²² and CellTypist (v2). ²³ To increase tissue relevance, each database was then subset signatures associated with tonsil and blood, ensuring that the resulting markers reflected the diversity of resident immune cell populations within these compartments. This process demonstrates the database integration and refinement functionality of the workflow, we used three core functions: get_ref_db(), search_database(), and get_element_and_counts_with_positional_sum(). First, get_ref_db() collects all available cell type-specific genesets from the supported databases (PanglaoDB, CellTypist, CellMarker, and SCType). Next, search_database() extracts the subset of genesets matching a specified cell type, defined by the user using regular expression.

Notably, substantial heterogeneity was observed across gene-signatures for the same annotated cell type. To address this, we generated consensus signatures by integrating overlapping features. We calculated the frequency of occurrence of each gene across signatures for a given cell type, ranked genes by their frequency, and defined the top 20 most recurrent genes as the consensus signature ( Figure 1B). Specifically, we used get_element_and_counts_with_positional_sum() to rank all genes by their recurrence across the retrieved genesets, generating consensus signature by taking the top 20 genes. As an exampler, we retrieved cross-database overlap of B-cell gene sets and subsequently generated their consensus signature. For each B cell signature from the databases and the B cell consensus signature, the summed Jaccard index was calculated against all nonB cell gene sets in the reference databases. Results showed that our B cell consensus signature exhibits a minimum similarity (relatively low Jaccard index) to nonB cell signatures compared to the database redived B cell signatures ( Figure 1B). This prevents the selection of unspecific gene-set signature from many existing signature ressources. The refined scToolkit reference represents a consensus-based, high-confidence collection of cell-type-specific gene signatures, illustrating the workflow’s ability to generalize this procedure to any cell-type-specific refinement task.

Clustering and cell type annotation of immune cells from blood and tonsil

Following quality control, we performed unsupervised clustering on the combined dataset using the Leiden algorithm. Cells were projected into UMAP space, where coloring by tissue origin revealed intermixed clustering of cells from Blood and Tonsil ( Figure 2A). The UMAP density distributions show a continuous embedding without visible batch effects between blood- and tonsil-derived cells. Cells from both tissues occupy overlapping yet structured regions, indicating comparable transcriptomic profiles and proper integration of datasets ( Figure 2A). To assign cell identities, we applied single-cell level gene set enrichment analysis using over-representation analysis algorithm (ORA) with our consensus gene signatures. Each cell was assigned the identity for which its transcriptome showed the maximum ORA enrichment score (Idxmax). The major cell types identified included B cells, dendritic cells (DCs), granulocytes, macrophages, NK cells, T cells, and plasma cells ( Figure 2B). To validate these annotations, we compared our results with reference annotations (referred to as ground-truth annotation from hereafter) provided by the DISCO and Tonsil Atlas databases ( Figure 2C). The idxmax ORA assignments align with the broad ground-truth annotation ( Figure 2C), demonstrating that clusters capture major immune and myeloid populations while maintaining distinct lineage boundaries. Notably, cell types of similar identity cluster in close proximity within the joint embedding. This confirms that the refinement and preprocessing preserve biological relationships rather than dataset origin. For instance, cells belonging to the same category (e.g., B cells or T cells) cluster together regardless of tissue origin. This highlights the robustness and reliability of our workflow, demonstrating its ability to successfully integrate datasets from different tissues without the need for specialized batch-correction algorithms ( Figure 2A-B).

Figure 2. Comparison of cell type refinement, reference annotations, and clustering results.

(A) UMAP density maps generated with plot_embedding_density(), showing 0–1 scaled cell densities estimated using FFT-based kernel density estimation (KDEpy). The upper panel represents cells of blood origin, the lower those from tonsil tissue.

(B) UMAP projection colored by the idxmax categories derived from over-representation analysis (ORA) of the cell type refinement gene sets. Each cell is assigned to the cell type with the highest enrichment score if the adjusted p-value is below the defined significance threshold, illustrating clear segregation of major immune populations and preservation of myeloid lineage relationships.

(C) Corresponding UMAP annotated with the broad reference cell type labels from DISCO and the Tonsil Atlas, where closely related subtypes (e.g., B-cell subclasses) were combined into generalized categories.

(D) Heatmap of Jaccard similarity scores between refined (idxmax ORA) and reference annotations, generated with plot_jaccard_heatmap_cluster_comparison(). Values are 0–1 scaled, with highest overlaps along the diagonal, reflecting accurate matching between methods. Cross-lineage similarities (e.g., Monocyte–Macrophage, Monocyte–Granulocyte) indicate biologically plausible relationships.

(E) UMAP displaying Leiden clusters annotated by their predominant reference-derived cell types, using only the ground truth labels.

To quantitatively validate/support our annotations, we computed the Jaccard similarity index between our ORA-based assignments and the reference (ground-truth) labels. The resulting heatmap ( Figure 2D) quantitatively supports this correspondence, with high overlap values between identical or closely related cell types and minor off-diagonal similarities reflecting lineage proximity. Concordance as reflected by the Jaccard index ( Figure 2D), demonstrated the strongest agreement (Jaccard index = 1) for the most abundant populations, including B cells, T/NK cells, granulocytes, plasma cells, DCs, and macrophages. Of note, in ground-truth annotation T and NK cells were combined into a single category, and kept separate in T and NK cells in Idxmax ORA annotations. Interestingly, cells classified as “unknown” within Idxmax ORA annotations, showed a distribution of Jaccard Scores across many ground-truth cell types with the highest score for T/NK cell categroy. To annotate the clusters we selected the predominant cell types of the ground truth ( Figure 2E). The Leiden clustering labeled with ground-truth cell types ( Figure 2E) further shows that clusters are composed of cells from both tissues and databases, confirming that integration was successful and not driven by dataset-specific biases.

Differential pathways overlap analysis between blood and tonsil

To investigate the functional roles of the annotated cell types, we performed a pathway enrichment analysis based on differentially expressed genes (DEGs). DEGs were first identified by comparing each cell type specific cluster against all remaining clusters and stringently filtered to ensure robustness. DEGs were retained only if expressed in at least 15% of cells in one of the compared cell type resolved cluster, with an adjusted p-value < 0.0001 and an absolute log ₂ fold change ≥ 0.5. Following filtering, DEGs are intersected with hallmark gene sets from MSigDB. For each pathway, the z-scores of Wilcoxon rank-test u-statistics from overlapping genes are summed and scaled to –1 to 1. This analysis revealed distinct pathway overlap patterns across cell types. Myeloid cell cluster showed strong overlap of inflammatory signaling pathways, including inflammatory response, TNF-α signaling, and complement activation ( Figure 3A). Notably, differences also emerged between T-cell subclusters where the C3: T-cell cluster showed selective overlap of Allograft Rejection and TNF-α signaling genes, which were absent in C1: T-cell cluster, suggesting functional specialization among T-cell subtypes ( Figure 3A).

Figure 3. Downstream analysis of differential expression and pathway activity.

(A) Heatmap of hallmark pathway activity across cell type-annotated clusters, generated from differentially expressed genes (DEGs) meeting stringent thresholds (adjusted p < 10-4,|log ₂FC|> 0.5, ≥15% of cells expressing in either group). DEGs were mapped to hallmark gene sets from MSigDB, and Wilcoxon rank test u-statistics were aggregated as summed z-scores, normalized between –1 and 1. Only pathways or clusters with at least one normalized score exceeding ±0.5 are displayed, emphasizing those with the strongest pathway activity changes across clusters. The heatmap summarizes pathway activity across clusters, highlighting dominant biological programs.

(B) Split dot plot summarizing differential expression results across cell type–annotated clusters. Each row represents a cluster and each column one of the top five differentially expressed genes. Dot color encodes the log ₂ fold-change between the cluster and all remaining cells, while dot size reflects the fraction of expressing cells in test cell type (left) and in any other (right). Both axes are hierarchically clustered to highlight shared expression patterns and relationships among clusters and marker genes.

(C) Split dot plot comparing differential expression between blood- and tonsil-derived cells within each cell type–annotated cluster. Layout and scaling follow panel (B). Dot color encodes the log ₂ fold-change between Blood and Tonsil in a particular cluster, while dot size reflects the fraction of expressing cells in Blood (left) and Tonsil (right). Rows represent clusters whereas columns represent the top ten differentially expressed genes per comparison, hierarchically clustered to reveal shared condition-specific regulation across clusters.

(D) Split violin plots for selected genes ( HLA-DRA , HLA-C , HLA-DQA1) across clusters, grouped by tissue of origin. Each violin is divided into halves representing blood (left) and tonsil (right) conditions; inner boxes mark medians and interquartile ranges, and points show individual cells.

Another visualization option is to show the top DEGs for each cluster as split dot plots, where the log2 fold change is color coded ( Figure 3B). The percent expression of DEGs is displayed as split circles, with the left half representing the expression of the cluster of interest and the right half representing expression of all other cells combined. Corroborating with earlier findings, DEG analysis revealed two transcriptionally distinct T cell populations. C3: T-cell cluster was characterized by high expression of canonical T cell receptor (TCR) components ( CD3E, CD3D, CD3G) together with IL32, consistent with a population of conventional, activated T-cells ( Figure 3B). In contrast, the C1: T-cell cluster lacked strong expression of CD3 genes and was instead enriched for signaling-related transcripts, including INPP4B (PI3K/AKT signaling), RASGRF2 (Ras/MAPK activation), and PLCL1 (calcium signaling regulation), as well as CATSPERB and OPRM1, suggesting a specialized, non-classical functional state ( Figure 3B). The two Bcell clusters (C0: B-cell and C4: B-cell) also showed differential marker experssion. C4: B-cell cluster express genes characteristic of mature B-cells, including B-cell receptor (BCR) components ( HLA-DRA , CD79B, CD79A), a B-cell surface antigen ( MS4A1), and a molecule involved in antigen presentation that’s also expressed in B cells ( CD74) ( Figure 3B). C0: B-cell cluster consists of genes related to B-cell activation, survival, and differentiation, with KHDRBS2 and TRIO involved in signaling, CDK14 and PRKCE playing roles in cell cycle and signaling respectively, and KCNQ5 being less specific to B-cells but also involved in cell processes ( Figure 3B).

Next, we investigated compartment-specific transcriptional differences within the same immune cell types. For this analysis, we selected clusters that contained a comparable number of cells from both blood and tonsil, including C2: Myeloid-cell, C3: T-cell, and C4: B-cell clusters. Differential expression analysis (DEG) was then performed between the two compartments (blood and tonsil) to identify genes driving tissue-specific adaptations within each cell type. DEG analysis revealed a distinct enrichment of genes associated with immune regulation and tissue adaptation in the tonsil ( Figure 3C). A comparative analysis of myeloid cell clusters from blood and tonsil revealed distinct transcriptional signatures. Blood myeloid cells were characterized by high expression of S100A4, S100A6, S100A8, and S100A9, genes, which are associated with inflammation, cytoskeletal dynamics, and acute immune activation ( Figure 3C). In contrast, tonsil myeloid cells showed elevated levels of HLA-DRA , HLA-DPB1, and HLA-DQA1, reflecting enhanced antigen-presenting capacity, together with increased expression of SERPINF1, and DNASE1L3, indicative of tissue adaptation and immune regulatory functions ( Figure 3C). Similarly, DEG analysis revealed distinct transcriptional features between blood and tonsillar T-cells. Compared with tonsillar T cells, blood-derived T cells showed higher expression of the interferon-stimulated genes IFITM1 and RARRES3, together with increased AES, PTPRCAP, and SEPT7, indicative of an interferon-responsive and activated circulating T-cell state ( Figure 3C). In contrast, tonsillar T cells exhibited elevated levels of TRAC and IGKC, reflecting active T-cell receptor signaling and local immune activation within the lymphoid tissue ( Figure 3C). Analysis of distinct transcriptional profiles between blood and tonsillar B-cells revealed that blood B-cells exhibited higher expression of JUN, JUNB, DUSP1, and TSC22D3, indicating activation of immediate-early response and stress-regulatory pathways characteristic of circulating immune cells ( Figure 3C). In contrast, tonsillar B-cells showed increased expression of ACTG1, consistent with enhanced cytoskeletal remodeling and local activation within the lymphoid microenvironment ( Figure 3C). Together, these results indicate that tonsil immune cells adopt a tissue-adapted, immunoregulatory phenotype that differs markedly from blood-derived immune populations.

Analysis of HLA gene expression revealed compartment-specific patterns across immune cell types ( Figure 3D). HLA-C , a class I molecule, showed elevated expression in blood myeloid, B-cells, and T-cells, reflecting stronger class I–mediated antigen presentation in circulating immune populations. In contrast, HLA-DRA and HLA-DQA1, both class II molecules, were highly expressed in tonsillar myeloid cells ( Figure 3D), consistent with enhanced antigen-presenting activity within the lymphoid tissue. These findings highlight differential regulation of antigen presentation pathways between blood and tonsil compartments.

Discussion

Single-cell transcriptomics, a core component of the Method of the Year 2013, ²⁴ has become a cornerstone of modern molecular biology. It enables the unprecedented dissection of cellular heterogeneity, lineage relationships, and transcriptional dynamics across tissues and conditions. ²⁵ Yet, despite this technological progress, analytical workflows remain fragmented, inconsistently parameterized, and difficult to reproduce.

ScToolkit addresses these very limitations by unifying preprocessing, normalization, clustering, and visualization within a single reproducible framework. At its core, configuration and metadata tracking are embedded in the AnnData object, ensuring transparent provenance and consistent parameter management. This integration eliminates external bookkeeping and enables analyses to be reproduced or transferred across datasets without loss of information or interpretability.

The integration of immune cell transcriptomes from blood and tonsil demonstrates the value of this design. Consistent preprocessing preserved the biological signal without introducing artificial batch effects. Consensus marker-based annotation recovered canonical immune populations, and similarity analysis confirmed agreement with reference annotations. Differences in HLA class I and class II expression across tissues reflected genuine biological contrasts, showing that standardized preprocessing reveals true variation rather than computational artifacts.

The biological plausibility of our findings further supports the robustness of the framework. In our use case, blood-derived cells exhibited inflammatory and stress-responsive transcriptional signatures, while tonsillar cells were enriched for antigen presentation and immune regulation (as shown previously). ¹¹ These patterns emerged from careful parameter control rather than overfitting. Subclustering was intentionally avoided because the available sequencing depth of approximately 25,000 reads per cell did not support confident resolution of minor subpopulations. ²⁶

Although advances in computational methodology continue to expand the analytical toolkit for single-cell transcriptomics, no post hoc algorithm can replace a well-designed experiment. In datasets with limited sequencing depth or insufficient replication, complex denoising, imputation, trajectory, or copy number inference methods risk emphasizing structure that does not exist. These approaches often operate where signal and noise are inseparable, obscuring the boundary between enhancement and artifact generation. Until upstream protocols achieve higher RNA integrity, lower ambient contamination, and greater depth per cell, conservative and transparent analyses are preferable to speculative modeling. ScToolkit therefore deliberately omits such post hoc inference modules, prioritizing reproducibility and interpretability. As computational methods mature, they must be benchmarked on biologically diverse, non-simulated datasets generated across independent laboratories. Only under such conditions can new modules be integrated modularly without compromising scientific rigor.

Reliable downstream analysis requires data of sufficient quality. Sequencing depth, saturation, mapping rates, and base-calling accuracy directly determine the confidence of downstream inference. Even large-scale atlas datasets such as the Tonsil Atlas reflect the field’s tendency toward insufficient sequencing depth, allowing lineage resolution but limiting subpopulation analysis. We recommend that studies report sequencing depth, saturation, phred quality, mitochondrial and ribosomal fractions, and other key quality metrics to ensure reproducibility and comparability across laboratories.

We urge the field to reconsider certain analytical conventions. For example, neighborhood graphs may be more robust when computed directly on log-transformed data rather than PCA-reduced dimensions. Raw (non-normalized) counts can be used for differential expression, while log or log-normalized data should be avoided for Wilcoxon ranking, as numerical instabilities can introduce artificial ranks. While our implementation adheres to current good practices for compatibility, we advise these conventions should be revised to better reflect the properties of scRNA-seq data.

ScToolkit extends rather than replaces existing infrastructure. It provides a coherent environment that combines computational performance with transparent parameterization and curated biological annotation. In summary, it strengthens reproducibility and interpretability as cornerstones of rigorous scientific practice.

Software availibility

The complete source code for scToolkit is freely available at. Documentation and usage instructions are provided. -

Source code available from: https://github.com/AG-Boerries/scToolkit

Archived software available from: https://doi.org/10.5281/zenodo.17876855

License: BSD 3-Clause “New” or “Revised” License

Data sources

Human Tonsil Cell Atlas v2: https://zenodo.org/records/8373756 ¹¹

Tonsil v1 (h5ad): https://zenodo.org/records/14196253/files/tonsil.h5ad ¹⁹

Blood v1.1 (h5ad): https://zenodo.org/records/14196253/files/blood.h5ad ¹⁹

Use of AI-assisted editing tools

ChatGPT (OpenAI, version 5.1) was used solely as a language-editing assistant to improve readability, correct typographical errors, and refine grammar. It was not used to generate, interpret, or modify scientific content, analyses, results, or conclusions. All scientific writing, data interpretation, and methodological descriptions were produced entirely by the authors.

Data availability

The datasets analyzed in this study were obtained from publicly available repositories. Single-cell RNA sequencing data from human tissues were downloaded from the DISCO database https://www.immunesinglecell.com/ ¹⁹ and the Human Tonsil Cell Atlas https://zenodo.org/records/10373041 ¹¹ (version 2).

All datasets were processed and analyzed in accordance with the data-usage policies of the respective sources.

The Human Tonsil Cell Atlas v2 is available under a CC BY 4.0 license: https://doi.org/10.5281/zenodo.8373756. ¹¹

The DISCO Tonsil and Blood datasets are available under a CC BY 4.0 license: https://doi.org/10.5281/zenodo.14196253. ¹⁹

Acknowledgments

We thank Andreas Tsouris and Eyleen Corrales for valuable discussions and constructive feedback on this study.

References 1

Zappia

Phipson

Oshlack

: Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database. PLoS Comput. Biol. 2018;14:e1006245.

Khetan

: Single-cell RNA sequencing algorithms underestimate changes in transcriptional noise compared to single-molecule RNA imaging. Cell Reports Methods. 2024;4:100933. 10.1016/j.crmeth.2024.100933

Crowell

Morillo Leonardo

Soneson

: The shaky foundations of simulating single-cell RNA sequencing data. Genome Biol. 2023;24:1–19.

Hou

: A systematic evaluation of single-cell RNA-sequencing imputation methods. Genome Biol. 2020;21:1–30.

Cao

Yang

JYH

: A benchmark study of simulation methods for single-cell RNA sequencing data. Nat. Commun. 2021;12:1–12.

Traag

Waltman

Eck

van : From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 2019;9:1–12.

Gao

: Delineating copy number and clonal substructure in human tumors from single-cell transcriptomes. Nat. Biotechnol. 2021;39:599–608. 33462507

10.1038/s41587-020-00795-2

De Falco

Caruso

: A variational algorithm to detect the clonal copy number substructure of tumors from scRNA-seq data. Nat. Commun. 2023;14:1–13.

Song

: Benchmarking copy number aberrations inference tools using single-cell multi-omics datasets. Brief. Bioinform. 2025;26. 10.1093/bib/bbaf076

Mahdipour-Shirayeh

Erdmann

Leung-Hagesteijn

: sciCNV: high-throughput paired profiling of transcriptomes and DNA copy number variations at single-cell resolution. Brief. Bioinform. 2022;23:1–15.

Massoni-Badosa

: An atlas of cells in the human tonsil. Immunity. 2024;57:379–399.e18. 38301653

10.1016/j.immuni.2024.01.006

Soneson

Srivastava

Patro

: Preprocessing choices affect RNA velocity results for droplet scRNA-seq data. PLoS Comput. Biol. 2021;17:e1008585.

Lan

Tang

Liu

: A survey of biclustering and clustering methods in clustering different types of single-cell RNA sequencing data. Brief. Funct. Genomics. 2025;24:10. 10.1093/bfgp/elaf010

Wolf

Angerer

Theis

: SCANPY: Large-scale single-cell gene expression data analysis. Genome Biol. 2018;19:1–5.

Dyer

: Ensembl 2025. Nucleic Acids Res. 2025;53:D948–D957. 39656687

10.1093/nar/gkae1071

Seal

: Genenames.org: the HGNC resources in 2023. Nucleic Acids Res. 2023;51:D1003–D1009. Genenames.org

36243972

10.1093/nar/gkac888

Baldarelli

: Mouse Genome Informatics: an integrated knowledgebase system for the laboratory mouse. Genetics. 2024;227. 10.1093/genetics/iyae031

Hounkpe

Chenou

Lima

de : HRT Atlas v1.0 database: redefining human and mouse housekeeping genes and candidate reference transcripts by mining massive RNA-seq datasets. Nucleic Acids Res. 2021;49:D947–D955. 10.1093/nar/gkaa609

: Rediscovering publicly available single-cell data with the DISCO platform. Nucleic Acids Res. 2025;53:D932–D938. 39535037

10.1093/nar/gkae1108

Franzén

Gan

Björkegren

JLM

: PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database. 2019;2019:46. 10.1093/database/baz046

: CellMarker 2.0: an updated database of manually curated cell markers in human/mouse and web tools based on scRNA-seq data. Nucleic Acids Res. 2023;51:D870–D876. 36300619

10.1093/nar/gkac947

Ianevski

Giri

Aittokallio

: Fully-automated and ultra-fast cell-type identification using specific marker combinations from single-cell transcriptomic data. Nat. Commun. 2022;13:1–10.

: Automatic cell-type harmonization and integration across Human Cell Atlas datasets. Cell. 2023;186:5876–5891.e20. 38134877

10.1016/j.cell.2023.11.026

Method of the Year 2013.

Nat. Methods. 2014;11:1. 24524124

10.1038/nmeth.2801

Luecken

: Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods. 2022;19:41–50. 34949812

10.1038/s41592-021-01336-8

Menon

: Clustering single cells: a review of approaches on high-and low-depth single-cell RNA-seq data. Brief. Funct. Genomics. 2018;17:240–245. 29236955

10.1093/bfgp/elx044

10.5256/f1000research.191759.r476446

Reviewer response for version 1

Baldoni

Pedro L

1 2 Referee https://orcid.org/0000-0002-9510-8326 1University of Pittsburgh, Pittsburgh, USA 2Walter and Eliza Hall Institute of Medical Research, Melbourne, Victoria, Australia

Competing interests: No competing interests were disclosed.

5 5 2026

2026

This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

recommendation

approve-with-reservations

The authors present scToolkit, a Python workflow built on top of Scanpy that aims to consolidate and standardise the major steps of scRNA-seq analysis. The presented method has functionalities spanning quality control, clustering, downstream visualisation, and differential expression analyses. Reproducibility in single-cell analysis is a genuine and well-documented challenge, and a unified framework embedded in the AnnData object, as presented here, is a sensible choice. The authors present a case study of immune cell data from blood and tonsil to showcase the method. Below I present some major and minor comments that I believe would strengthen the paper.

Major

- In the Introduction, the authors present an extensive list of current practices in single-cell data analysis alongside reasonable critiques. However, many of the issues raised, such as those caused by imputation or trajectory inference, are not actually addressed by scToolkit and are only marginally related to the paper. This distracts readers from the actual contributions of the work. A more concise introduction focused on motivating scToolkit's design decisions would improve the flow of the paper.

- To strengthen the paper, I believe the authors could add a more thorough discussion of the practical benefits of reproducibility afforded by the config tracking system, ideally through a side-by-side comparison between scToolkit and plain Scanpy. For example, is it possible to quantify the time savings and degree of reproducibility in a scRNA-seq dataset reanalysis with and without scToolkit? That would more clearly highlight the benefits of scToolkit from the perspective of a practising bioinformatician.

- Beyond what other reviewers have already commented regarding the limitations of Wilcoxon rank-sum tests for scRNA-seq data, I believe the authors should consider implementing or at least discussing the possibility of pseudo-bulk analysis within scToolkit, which remains the gold-standard approach for differential gene expression in single-cell data.

- It would be helpful if the authors provided code examples illustrating how users can run scToolkit directly in the paper. That would give readers a one-stop reference for getting started with the presented method.

- There is a mismatch between the cluster labels in Figure 2E and those in Figures 2B–C. The authors should address this inconsistency.

- In the second-to-last paragraph of the Discussion, the authors make recommendations for best practices in scRNA-seq data analysis ("We urge the field to reconsider certain analytical conventions."). The authors should consider adding citations to support those suggestions, or include analyses in the paper to substantiate them.

Minor

- The caption of Figure 1 needs to be fixed: "Additional panels display pct_HK_gtex (housekeeping genes from GTEx), (immunoglobulin genes from the Ensembl GTF), and pct_HGNC_pseudogene (pseudogenes from HGNC)." A label appears to be missing before the parenthetical description of immunoglobulin genes.

- Typo on page 8: "As an exampler" should be "As an example."

- It would be clearer to use a diverging colour palette (e.g. blue–white–red) with white centred at zero in Figures 3B–C.

- "ScToolkit" and "scToolkit" are used interchangeably throughout the manuscript. A single form should be adopted and applied consistently.

- There is a missing URL in the Software availability section: "The complete source code for scToolkit is freely available at. Documentation and usage instructions are provided."

Is the rationale for developing the new method (or application) clearly explained?

Partly

Is the description of the method technically sound?

Partly

Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

Partly

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Yes

Are sufficient details provided to allow replication of the method development and its use by others?

Yes

Reviewer Expertise:

Statistics, bioinformatics, transcriptomics

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

10.5256/f1000research.191759.r454905

Reviewer response for version 1

Köster

Johannes

1 Referee https://orcid.org/0000-0001-9818-9320 1Institute for AI in Medicine (IKIM), University of Duisburg-Essen, Essen, Germany

Competing interests: No competing interests were disclosed.

23 4 2026

2026

recommendation

approve-with-reservations

The authors present scToolkit, a Python library for conducting single cell analysis. The library is motivated by the shortcomings of the ad-hoc nature of many single cell analyses conducted nowadays, which may lack (hyper-) parameter documentation, QC, and suffer from robustness and thus reproducibility issues. ScToolkit is presented as an approach to simplify the use of scanpy in a way that overcomes these issues.

Comments

Page 1: word repetition (Human Tonsil Atlas atlas)

Page 4/5: For some of the mentioned improvements over scanpy, I wonder whether the effort has been made to contribute them upstream? The improvements could still be described in this paper, but all scanpy users would benefit. Example: in the QC metric calculation section (Page 4) the authors say that scanpy's metrics have been adapted by simplifying names, removing redundant keys, and using more efficient data types for certain values (16-bit floats for percentages).

Page 5: The clustering improvements should be explained more formally and proven. For example, they could be validated/benchmarked in this manuscript, ideally with a given (simulated or extensively manually curated) ground truth.

Page 5: I wonder why DEG analysis relies on Wilcoxon instead of utilizing RNA-seq specific DE models like EdgeR (cf. the single-cell DE benchmark paper by Soneson et al.). This would allow batch handling and have more accurate distributional assumptions. Also, Wilcoxon assumes that the distributions are equal under the null hypothesis. Such an assumption might be violated by effects caused by interaction with other genes or batch effects.

Page 5: the authors should explicitly show what's wrong with scanpy's fold changes and how they are fixed by scToolkit.

Page 5: the mentioned bi-directional conversion between dictionaries and dataframes is unclear to the reader at this stage. Do the authors mean that gene sets can be provided in both formats? If that is meant, why bi-directional, since scToolkit must have decided for one internal representation. Maybe I just don't get it and it needs more explanation, sorry :-).

Page 6: the used thresholds for keeping cells/genes in the use case should be motivated somehow. How where they chose, and what was the rationale each time.

Page 6: I am confused by the wording when cell exclusion is explained. First, the authors say that cells were excluded when fewer than 500 genes were expressed, in the next sentence they say cells with fewer than 10 unique genes were removed. Is the latter also expressed genes? How is expression of a gene defined in terms of the minimum read count for that, what would the additional "unique" mean compared to the sentence before?

Figure 1A: Almost all violin plots become narrower towards the lower bound of the plot. This is irritating since the authors mentioned that only the top 10% quantile range of each metric is shown. Could it be that the violin plots have been estimated on that part of the data only and we see a boundary effect of the density estimate?

Figure 1A: The claim that pct_GTF-ID_genes is multimodal is not well visible in the plots. I wonder whether features like that would not be better visible with vertical histograms instead of violin plots. That would also induce less model specific bias, make the superposition of the individual dots superfluous (if bin-sizes are chosen small enough to capture local features accurately) and avoid density estimate artifacts towards the boundaries.

Page 8: the strategy for obtaining consensus marker genes for cell types leaves some questions to me. I would like to read about a motivation of choosing the top 20. I can imagine that there will be cell types were choosing 20 means included some that are unique to certain databases and others were choosing 20 means skipping genes that are still recurrent for the same cell type in some of the databases. Wouldn't it be better to define a criterion like "the gene has to occur for the cell type in at least 3 of the 4 databases" instead of taking a fixed number from a ranked list?

Page 8, top: "exampler" -> example; "redived" is not an english word I think...

Page 8, clustering: I cannot follow the argument how the UMAPs in Fig 2A show that there are no batch effects. In particular since UMAP can, depending on the used parameters, suffer from both artificially glueing stuff together as well as artificially ripping stuff apart.

Page 8, clustering: I am confused abouve Figure 2D. How can the Granulocyte cells derive from ORA have a jaccard score of almost 1 with both the Monocyte and the Granulocytes in the reference annotation? Further, the label of the x-axis seems to be missing.

Page 8, clustering: I think there is a contradiction. At the beginning of the paragraph the authors claim there there are no (visible) batch effects between the cell origins. At the end of the paragraph, the authors claim that their approach makes specialized batch correction unnecessary. However, if this dataset does not exhibit batch effects, how can analyzing it show that batch effect correction is not needed?

Figure 2E: why does the legend show T-cell for the green cluster while it is annotated as B-cell with both ORA and the reference annotation?

Page 8, differential expression analysis: cf. my comment on Wilcoxon from above. The authors should at least verify that for non-diffexp genes the distributional properties are the same. I am further unsure how the tests were performed: cluster of interest against each other cluster separately or against the union of all other clusters? In the latter case, the Wilcoxon assumption about the same distribution under the null hypothesis would be violated for sure (e.g. generating mix distributions with up in the first other cluster, down in the second other cluster).

Page 8, differential expression analysis: what adjustment method is used for the p-values? Since the genes in a cell might depend on each other in their expressions, was an FDR control method used that accounts for this, e.g. Benjamini–Yekutieli?

Page 8, differential expression analysis: although probably somewhat accepted, I would like to see a discussion about the effects of the essential pseudoreplication approach that is taken here. Cells are treated as independent although they aren't. Will this hamper the results? Can the effect be quantified? What alternatives are there (e.g. mixed effects models) and why have they been ruled out here? Maybe one could even compare and see whether the results are different then?

Software: the usage/installation instructions are overcomplex. Since scToolkit is a Python library, I strongly suggest to properly package it and upload it to pypi as well as providing a conda recipe in Bioconda. This way, people can simply install it with a conda frontend of their choice (conda, pixi, micromamba; note: mamba is deprecated and unmaintained nowadays). No need for all those complex steps outlined in the readme. The recipe on Bioconda can even include the required R packages for the seurat compatibility layer. In case you want to avoid those dependencies for the average user, the recipe can have two outputs, a minimal one without the R stuff, and a full one including those.

Reproducibility: in particular since this is mentioned in the paper, one would expect the code for the data analysis conducted in the use case to be available. Moreover, it should be available ideally in a way such that the reader can obtain the data and reproduce all figures automatically.

Is the rationale for developing the new method (or application) clearly explained?

Yes

Is the description of the method technically sound?

Partly

Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

Partly

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Yes

Are sufficient details provided to allow replication of the method development and its use by others?

Partly

Reviewer Expertise:

bioinformatics, reproducibility

10.5256/f1000research.191759.r464687

Reviewer response for version 1

Hemberg

Martin

1 Referee https://orcid.org/0000-0001-8895-5239 Ammula

Pranay

1 Co-referee 1Brigham and Women's Hospital, Boston, Massachusetts, USA

Competing interests: No competing interests were disclosed.

2 4 2026

2026

recommendation

approve-with-reservations

Here, Krolla et al, present scToolkit, a python package which is meant to enhance the reproducibility of scanpy. The single cell field currently has a very rich ecosystem of computational methods which on the one hand is advantageous as it provides users with many options. On the other hand, it can also be bewildering and make it difficult for a user to gain an overview. A particular challenge, not just for single cell genomics, is to ensure reproducible findings. As such, scToolkit is a welcome addition, that seeks to facilitate and streamline the analysis process. I believe that it will be of use to the many researchers using Scanpy for their analysis, but I nevertheless think that the authors need to address the following issues:

1. The introduction is quite fragmented, and it reads more like bullet point list of specific problems rather than trying to frame a problem, discussing the state of the art, and existing solutions. In particular, it would be useful to mention if there are other packages (including for R) that have similarities to scToolkit.

2. The set of functions available for storing QC information seem very useful. However, I believe that the authors have missed an opportunity here. What would be very useful is to have a set of functions for comparing and visualizing QC metrics across studies to facilitate these types of comparisons.

3. I suspect that the gene sets used in fig 1b have a lot of overlap. Would it be possible to also draw an upset plot to illustrate this?

4. The authors highlight differences across the various metrics in 1a, but it would be useful to have statistical tests to support these claims.

5. Just eyeballing the data to conclude that there are no batch effects is not all that convincing. I suggest that the authors use one or more quantitative metrics (e.g. batch entropy) to assess.

6. In 2d, the information from the numbers and the colors are redundant. I would suggest using absolute numbers instead to add more information.

7.The environment YAML file for installation does not function properly on HPC systems as it experiences missing dependencies and complete build failures. The authors should make sure to test across more platforms.

8.The code needs both 'SLURM_JOB_CPUS_PER_NODE' and 'marsilea' to run correctly because it requires users to create a database symlink manually, without which the execution fails, and this becomes clear only during execution.

9.The function 'get_n_unique()' does not process dense inputs as mentioned, while the function 'save_h5ad()' requires users to execute 'setup_adata()' before using it, this requirement remains undocumented.

Is the rationale for developing the new method (or application) clearly explained?

Partly

Is the description of the method technically sound?

Yes

Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

Yes

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Yes

Are sufficient details provided to allow replication of the method development and its use by others?

Yes

Reviewer Expertise:

Computational genomics

We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above.