scANANSE gene regulatory network and motif analysis of single-cell clusters

The recent development of single-cell techniques is essential to unravel complex biological systems. By measuring the transcriptome and the accessible genome on a single-cell level, cellular heterogeneity in a biological environment can be deciphered. Transcription factors act as key regulators activating and repressing downstream target genes, and together they constitute gene regulatory networks that govern cell morphology and identity. Dissecting these gene regulatory networks is crucial for understanding molecular mechanisms and disease, especially within highly complex biological systems. The gene regulatory network analysis software ANANSE and the motif enrichment software GimmeMotifs were both developed to analyse bulk datasets. We developed scANANSE, a software pipeline for gene regulatory network analysis and motif enrichment using single-cell RNA and ATAC datasets. The scANANSE pipeline can be run from either R or Python. First, it exports data from standard single-cell objects. Next, it automatically runs multiple comparisons of cell cluster data. Finally, it imports the results back to the single-cell object, where the result can be further visualised, integrated, and interpreted. Here, we demonstrate our scANANSE pipeline on a publicly available PBMC multi-omics dataset. It identifies well-known cell type-specific hematopoietic factors. Importantly, we also demonstrated that scANANSE combined with GimmeMotifs is able to predict transcription factors with both activating and repressing roles in gene regulation.


Introduction
Single-cell RNA-sequencing (scRNA-seq) and single-cell ATAC-sequencing (scATAC-seq), enable measurement of gene transcripts (Islam et al., 2014) and genome accessibility (Buenrostro et al., 2015) at single-cell resolution.By performing single-cell sequencing on complex biological tissues and systems, various types of cells present in the system can be identified.Furthermore, gradual changes during development and differentiation trajectories can be scrutinised.The transcriptome and accessible genome of various cell populations can be quantified, which is not obtainable using bulk analyses (Huang, 2009;Li and Clevers, 2010).Capturing heterogeneity is vital in studying complex tissues, or while studying gradual processes such as development and differentiation, in which not all cells develop at the same rate or follow the same trajectory (Welch, Hartemink and Prins, 2016).
One of the main drivers of differences in cellular identity and developmental processes are transcription factors (TFs).To regulate gene expression, many TFs bind the DNA directly on DNA binding motifs.These motifs are present within cis-regulatory elements (CREs), which are functionally categorised as promoters, enhancers, or insulators (Lambert et al., 2018;Chen and Pugh, 2021).These cis-regulatory elements (CREs) can be used to scan for binding motifs.However, motif enrichment does not take into account the target of CREs, the nearby genes.To better predict the impact and importance of TFs, modelling gene regulatory networks (GRNs) is preferable.
By combining (differential) gene expression, genome accessibility, and motif enrichment, with the nearby location of target genes, it is possible to generate a directed GRN.Software to predict GRNs have been actively developed since the emergence of next-generation sequencing (Mercatelli et al., 2020).The addition of genome accessibility data and incorporation of long-range CREs is a successful method to model directed-GRNs (Xu et al., 2021;González-Blas et al., 2022;Kamal et al., 2022).Since both scRNA-seq and scATAC-seq are available, performing directed GRN analysis can now be applied to single-cell datasets.
There are multiple single-cell-based GRN tools available, capable of combining scRNA-seq and scATAC-seq data (Kamimoto, Hoffmann and Morris, 2020;Fleck et al., 2021;González-Blas et al., 2022;Kartha et al., 2022).However, since single-cell data contains shallow coverage per cell and one of the main challenges these tools face is using this sparse data.Furthermore, since these tools are specifically designed for single-cell data, making comparisons of their results with available bulk datasets is challenging.
In contrast, using single-cell data from clusters as pseudo-bulk can be used relatively straightforwardly as input for many GRN tools available.To identify key TFs using GRN approaches, we previously developed the gene regulatory network analysis software ANANSE (Xu et al., 2021).ANANSE has multiple advantages: it incorporates CRE signal in 100kb windows, contains extensive TF binding models trained on the REMAP database, and can analyse data on all vertebrate species and even on non-vertebrate species with some additional steps.Theoretically, ANANSE could be run on singlecell pseudo-bulk data; however, the steps involved in generating data per cluster and running all the needed pairwise comparisons are labour-intensive and non-intuitive, while they require extensive bio-informatic skills.
Here, to enable ANANSE single-cell cluster analysis, we have developed an analysis pipeline called single-cell ANANSE (scANANSE).This pipeline consists of newly developed packages to export data from single-cell objects, either Seurat objects using the R implementation (AnanseSeurat), or from Scanpy objects using the Python implementation (AnanseScanpy).Next, an automated snakemake pipeline of ANANSE facilitates the GRN modelling.In parallel, it integrates motif enrichment analysis using GimmeMotifs (van Heeringen and Veenstra, 2011;Bruse and Heeringen, 2018).This addition is used to identify TFs with repressive properties, which are generally not properly predicted by ANANSE.Lastly, transcription factor influence score and motif enrichment results can be imported back into the singlecell object for downstream analysis and visualisation.
The performance of the scANANSE pipeline is demonstrated on a publicly available PBMC multi-omics dataset, as an example workflow including the installation of all software needed to run the analysis.In this PBMC case study, scANANSE uncovered many well-known activating TFs within the hematopoietic lineages.Including CEBPD and SPI1 in monocytes, EBF1 and MEF2C in B-cells, and STAT4 and LEF1 in T-cells.In addition, motif enrichment and expression correlation identify both the well known repressors PAX5 and STAT6 within B-cells.

Implementation
The scANANSE pipeline consists of two components: a package to export data from and import data towards single-cell objects, and a snakemake implementation of ANANSE called anansnake (Figure 1).Crucial steps before running scANANSE are pre-processing, quality control, and clustering of single-cell data.For these steps, a large number of welldescribed workflows are available (Zappia and Oshlack, 2018; Luecken and Theis, 2019; Baek and Lee, 2020).scANANSE exports data from the single-cell object of choice.Transcripts Per Million (TPM), Differential Expressed Genes (DEGs) and peak counts need to be calculated based on the single-cell objects supplied.For Seurat objects in the programming language R, the R package "AnanseSeurat" was developed to perform these steps.While for Scanpy objects in the programming language Python, the Python package "AnanseScanpy" was developed.After pre-processing and clustering, data is exported using either AnanseSeurat or AnanseScanpy.Next, Anansnake automatically runs ANANSE after which the influence scores and motif enrichment results with AnanseSeurat or AnanseScanpy are imported.In parallel, Anansnake runs motif enrichment analysis using gimme maelstrom, and the motif results are imported and linked to the highest correlating TFs using the single-cell object scRNA-seq data.
The TPM counts, DEGs, and ATAC peak counts can be exported from one single-cell object containing both the scRNA-seq data and scATAC-seq data, or from two separate single-cell objects.In the case of two single-cell objects, these objects need to share their cluster names, e.g. by transferring anchors between separate scRNA-seq and scATACseq datasets (Stuart et al., 2019).As such, scRNA-seq and scATAC-seq data from multiple studies or experiments can be combined and used as input.
By default, scANANSE compares each cluster to a gene regulatory network built from the average expression and gene accessibility of all clusters.This average network is used as a common comparison to compare all clusters.These comparisons result in an average GRN 'TF-influence' score.This score quantifies the importance of a TF driving the differences between a specific cluster and the average of all other cell clusters.In this way, the TF influence score can be compared across multiple clusters.In addition to this general approach, more detailed direct cluster-to-cluster GRN analyses are possible.
One downside of the GRN modelling of ANANSE is the lack of prediction of repressive TFs.To counteract this blind spot of the algorithm, motif enrichment with GimmeMotifs is performed in the scANANSE pipeline.It not only performs motif enrichment but is combined with a correlation of motif-z-scores and TF expression across clusters within the singlecell object.This addition enables the ability to predict repressive TFs.
Finally, both AnanseSeurat and AnanseScanpy can be used to import the TF influence and motif enrichment scores back into your single-cell object for further visualisation and analysis.All the source code and the conda environment YAML files used to generate the results presented in this article are available in Github and Zenodo (Arts et al., 2022).

Operation
Minimal system requirements A computer running UNIX, Linux, Windows Subsystem for Linux (WSL or Mac OS can run scANANSE.A minimum of 32 GB of RAM and 100 GB Of disk space is needed for a typical analysis, however, an amount of 64 GB of RAM is recommended to decrease runtime.

Use cases: PBMC monocytes
The multi-omics dataset generated on human Peripheral blood mononuclear cells (PBMCs) publicly provided by 10Â (PBMC from a Healthy Donor (v1, 150Â150) Single Cell Multiome ATAC + Gene Expression Dataset by Cell Ranger ARC 2.0.0, 10Â Genomics, 2022, December 20) is used as a case study.The scANANSE pipeline can also handle separate scRNA-seq and scATAC-seq objects with identical cluster names.However, within this example, scRNA-seq and scATAC-seq are part of the same single-cell object.
Part 1: Installation and setup The package management system Conda is installed with two environments: anansnake and scANANSE.The following folder structure is used: 1a.Create folders mkdir -p scANANSE/analysis mkdir -p scANANSE/data 1b.Install Conda The operating system and computing environment are set up as listed in the minimal system requirements.Next, Conda is installed.

1e. Install hg38
The location where Genomepy installs genomes is set using the -g flag.Since UCSC has three annotations for hg38, the version with HGNC gene names is selected, using --UCSC-annotation.scANANSE requires HGNC gene names to run.conda activate anansnake genomepy install hg38 -g scANANSE/data --UCSC-annotation refGene

1f. Install AnanseSeurat and R packages
There are code blocks equivalent for exporting and visualising the data in python using Ananscanpy.See the extended data file "AnanseScanpy_equivalent.pdf" in the extended data (Arts et al., 2022) for these same steps but in Python.If RStudio needs to be installed on your system, see "install_Rstudio.pdf" in the extended data on Zenodo (Arts et al., 2022).

conda activate scANANSE rstudio
From R (studio): install.packages("AnanseSeurat")Part 2: Quality control and clustering of scRNA-seq and scATAC-seq data In this example we use data from 10x pre-processed by a vignette from Signac (2022).This dataset comes with a vignette performing default quality control, clustering, and annotation from the PBMC atlas from Hao et al. (2021).Proper quality control and clustering are vital for all single-cell analyses for these topics, however, there already exist some excellent reviews about these topics (Zappia and Oshlack, 2018;Luecken and Theis, 2019;Baek and Lee, 2020) For the ATAC-seq data, a matrix containing the counts per peak per cluster is generated.For RNA-seq, CPM equivalent values are needed.Since the data is UMI normalised, CPM is already equivalent to regular depth normalised data (Phipson, Zappia and Oshlack, 2017).By default, scANANSE compares all clusters to a network based on the average values of all clusters.Additional comparisons can be specified, in this case, B-naive and B-memory cells were also specified to compare directly to each other.

CD4-Naive CD4-TCM average
Part 4: Anansnake Next, snakemake is run from within a screen session.This takes approximately 3 hours per cluster plus 2 hours for motif enrichment analysis, but this is also highly dependent on computer speed.With less than 64 GB of RAM available, we recommend downscaling the core number to a maximum of 6 cores.With more RAM and more cores available the core count should be increased to reduce analysis time.
Additionally, it is possible to add extra samples and/or networks to the anansnake run.This enables including other samples and other networks in your comparisons.When performing additional anansnake comparisons please go through the anansnake documentation in detail.

5a. Import the ANANSE results
After running ANANSE with anansnake, the influence output is imported back into the single-cell object.
Medvedovic et al., 2011, p. 5).In particular, PAX5 is an intriguing finding since it is not only well known to promote B-cell genes, but also to repress non-B-cell lineage genes (Boller and Grosschedl, 2014).This repressive property is however not included in the ANANSE analysis.And its prediction is likely attributed to the smaller effect of gene activation PAX6 has on specific target genes.
T-cell TFs include both GATA3 and LEF1, which are crucial for specifying the T-cell fate (Novershtern et al., 2011).

5d. Visualise TF expression and influence on a UMAP
The presence of the influence scores enabled clear visualisation of the influence and expression of specific TFs across the dataset.As an example, three TFs are visualised with a wide variety of influence and expression across clusters (see Figure 3).MemoryInfluence <-read.Optional part: Motif enrichment for predicting repressive factors Since ANANSE's assumptions for GRN modelling are not valid for repressive factors, one limitation is the inability to reliably predict repressive TFs.Motif enrichment can be used for identifying motifs with reduced accessibility, however due to the lack of a one-on-one link of motifs and TFs, and the difference of these interactions between tissues, it is tricky to reliably link motifs with their most relevant factors in the cell type of interest.
However, with single-cell cluster data, it is possible to link motifs and TFs based on motif and expression correlation across multiple clusters.This approach does enable scANANSE to identify potential repressive factors.It is however a step down from the GRN modelling approach, but for identifying potential repressive factors it is an easy step to incorporate, which we therefore choose to include.
We will first incorporate the enrichment result after running anansesnake.

Link TFs to motifs based on their correlation coefficient
The enriched motifs are linked to TFs based on the non-redundant motif-TF database generated by GimmeMotifs.A correlation score is calculated between the motif-z-scores and TF expression values.When multiple TFs map to the same motif of interest, the TF with the highest absolute correlation is linked to this motif.After linking all motifs, one TF can be linked to multiple motifs.In that case, there are multiple options for selecting the most relevant motif.
First of all, it is possible to take the mean motif score, secondly by selecting the motif with the most variable signal, or thirdly by selecting the motif with the highest absolute correlation between enrichment and expression.Here we use the motifs with the highest correlation to the expression.
Finally, two assays are added to the single-cell object, one consisting of a positive correlation with linked motifs, which indicates a TF promoting genome accessibility, and one assay consisting of a negative correlation with linked motifs, which indicates TFs repressing genome accessibility.A TF can be present in both assays when it is linked both with a motif with a positive correlation and a motif with a negative correlation.

Conclusions
Here we demonstrate that scANANSE is able to decipher the gene regulatory networks driving the identity of single-cell clusters.This enables the identification of TFs that drive the cellular identity of single-cell clusters of scRNA-seq and scATAC-seq datasets.
Currently, there are multiple other tools available and under development for performing GRN analysis using a combination of scRNA-seq and scATAC-seq data.Examples include software such as SCENIC+ (González-Blas et al., 2022), Pando (Fleck et al., 2021), CellOracle (Kamimoto, Hoffmann andMorris, 2020) andFigR (Kartha et al., 2022).These tools have the advantage and the challenge of calculating GRNs using individual cells.While they are not relying on clustering before GRN analysis, these tools struggle at identifying low expressed target genes and TFs since individual cells have low transcriptome coverage.Comparing and benchmarking all these single-cell GRN tools is beyond the scope of this paper, but would be an exciting addition to the field in the future.
scANANSE has some clear advantages.First of all, it has the ability to analyse single-cell data generated from all vertebrate genomes.When working with non-vertebrate data, extra steps for identifying homologous genes across phyla are required before running scANANSE.For more information on that topic, see the ANANSE documentation on the motif database.This flexibility enables GRN analysis on single-cell data from a high variety of organisms.Furthermore, due to the pseudo-bulk approach, it is possible to compare single-cell cluster gene regulatory networks against networks generated from traditional bulk sequencing data.Although the amount of publicly accessible single-cell datasets is growing, there is an even larger amount of bulk sequencing datasets available.Moreover, the possibility and flexibility of comparing GRNs from multiple sources is another advantage of scANANSE, extra care and validation is still needed when using networks from different data sources.
scANANSE makes a few assumptions that are important to note regarding the average network comparison.Using the average network as the background comparison against each cluster-specific network enables the identification of TFs driving each specific cluster.In the case of small cluster numbers, this approach is however limiting the reliability and the number of factors identified since the average network contains accessibility data from all clusters including the cluster being compared.In cases with low cluster numbers, it is therefore recommended to run scANANSE including pairwise comparisons between all the clusters.
Another limitation of the GRN modelling of ANANSE is its inability to predict repressive transcription factors, or factors with context-dependent and/or repressive properties (Krishnakumar et al., 2016;Pang and Snyder, 2020).While deciphering molecular mechanisms, the inclusion of repressive factors and factors with context-dependent purpose is highly useful (Gaston and Jayaraman, 2003;Bauer, Buske and Bailey, 2010;Arnold et al., 2013).ANANSE however uses a rank mean approach which assumes all TF target gene relations are activating, while furthermore requiring a TF to be higher expressed.These assumptions are not always applicable to TFs with repressive or context-dependent functions (Xu et al., 2021).To alleviate some of this limitation, we have integrated motif enrichment analysis from the GimmeMotifs toolkit.Combining the motif z-score with a correlation of TF expression provides a straightforward tool to link motifs to the most relevant TFs which can be repressive.However, this approach does not take into account the potential combinatorial function of TFs (Zeitlinger, 2020) and/or missing interactions in the TF to motif database.
With scANANSE, we have implemented a robust and capable toolkit to identify key TFs important for driving cellular identity and differentiation in single-cell data.It relies on solid pseudo-bulk signals and proven bulk-GRN approaches to identify the TFs of interest.

Extended data
Preprocessed single cell objects, code to install Rstudio and the python code equivalent for all the steps are available as well in Zenodo archive as extended data.
This project contains the following extended data: • rna_PBMC.h5ad(Processed Scanpy object containing the PBMC dataset scRNAseq data after quality control clustering and annotation) • atac_PBMC.h5ad(Processed Scanpy object containing the PBMC dataset scATACseq data after quality control clustering and annotation) • preprocessed_PBMC.Rds (Processed Seurat object containing the PBMC dataset after quality control clustering and annotation) • Install_Rstudio.pdf(code to install Rstudio on your machine) • AnanseScanpy_equivalent.pdf (code of the Python equivalent of all R code present in this manuscript) Comments: "However, since single-cell data contains shallow coverage per cell and one of the main challenges these tools face is using this sparse data."-> Would it be possible to rephrase this sentence to convey the meaning in a better way "In contrast, using single-cell data from clusters as pseudo-bulk can be used relatively straightforwardly as input for many GRN tools available."-> I suggest rephrasing "relatively straightforwardly" Bio-informatic/bulk-GRN/directed-GRNs -> Hyphens are not needed here transcription factor and gene regulatory as TF and GRNs -> Abbreviations for these words were defined in the beginning so they should be used afterwards.
"a package to export data from and import data towards single-cell objects" There is a word missing or sentence construction not clear.
Please cite R and Python in the text for example in "objects in the programming language Python" "One downside of the GRN modelling of ANANSE is the lack of prediction of repressive TFs." ->This is very interesting, would it be possible to provide an example or a reference to this statement.-> Why is it this case?and how does your approach-combining with motif data-helps?Generally, most TFs are activators and do so by opening the chromatin for transcription.
"Since ANANSE's assumptions for GRN modelling are not valid for repressive factors, one limitation is the inability to reliably predict repressive TFs." -> Can you please state these assumptions I understand that the motif infers TF motif from DEGs and then links them to candidate TFs using expression correlation.Did the authors consider using TF motifs from HT-SELEX experiments in HOCOMOCO or CIS-BP databases instead?
In Table 3, gene names are Italic but not in table1.TFs are proteins and so it is possible to not italicize their symbols in the text and table 5 for example.
"Although the amount of publicly accessible single-cell datasets is growing, there is an even larger amount of bulk sequencing datasets available" -> This sentence needs to be connected to its main idea.
Finally, I briefly reviewed ANANSE since scANANSE inherits parts of its strengths and weaknesses.A recent benchmark (FIgure 3A in DOI 10.15252/msb.202311627)stated that "enhancer-based network ANANSE showed very poor performance across all cell types".I believe that this opportunity to address these comments by doing additional benchmarks and identifying the parameter regions/specific use case where scANANSE operates in its optimal regime.I think this conversation can help the community advance on these topics.

"In contrast, using single-cell data from clusters as pseudo-bulk can be used relatively straightforwardly as input for many GRN tools available." -> I suggest rephrasing "relatively straightforwardly"
We rephrased the sentence into: "In contrast, using single-cell data from robust clusters as pseudo-bulk can be used as input for many GRN tools available" Bio-informatic/bulk-GRN/directed-GRNs -> Hyphens are not needed here We removed the hyphens transcription factor and gene regulatory as TF and GRNs -> Abbreviations for these words were defined in the beginning so they should be used afterwards.

We fixed inconsistent abbreviations
"a package to export data from and import data towards single-cell objects" There is a word missing or sentence construction not clear.
We rephrased the sentence into: "The scANANSE pipeline consists of two components: a package to export and import data from and towards single-cell objects, and a snakemake implementation of ANANSE called anansnake" Please cite R and Python in the text for example in "objects in the programming language Python" We included citations to R and Python "One downside of the GRN modelling of ANANSE is the lack of prediction of repressive TFs." ->This is very interesting, would it be possible to provide an example or a reference to this statement.-> Why is it this case?and how does your approach-combining with motif data-helps?Generally, most TFs are activators and do so by opening the chromatin for transcription.
This sentence is based on the "mean rank' concatenation performed by ANANSE which combines expression data & binding data, and due to the mean rank approach rewards positive correlation (and penalizes repressive anti-correlations).We have added a citation to the original paper after this statement to make clear where this statement arises from.We have added an additional sentence linking to the original ANANSE manuscript after rephrasing the mentioning of the current limitations towards: "One downside of the GRN modelling of ANANSE is the lack of prediction of repressive TFs.In Table 3, gene names are Italic but not in table1.TFs are proteins and so it is possible to not italicize their symbols in the text and table 5 for example.
We fixed the non-italic gene names in table1, and non-italicized the TFs in table 5 and the text.
"Although the amount of publicly accessible single-cell datasets is growing, there is an even larger amount of bulk sequencing datasets available" -> This sentence needs to be connected to its main idea.
We rephrased the paragraph to connect it better to the main idea: "Furthermore, due to the pseudo-bulk approach, it is possible to compare single-cell cluster gene regulatory networks against networks generated from traditional bulk sequencing data.Although the amount of publicly accessible single-cell datasets is growing, there is an even larger amount of bulk sequencing datasets available.Moreover, the possibility and flexibility of comparing GRNs from multiple sources is another advantage of scANANSE, extra care and validation is still needed when using networks from different data sources." Finally, I briefly reviewed ANANSE since scANANSE inherits parts of its strengths and weaknesses.A recent benchmark (FIgure 3A in DOI 10.15252/msb.202311627)stated that "enhancer-based network ANANSE showed very poor performance across all cell types".I believe that this opportunity to address these comments by doing additional benchmarks and identifying the parameter regions/specific use case where scANANSE operates in its optimal regime.I think this conversation can help the community advance on these topics.

It is important to note that Kamal et al. only tested one ANANSE network, from one specific cell type (in vitro-derived macrophages). In total, they benchmark their GRaNIE method on three cell types (all hematopoietic in origin). This means that it is very hard to draw any solid conclusions on the performance of either method in general.
The chosen evaluation metric is interesting and adds to the methods in which GRNs could be evaluated.However, it also has some possible pitfalls.

Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Yes
Louis, MO, USA The authors have developed scANANSE, a pipeline for gene network analysis and transcription factor binding motif analysis.scANANSE is a software tool that facilitates the creation of cell typespecific networks from gene expression data and ATAC-seq data.scANANSE employs pseudo bulk data, enabling comparisons with algorithms that rely on bulk data.scANANSE serves as an extension package of ANANSE, incorporating additional features such as data conversion from single cell data and the ability to return network analysis results to single cell data.Although data processing and manipulation are often overlooked aspects in general, they hold significant importance.This software offers the advantage of easily performing such operations, which enhances its practicality.The manuscript effectively conveys the purpose, adaptations, and limitations of scANANSE, while also providing clear examples of code usage.The overall organization is well-executed.
I have no major comments or requests to add.I will provide some minor feedback regarding specific findings from utilizing the software.
1.I got error when importing SeuratDisk.The SeuratDisk may not be included in the installation.Also, it seems SeuratDisk is not available from CRAN for R version 4. I installed SeuratDisk from GitHub.
It may be helpful to add this process in installation.
"" Attaching SeuratObject Error in library(SeuratDisk) : there is no package called 'SeuratDisk' "" 2. At page8, the file name of Seurat object may include typo.The file name in R script is "preprocessed_PDMC.Rds".It should be "preprocessed_PBMC.Rds"The benefits of publishing with F1000Research: Your article is published within days, with no editorial bias • You can publish traditional articles, null/negative results, case reports, data notes and more • The peer review process is transparent and collaborative • Your article is indexed in PubMed after passing peer review • Dedicated customer support at every stage • For pre-submission enquiries, contact research@f1000.com

Figure 1 .
Figure1.An overview of the single-cell ANANSE pipeline.After pre-processing and clustering, data is exported using either AnanseSeurat or AnanseScanpy.Next, Anansnake automatically runs ANANSE after which the influence scores and motif enrichment results with AnanseSeurat or AnanseScanpy are imported.In parallel, Anansnake runs motif enrichment analysis using gimme maelstrom, and the motif results are imported and linked to the highest correlating TFs using the single-cell object scRNA-seq data.

Figure 2 .
Figure 2. Heatmap of the influence scores.This heatmap depicts the influence scores of the top five highest influential TFs per cluster.Referenced TFs in the text are in bold.

Figure 3 .
Figure 3. Expression and influence visualisation upon the UMAP.(A) UMAP of the PBMC single-cell object with the cell identities labelled.(B) Normalised expression values of STAT4, LEF1, and MEF2C on the single-cell object.(C) Influence scores of STAT4, LEF1, and MEF2C on the single-cell object.
("darkgrey", "#fc8d59")) pdf('./scANANSE/analysis/ANANSE_highlight.pdf',width=10,height=10,paper='special') print(Annotated_plot) print(plot_expression|plot_ANANSE) dev.off()Part 6: Specific cluster comparison Although all B-cell clusters were relatively similar when compared to the average network, it is possible to directly compare both clusters.This uncovers TFs driving more subtle differences between the cell types.This direct cluster-tocluster comparison is performed by adding the two clusters in part 3 as an additional contrast.When comparing Naive B-cells and Memory B-cells, FOXP1 and BACH2 were identified as important factors driving Memory B-cell maturation compared to naive B-cells.This is in line with previous publications(Itoh-Nakadai et al., 2014; Patzelt et al., 2018).Furthermore, EBF1 and SPIB were identified as driving Naive B-cells, this is also in line with previous research(Schmidlin et al., 2008; Györy et al., 2012).Thus, these results illustrate the possibility of running comparisons on similar clusters within single-cell datasets to further identify TF networks that define cell types (Figure4).

Figure 4 .
Figure 4. ANANSE can perform direct cluster-to-cluster comparisons.The TF influence scores of TFs comparing Naive B-cells and Memory B-cells; higher influence of factors with negative fold changes are more important within memory B-cells; higher influence of factors with positive fold changes are more important in Memory B-cells.Circle size correlates with the number of direct target genes.Gene expression log2 fold change between Naive B-cells and memory B-cells on the X axis.
the rationale for developing the new software tool clearly explained?Yes Is the description of the software tool technically sound?YesAre sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?YesIs sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?YesAre the conclusions about the tool and its performance adequately supported by the findings presented in the article?YesYesIs the description of the software tool technically sound?○YesAre sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?○Yes Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?○YesAre the conclusions about the tool and its performance adequately supported by the findings presented in the article?○YesCompetingInterests: No competing interests were disclosed. .

Table 1 .
TPM file data.Example of values and layout of the TPM.tsv file generated by the export_CPM_scANANSE() function.

Table 2 .
Peak counts file data.Example of values and layout of the Peak_Counts.tsvfile generated by the export_ATAC_scANANSE() function.

Table 3 .
Marker gene file data.Example of values and layout of the hg38_cluster_average.diffexp.tsvfile generated by the DEGS_scANANSE() function.
5b. Top five influential TFs per clusterNext, the top five TFs per cluster are identified from the influence table.

Table 4 .
Influence table.Example of the influence data frame generated by per_cluster_df(assay = 'influence').
(Rosa et al., 2007;Novershtern et al., 2011;clude, SPI1 which is well known to regulate human monocyte differentiation towards dendritic cells and is identified in both monocytes and dendritic cells(Rosa et al., 2007;Novershtern et al., 2011; Zhu et al., 2012).While the CEBP gene family, including the identified CEBPD, is vital for the transduction of B-cells into macrophages(Bussmann et al., 2009).

Table 5 .
Top five TF influence scores per cell type.Referenced TFs in the text are in bold and highlighted.

Table 6 .
Motif score table.Example of the Motif score data frame generated by per_cluster_df(assay = 'maelstrom').

The method used by ANANSE are described in detail in the original manuscript. However the sentence was rather confusing so we rephrased it into: "The original ANANSE method was not developed to reliably predict repressive factors (Xu et al., 2021)."
I understand that the motif infers TF motif from DEGs and then links them to candidate TFs using expression correlation.Did the authors consider using TF motifs from HT-SELEX experiments in HOCOMOCO or CIS-BP databases instead?

First, only genes which are present in the GRN are included in the analysis. This means that GRN size influences the outcome of the result. For instance, the GRaNIE naïve macrophage network contains just 2663 TF-gene interactions and consequently, only 9% of the ~7,000 differential genes are evaluated by the R2 metric. If different GRNs have different target genes, this means that not the same expression changes are compared. Different genes (functional categories, promoter sequence content, etc.) may have varying degrees of predictability. Second, the R2 metric is influenced by large outliers. If a few genes with high fold changes have good predictions, this will skew the results. The method of inferring GRNs is very different between ANANSE and GRaNIE. ANANSE uses a single set (or the average of replicates) of expression data and enhancer activity (ATAC- seq and/or H3K27ac) from one cell type to predict the GRN of that specific cell type. However, GRaNIE incorporates co-varying gene-enhancer pairs. This means that GRaNIE needs a large number of input samples to work. All the examples in Kamal et al. use at least ~30-40 samples. Importantly, this will also skew the inferred interactions to genes that change in expression between these samples. In the example of the naïve macrophages, the samples are macrophages from different donors. The regulatory interactions will then be based on genes that change in expression due to variation (i.e. eQTLs) or due to cellular response to growth conditions etc. such as stress. This is visible in the transcription factors with a high number of targets, which include AP1 (Fos) and NFKb (Rel), which are involved in stress response. One other notable set of TFs in the GRaNIE GRN contains factors such as PURA, MBD2, ZBT14, KLF13, E2F7 that have GC-rich motifs and hint towards a promoter/CpG-island based enrichment of the method. It is unlikely that these are relevant.In contrast, ANANSE has a focus on TFs that define or determine cell identity and contains many interactions for genes such as SPI1 and CEBPB. These TFs are necessary and sufficient to differentiate pluripotent cells to macrophages. The GRaNIE network does not capture this and contains only 9 interactions for these factors. Finally, the scopes of the network are very different. The GRaNIE macrophage network consists of 2,663 interactions, the ANANSE network contains all TF to target gene interactions. It is likely that many of the interactions in the GRaNIE network are true positives (i.e. high precision), but many more will be missed. Networks from ANANSE, on the other hand, will likely be much more noisy but can serve as a source to model cellular decisions. This will identify relevant TFs as demonstrated benchmarked in the original ANANSE paper. In short, the method, scope and applicability of the two methods is different. Unsurprisingly, both chose relevant but different benchmarks to evaluate the respective method.
Is the rationale for developing the new software tool clearly explained?YesAre sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
○YesIs the description of the software tool technically sound?○ ○YesIs sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?○