TreeSummarizedExperiment: a S4 class for data with hierarchical structure

Data organized into hierarchical structures (e.g., phylogenies or cell types) arises in several biological fields. It is therefore of interest to have data containers that store the hierarchical structure together with the biological profile data, and provide functions to easily access or manipulate data at different resolutions. Here, we present TreeSummarizedExperiment, a R/S4 class that extends the commonly used SingleCellExperiment class by incorporating tree representations of rows and/or columns (represented by objects of the phylo class). It follows the convention of the SummarizedExperiment class, while providing links between the assays and the nodes of a tree to allow data manipulation at arbitrary levels of the tree. The package is designed to be extensible, allowing new functions on the tree (phylo) to be contributed. As the work is based on the SingleCellExperiment class and the phylo class, both of which are popular classes used in many R packages, it is expected to be able to interact seamlessly with many other tools.


Introduction
Biological data arranged into a hierarchy occurs in several fields. A notable example is in microbial survey studies, where the microbiome is profiled with amplicon sequencing or whole genome shotgun sequencing, and microbial taxa are organized as a tree according to their similarities in the genomic sequence or the evolutionary history. Also, a tree might be used in single cell cytometry or RNA-seq data, with nodes representing cell subpopulations at different granularities 1 . Currently, phyloseq 2 and SingleCellExperiment 3 are popular classes used in the analysis of microbial data and single cell data, respectively. The former supports the information pertaining to the hierarchical structure that is available as the phylo class (e.g., phylogenetic tree), and the latter is derived from the SummarizedExperiment class (defined in the SummarizedExperiment package 4 ), which is widely used as a standardized container across many Bioconductor packages. Since the data structures in these fields share similarities, we were motivated to develop an S4 class 5 , TreeSummarizedExperiment, that not only leverages the facilities from the SummarizedExperiment class, but also bridges the functionality from the phylo class, which is available from the ape 6 package and has been imported in more than 200 R packages.
We define TreeSummarizedExperiment by extending the SingleCellExperiment class, so that it is a member of the SummarizedExperiment family, and thus benefits from the comprehensive Bioconductor ecosystem (e.g., iSEE 7 , SEtools 8 , and ggbio 9 ). At the same time, all slots of the phyloseq class have their corresponding slots in the TreeSummarizedExperiment class, which enables convenient conversion between these classes. Furthermore, we allow the link between profile data and nodes of the tree, including leaves and internal nodes, which is useful for algorithms in the downstream analysis that need to access internal nodes of the tree (e.g., treeclimbR 1 ).
Overall, the class TreeSummarizedExperiment is provided as a standalone R package, analogous to SummarizedExperiment and SingleCellExperiment. Thus, it is convenient for R package developers to import it and build downstream data analyses or visualizations on it. Also, it is flexible to combine with R packages that are linked to the SummarizedExperiment family or the phylo tree class, which enables R package users to explore data with the support of other tools.

Implementation
The structure of TreeSummarizedExperiment. The structure of the TreeSummarizedExperiment class is shown in Figure 1.
Compared to the SingleCellExperiment objects, TreeSummarizedExperiment has five additional slots: • rowTree: the hierarchical structure on the rows of the assays.
• rowLinks: the link information between rows of the assays and the rowTree.
• colTree: the hierarchical structure on the columns of the assays.
• colLinks: the link information between columns of the assays and the colTree.
• referenceSeq (optional): the sequence information for rows of the assays.
The rowTree and/or colTree can be left empty (NULL) if no trees are available; in this case, the rowLinks and/or colLinks are also set to NULL. The referenceSeq is an optional slot to store the sequence data of Amendments from Version 1 TreeSummarizedExperiment (TSE) now allows rowTree() and colTree() to work as both setters and getters, provides a new slot referenceSeq() to store sequence information, and replaces aggValue with aggTSE to provide more flexible data aggregation. The combination of multiple TSE objects is enabled, for which a new column whichTree is added in LinkDataFrame for rowLinks()/colLinks() to register which rows and columns are mapped to which trees in rowTree() & colTree(). Also, an example analysis of CyTOF data is added as a new use case of TreeSummarizedExperiment. This necessarily added new commands and text to describe new features of TSE. Otherwise, all text and figures have remained the same.
Figures and Tables: Figure 1: a new slot referenceSeq is added, Table 1: three new functions showNode, addLabel, joinNode are added, Figure 6, 7, 8: The issue about the cut out of legends is fixed, Figure 12 The rowTree and colTree slots require the tree to be an object of the phylo class. If a tree is available in an alternative format, it can often be converted to a phylo object using dedicated R packages (e.g., treeio 10 ).
Functions in the TreeSummarizedExperiment package fall in two main categories: operations on the TreeSum-marizedExperiment object or operations on the tree (phylo) objects. The former includes constructors and accessors, and the latter serves as "components" to be assembled as accessors or functions that manipulate the TreeSummarizedExperiment object. Given that more than 200 R packages make use of the phylo class, there are many resources (e.g., ape) for users to manipulate the small "pieces" in addition to those provided in TreeSummarizedExperiment.
The toy datasets as the example data We generate a toy dataset that has observations of 6 entities collected from 4 samples as an example to show how to construct a TreeSummarizedExperiment object.

## entity5
The information of entities and samples are given in the row_data and col_data, respectively.

The construction of TreeSummarizedExperiment
The TreeSummarizedExperiment class is used to store the toy data generated in the previous section: assay_data, row_data, col_data, col_tree and row_tree. To correctly store data, the link information between the rows (or columns) of assay_data and the nodes of the row_tree (or col_tree) can be provided via a character vector rowNodeLab (or colNodeLab), with length equal to the number of rows (or columns) of the assays; otherwise the row (or column) names are used. Tree data takes precedence to determine entities included during the creation of the TreeSummarizedExperiment object; columns and rows with labels that are not present among the node labels of the tree are removed with warnings. The link data between the assays tables and the tree data is automatically generated during the construction.
The row and column trees can be included simultaneously during the construction of a TreeSummarized-Experiment object. Here, the column names of assay_data can be found in the node labels of the column tree, which enables the link to be created between the column dimension of assay_data and the column tree col_tree. If the row names of assay_data are not in the node labels of row_tree, we would need to provide their corresponding node labels (row_lab) to rowNodeLab in the construction of the object. It is possible to map multiple rows or columns to a node, for example, the same leaf label is used for the first two rows in row_lab.
# all column names could be found in the node labels of the column tree all(colnames(assay_data) %in% c(col_tree$tip.label, col_tree$node.label))

The accessor functions
Slots inherited from the SummarizedExperiment class can be accessed in the standard way (e.g., via accessors assays(), rowData(), colData() and metadata()). These functions are both getters and setters. To clarify, getters and setters are functions for users to retrieve and to overwrite data from the corresponding slots, respectively. Here, accessors for TreeSummarizedExperiment are both getters and setters unless specifically mentioned.
For new slots, we provide rowTree (and colTree) to access the row (column) trees, and rowLinks (and colLinks) as getters to retrieve the link information between assays and the row (column) tree. If the tree is not available, the corresponding link data is NULL. The link data objects are of the LinkDataFrame class, which extends the DataFrame class from S4Vectors with the restriction that it has five columns: • nodeLab: the labels of nodes on the tree The subsetting function A TreeSummarizedExperiment object can be subset in two different ways: [ to subset by rows or columns, and subsetByNode to retrieve row and/or columns that correspond to nodes of a tree. As the numeric ID of a node changes with the cut of a phylo tree, to keep track of the original data, we do not prune the tree structure in the subsetting. Below, we can see that rowLinks and rowData are updated to have the same number of rows as assays. Subsetting simultaneously in both dimensions is also allowed.

Aggregation
Since it may be of interest to report or analyze observed data at multiple resolutions based on the provided tree(s), the TreeSummarizedExperiment package offers functionality to flexibly aggregate data to arbitrary levels of a tree.

The column dimension.
Here, we demonstrate the aggregation functionality along the column dimension. The desired aggregation level is given in the colLevel argument, which can be specified using node labels (orange text in Figure 3) or node numbers (blue text in Figure 3). Furthermore, the summarization method used to aggregate multiple values can be specified via the argument colFun.
# use node labels to specify colLevel agg_col <-aggTSE(x = taxa_tse, colLevel = c("GroupA", "GroupB"), colFUN = sum) # or use node numbers to specify colLevel agg_col <-aggTSE(x = taxa_tse, colLevel = c(6, 7), colFUN = sum) assays(agg_col) [ The colLinks is also updated to link the new rows of assays tables to the corresponding nodes of the column tree ( Figure 3). Both dimensions. The aggregation on both dimensions could be performed in one step, in which case users can specify the order of aggregation; either rows first (rowFirst = TRUE) or columns first (rowFirst = FALSE). The aggregate functions for the row and the column dimension can be provided via rowFun and colFun, respectively. Additionally, parallel computation is enable by providing BPPARAM with a BiocParallelParam object.
agg_both <-aggTSE(x = both_tse, rowLevel = 7:9, rowFun = sum, colLevel = 6:7, colFun = mean, rowFirst = FALSE) As expected, we obtain a table with 3 rows representing the aggregated row nodes 7, 8 and 9 (rowLevel = 7:9) and 2 columns representing the aggregated column nodes 6 and 7 (colLevel = 6:7). Functions operating on the phylo object Next, we highlight some functions to manipulate and/or to extract information from the phylo object. Further operations can be found in other packages, such as ape 6 , tidytree 12 . These functions are useful for users who wish to develop more functions for the TreeSummarizedExperiment class.
To show these functions, we use the tree shown in Figure 5.

Conversion of the node label and the node number
The translation between the node labels and node numbers can be achieved by the function convertNode.
convertNode(tree = tinyTree, node = c(12, 1, 4)) ## [1] "Node_12" "t2" "t9" convertNode(tree = tinyTree, node = c("t4", "Node_18")) ## More functions. We list some functions that might also be useful in Table 1. More functions are available in the package, and we encourage users to develop and contribute their own functions to the package. Here, convertNode and trackNode are available in TreeSummarizedExperiment, and keep.tip is from the ape package. Since the numeric identifier of a node is changed after pruning a tree with keep.tip, trackNode is provided to track the node and further update links between the rectangular assay matrices and the new tree.

Operation
The TreeSummarizedExperiment package can be installed by following the standard installation procedures of Bioconductor packages.
# install BiocManager if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") # install TreeSummarizedExperiment package BiocManager::install("TreeSummarizedExperiment") Minimum system requirements is R version 3.6 (or later) on a Mac, Windows or Linux system. The version of TreeSummarizedExperiment should be later than 1.99.10, which is available in Bioconductor 3.13 Use cases HMP 16S rRNA-seq data To demonstrate the functionality of TreeSummarizedExperiment, we use it to store and manipulate a microbial dataset. We further show exploratory graphics using the available functions designed for SummarizedExperiment objects in other packages (e.g., scater), or customized functions from popular visualization packages (e.g.,  (7): RSID VISITNO ... HMP_BODY_SUBSITE SRS_SAMPLE_ID # name the assay names(assays(v35)) <-"Count" The storage of HMP 16S rRNA-seq data We store the phylogenetic tree as the rowTree. Links between nodes of the tree and rows of assays are automatically generated in the construction of the TreeSummarizedExperiment object, and are stored as rowLinks. Rows of the assays matrices that do not have a match to nodes of the tree are removed with warnings.

Dimensionality reduction
We visualize samples in reduced dimensions to see whether those from the same body site are similar to each other. Three dimensionality reduction techniques are available in the package scater (v. 1.16.2), including principal component analysis (PCA) 15 , t-distributed Stochastic Neighbor Embedding (t-SNE) 16 , and uniform manifold approximation and projection (UMAP) 17 . Since TreeSummarizedExperiment extends the SingleCellExperiment class, functions from scater 18 can be used directly. Here, we first apply PCA and t-SNE on data at the OTU level, and select the one better clustering the samples to apply on data aggregated at coarser taxonomic levels to see whether the resolution affects the separation of samples.

PCA and t-SNE at the OTU level
The PCA is performed on the log-transformed counts that are stored in the assays matrix with the name logcounts. In practice, data normalization is usually applied prior to the downstream analysis, to address bias or noise introduced during the sampling or sequencing process (e.g., uneven sampling depth). Here, the library size is highly variable (Figure 7) and non-zero OTUs vary across body sites. It is difficult to say what is the optimal normalization strategy, and the use of an inappropriate normalization method might introduce new biases. The discussion of normalization is outside the scope of this work. To keep it simple, we will visualize data without further normalization.
In Figure 8, we see that the Oral samples are distinct from those of other body sites. Samples from Skin, Urogenital Tract, Airways and Gastrointestinal Tract are not separated very well in the first two principal components.

t-SNE on broader taxonomic levels
To organize data at different taxonomic levels, we first replace the phylogenetic tree with the taxonomic tree that is generated from the taxonomic table. Due to the existence of polyphyletic groups, a tree structure cannot be generated. For example, the Alteromonadaceae family is from different orders: Alteromonadales and Oceanospirillales. # taxonomic tree tax_order <-c("SUPERKINGDOM", "PHYLUM", "CLASS", "ORDER", "FAMILY", "GENUS", "CONSENSUS_LINEAGE") tax_0 <-data.frame(rowData(tse_phy) [ To resolve the loops, we add a suffix to the polyphyletic genus with resolveLoop. For example, Ruminococcus belonging to the Lachnospiraceae and the Ruminococcaceae families become Ruminococcus_1 and Ruminococcus_2, respectively. A phylo tree is created afterwards using toTree. tax_1 <-resolveLoop(tax_tab = tax_0) tax_tree <-toTree(data = tax_1) # change the tree tse_tax <-changeTree(x = tse_phy, rowTree = tax_tree, rowNodeLab = rowData(tse_phy)$CONSENSUS_LINEAGE) The separation of samples from different body sites appears to be worse when the data on broader resolution is used (Figure 11).

CyTOF data
Here, a mass cytometry (CyTOF) dataset 19 is used to show the application of TreeSummarizedExperiment on single cell data. The data was available initially as a SummarizedExperiment object, and became a TreeSummarizedExperient object after the incorporation of a tree on cells. Data was then aggregated along nodes of the tree to provide data at different resolutions. The data visualization was finally performed as heatmaps along with the tree using the R package ggtree. We preprocess the data and cluster cells using the workflow from the diffcyt package 21,22 . According to the median expressions of lineage markers per cluster, a tree cytof_hclust is then constructed by applying the hierarchical clustering on the cell cluster level, using only the "type" gene.

Data aggregation
We split the data into two TreeSummarizedExperiment objects: lse_type for lineage markers and lse_state for functional markers, to perform aggregation in different ways. For lse_type, the marker median expression is calculated over all samples to compare expression patterns of lineage markers across all cell clusters. For lse_state, the marker median expression is computed on individual samples, to enable comparison between stimulated and unstimulated samples across clusters.

## [1] 1552 14
After aggregation, tse_type and tse_state have 97 and 1552 rows, respectively. The former has each row representing a cell cluster that is mapped to a node of the tree; the latter has each row representing a cell cluster in a sample.
In the downloaded data, cells are annotated with cell types (population_id in rowData()). As clustering is not perfect, cells within a cluster are not expected to have exactly the same cell type. Therefore, we would like to annotate a cell cluster with the cell type that the majority of cells (> 60%) belong to, or mixed if none of cell types has more than > 60% cells. Note, internal nodes of the tree cytof_tree are considered as cell clusters at broader resolution than leaf nodes. To annotate an internal node, we need to first find all cells that are mapped to its descendant leaves, and then take the cell type shared by its majority of cells.

Visualization
The cytof_tree tree is considered as a hierarchical structure organizing cell clusters at different granularities. So, an internal node is a cell cluster that incorporates several cell clusters represented by its descendant leaves. Here, we are interested in exploring the expression profile of markers at different resolutions.

Summary
TreeSummarizedExperiment is an S4 class in the family of SummarizedExperiment classes, which enables it to work seamlessly with many other packages in Bioconductor. It integrates the SummarizedExperiment and the phylo class, facilitating data access or manipulation at different resolutions of the hierarchical structure. By providing additional functions for the phylo class, we support users to customize functions for the TreeSummarizedExperiment class in their workflows.

Data availability
Underlying data Human Microbiome Project data (v35) and mass cytometry (CyTOF) dataset 19 were used for the presented use cases. They can be downloaded using the R package HMP16SData 14 and HDCytoData 20 , respectively.

Software availability
The TreeSummarizedExperiment package is available at: Author information RH developed the software with contributions from FGME. All authors participated in the conceptualization of the software. RH, CS and MDR drafted the manuscript with review and editing from KCRA, SCH, GY and FGME. All authors read and approved the final manuscript.

Open Peer Review
with the authors of this manuscript. This had not influenced my original review, and I think the feedback in that original review has been adequately addressed now.

Matthew Ritchie
The Walter and Eliza Hall Institute of Medical Research, Parkville, Vic, Australia Huang et al. describe the TreeSummarizedExperiment package, which provides well-designed S4 infrastructure that couples the phylo and SingleCellExperiment classes to create a container for high-throughput data that can be organised in a tree-like structure.
The article is structured like a vignette, providing an overview of the class ( Figure 1) and stepping the reader through the process of setting up a TreeSummarizedExperiment object and accessing and assigning data to its various slots, firstly for a toy data set and then for data from the Human Microbiome Project.
The article is very clearly written, and the authors demonstrate the ability to use TreeSummarizedExperiment objects in conjunction with other established software for dealing with trees (e.g. ggtree and tidyTree) or dimensionality reduction of high-throughout data (e.g. scater). One topic that I was interested to read more about was its potential use in a single cell RNA-seq analysis. Perhaps use cases for such applications can be added as future work. The TreeSummarizedExperiment package has been available from Bioconductor since May 2019 and it has been downloaded > 2.4K times, which indicates it is being taken up by the community. .0], a Bioconductor package aimed at providing an S4 class for omics data with hierarchical tree structure. The TreeSummarizedExperiment class builds on the popular SingleCellExperiment object class, with additional slots in which hierarchical structure, in the form of phylo objects, can be added for features (rows) and observations (columns). In addition to the object class, the package contains several functions for manipulating these objects, ranging from getting/setting/resetting the tree slots, aggregating across rows and/or columns, and various analytical tasks operating on the phylo objects.
The article is well written with clear motivation and description of the package, and addresses an important problem of performing analysis of high dimensional hierarchically structured data using object-oriented programming. I have a few further comments and questions that may improve the breadth of use of TreeSummarizedExperiment by the research community. Is there a way to simply include an argument for aggValue() that would swap the order to columns first and rows second, rather than requiring the user to perform two distinct operations?

○
The new slots, rowTree, colTree, rowLinks and colLinks are 'getter' accessors but not currently 'setter' functions. I can imagine a popular use-case among users with an already constructed object of class SummarizedExperiment or SingleCellExperiment would be to simply use as(, "TreeSummarizedExperiment") and then attempt to add the additional slots, for example as the output of hclust( The new slots, rowTree, colTree, rowLinks and colLinks are 'getter' accessors but not currently 'setter' functions. I can imagine a popular use-case among users with an already constructed object of class SummarizedExperiment or SingleCellExperiment would be to simply use as(, "TreeSummarizedExperiment") and then attempt to add the additional slots, for example as the output of hclust(). I would suggest prioritising converting these functions to both 'getter' and 'setter', or perhaps adding a constructor usage for TreeSummarizedExperiment for objects that are already SummarizedExperiment or SingleCellExperiment, if possible.
rowTree and colTree are now both setters and getters. When the row/column tree is replaced, the rowLinks/colLinks is updated automatically. To avoid breaking links between assays and trees, we don't recommend users to modify the rowLinks/colLinks data. Therefore, rowLinks/colLinks are still kept as getters. 2.

I'm interested in how TreeSummarizedExperiment would work in the case
where the hierarchical structure is not a typical single tree, but comprising of multiple distinct tree structures. An example of such is single cell (or single clone) lineage data where there exists a tree structure within each experimental condition, but not between cells from different conditions. Would the colTree slot correspond to a list of trees in this case?
Yes, it's possible to have a list of trees in the rowTree/colTree. In the rowLinks/colLinks, we have added a new column (whichTree) to give information about which row/column tree a row/column is mapped to. We have also added a new vignette describing how to combine multiple TSEs. ( https://www.bioconductor.org/packages/devel/bioc/vignettes/TreeSummarizedExperiment/inst/doc/T )

3.
How would one go about combining different TreeSummarizedExperiment objects? Do the typical cbind() and rbind() operations have meaning here? In which cases are they not to be used?
rbind() and cbind() are now implemented for TreeSummarizedExperiment objects. To rbind() multiple TSEs successfully, it's required that the TSEs agree in the column dimension to have the same colTree() and colLinks(). Similarly, cbind() would require TSEs to have the same rowTree() and rowLinks(). More detailed information is available in the new vignette about combining multiple TSEs. ( https://www.bioconductor.org/packages/devel/bioc/vignettes/TreeSummarizedExperiment/inst/doc/T )

4.
I would be interested in getting to a clustered heatmap as an example of visualisation for the TreeSummarizedExperiment, either implemented using ggplot2/ggtree, or other packages like ComplexHeatmap?

5.
We have added a new use case of TSE on CyTOF data, and customized a visualization function based on ggtree, ggplot2 and ggnewscale to plot a clustered heatmap.
How would the tree structure information storage scale in terms of the number of rows and columns, or in the hierarchical structures.
We store the tree structure as a phylo object. The size of a phylo object is quite small even for a tree with 10 6 leaves (about 90 Mb). To set up the link between rows/columns to a tree, it takes only a few seconds even for 10 6 rows to a tree with 10 6 leaves.
The typo and the legend cut off are fixed.

7.
Competing Interests: No competing interests were disclosed.
Reviewer Report 19 October 2020 https://doi.org/10.5256/f1000research.29440.r73188 © 2020 Lahti L. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Leo Lahti
Department of Computing, Faculty of Technology, University of Turku, Turku, Finland This software article introduces TreeSummarizedExperiment, a S4 class for hierarchically structured data in R. This provides a very generic data structure that serves for instance the single cell and microbiome bioinformatics communities, and has already gathered remarkable attention with a growing user base. The package is mature and has been available via Bioconductor for some time already.
The rationale for developing the new software tool has been clearly explained, and sufficient examples are provided. It extends the popular SingleCellExperiment class structure by bringing in tree info on data rows and cols (based on the phylo class). The new extended class has potentially many new applications, for instance in microbiome research; concrete examples are provided. The new class combines and extends other common class structures, which is very beneficial for the overall compatibility. Many tools for manipulation and use already exist based for instance on related work on the SummarizedExperiment family of classes, phylo tree structure, and the phyloseq class.
The overall description of the software is technically sound and follows standard conventions in the R/Bioconducor community. Sufficient details have been provided to allow replication of the expertise to confirm that it is of an acceptable scientific standard. For the new slots, the data manipulation depends on the functions that users have applied on the tree object (of class phylo). These functions might be from TSE or outside TSE. For functions from TSE, either those working on the phylo tree (e.g., findDescendant, convertNode, matTree, addLabel) or those working on TSE (e.g., rowTree, colTree, rowLinks, colLinks, changeTree), takes only seconds even for a tree with up to 100,000 nodes.

1.
How easy it would be to incorporate further supporting information on the rows and columns, for instance on DNA/RNA sequence information?
TSE now has a slot referenceSeq to store the sequence information of features ( rows).

2.
The class is very generic; is the idea that this package can be used as such in (hierarchical) single-cell experiments, microbiome research, and potentially other fields that have little overlap currently? Or is this package meant to be a fundamental structure that can be further extended in more specific application domains? Some more discussion on these aspects could help to contextualize the new class.
Currently, there is not much overlap in the community across fields, e.g, single-cell experiments, microbiome research. But, we do see that they share similarities in data structures, and can potentially share synergies in data visualization or analysis. We provide TSE as a standalone R package like SummarizedExperiment and SingleCellExperiment, and propose it as a convenient starting point to create R packages for downstream analysis or visualization of data with tree structures. We are open to update our work or receive pull requests if new features (or slots) required in a specific field are feasible to be integrated to TreeSummarizedExperiment. For example, a new optional slot referenceSeq(), which was requested mainly for microbiome data to store RNA/DNA sequencing information, has been developed by Félix G.M. Ernst, and the PR has been accepted in TreeSummarizedExperiment. 3.