Visualizing balances of compositional data: A new alternative to balance dendrograms

Balances have become a cornerstone of compositional data analysis. However, conceptualizing balances is difficult, especially for high-dimensional data. Most often, investigators visualize balances with the balance dendrogram, but this technique is not necessarily intuitive and does not scale well for large data. This manuscript introduces the 'balance' package for the R programming language. This package visualizes balances of compositional data using an alternative to the balance dendrogram. This alternative contains the same information coded by the balance dendrogram, but projects data on a common scale that facilitates direct comparisons and accommodates high-dimensional data. By stripping the branches from the tree, 'balance' can cleanly visualize any subset of balances without disrupting the interpretation of the remaining balances. As an example, this package is applied to a publicly available meta-genomics data set measuring the relative abundance of 500 microbe taxa.


Introduction
A composition is a vector of positive measurements that sum to an arbitrary total 1 . Examples of compositions include measurements recorded in parts per million (ppm) or percentages, but also include measurements that are less obviously parts of the whole (e.g., count data generated by next-generation sequencing 2 ). A component is one part of a composition. Compositional data analysis (CoDA) deals with the analysis of compositions. Compositional data, because they contain values bounded from zero to one, exist in a non-Euclidean space that renders conventional statistical methods invalid. To deal with compositionality, CoDA typically begins with a log-ratio transformation that maps data into an unbounded space where conventional statistical methods can be used. The simplest transformations, the centered log-ratio transformation and the additive log-ratio transformation, use a simple reference as the denominator of the log-ratio. A more complex transformation, the isometric log-ratio transformation, transforms the composition with respect to an orthonormal basis 3 . Alternatively, one could analyze the log-ratio of each component to the other directly 4,5 .
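As a minimal sketch of the two simplest transformations named above, the following base R code computes the centered and additive log-ratio of a single composition. The vector x, the helper names clr and alr, and the choice of the last component as the additive reference are illustrative assumptions, not part of any package discussed here.

```r
# Hypothetical 3-part composition (values chosen for illustration only).
x <- c(a = 0.2, b = 0.3, c = 0.5)

# Centered log-ratio: log of each part over the geometric mean of all parts.
clr <- function(x) log(x) - mean(log(x))

# Additive log-ratio: log of each part over one reference part
# (assumed here to be the last component).
alr <- function(x, ref = length(x)) log(x[-ref] / x[ref])

clr_x <- clr(x)
alr_x <- alr(x)

# Scale invariance: rescaling the composition leaves the clr unchanged.
clr_scaled <- clr(100 * x)
```

The clr returns as many coordinates as there are parts (and they sum to zero), whereas the alr returns one fewer because the reference part is used up as the denominator.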
Balances use a sequential binary partition (SBP) to define an orthonormal basis that splits the composition into a series of non-overlapping groups 6 . This design allows for an interpretation of the data at the level of the isometric log-ratio coordinates 7 . This SBP contains a diverging set of contrasts that are each interpretable as a measure of "Group 1 vs. Group 2" (following an isometric log-ratio transformation). For a D-part composition, the SBP defines D − 1 balances that decompose the variance such that the sum of the sample-wise variances for each balance in the tree equals the total sample-wise variance 6 . Balances (like the centered log-ratio transformation and the isometric log-ratio transformation) satisfy all properties required for compositional data analysis: scale invariance, permutation invariance, perturbation invariance, and sub-compositional dominance (reviewed in 8 and elsewhere).
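The variance decomposition can be written out explicitly. Assuming b_i denotes the i-th balance of a D-part composition x and clr_j denotes the j-th centered log-ratio coordinate (notation introduced here for illustration), the sample-wise variances satisfy:

```latex
\sum_{i=1}^{D-1} \operatorname{var}(b_i)
  \;=\; \sum_{j=1}^{D} \operatorname{var}\!\big(\operatorname{clr}_j(x)\big)
  \;=\; \frac{1}{2D} \sum_{j=1}^{D} \sum_{k=1}^{D}
        \operatorname{var}\!\left(\log \frac{x_j}{x_k}\right)
```

The right-hand side is the usual definition of the total variance in terms of all pairwise log-ratios, so each balance accounts for a well-defined share of it.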
Although balances have proved useful for the analysis of compositional data, their usual application depends on generating a meaningful SBP. Sometimes, this involves manually creating an SBP based on expert opinion, with or without the assistance of exploratory analyses 6 . However, using expertise to build an SBP is not always desirable, especially for high-dimensional data (where each composition can measure thousands of components). Principal balance analysis is a data-driven alternative that, similar to principal component analysis, seeks to identify an SBP whose balances successively explain the maximal variance of a data set (a computationally expensive procedure approximated with heuristics) 9,10 . In the field of meta-genomics, where next-generation sequencing is used to count the relative abundance of microbe taxa, scientists have applied balances of SBPs to summarize and classify microbiome samples 11 . One study defined the SBP by hierarchically clustering the microbe taxa based on the outcome of interest 12 . Another defined the SBP based on the phylogenetic relationship between microorganisms 13 .
Once an SBP is generated, its balances can be visualized using a balance dendrogram 14 . The balance dendrogram illustrates (a) the distribution of samples across the balance, (b) the relationship between balances along the SBP tree, and (c) the decomposition of variance 6, 15 . In addition, a balance dendrogram can show differences between sub-groupings of samples by coloring facets of the box plots. Although balance dendrograms capture a vast amount of data, the balance dendrogram may not provide the optimal visualization of balances. First, by building the figure around a tree, balance dendrograms place emphasis on the relationship between the balances, and not on the balances themselves. Second, each box plot has a unique scale positioned sporadically along the tree such that direct comparisons between one balance and all others become difficult. Third, the decomposition of variance uses lines that run parallel to the dendrogram branches, potentially confusing these concepts through use of a common symbol. In this software article, I present the R package balance for visualizing balances of compositional data. This package provides an alternative to the balance dendrogram that I hope will simplify balances for scientists less familiar with compositional data analysis.

Implementation
Within the R package universe, there are three standalone and well-documented tools for general compositional data analysis: compositions 16 , robCompositions 17 , and zCompositions 18 . The compositions::CoDaDendrogram function plots an archetypal balance dendrogram. There are also a number of domain-specific tools, tailored to next-generation sequencing data, and shown to work effectively 19,20 : ALDEx2 21,22 and ANCOM 23 for differential abundance analysis, SparCC 24 and SPIEC-EASI 25 for the correlation analysis of sparse networks, propr 26,27 for proportionality analysis, and philr 13 for the analysis of phylogeny-based balances. Of these, the philr package computes balances and visualizes them with dendrograms, but does not plot a balance dendrogram per se.
The balance package is available for the R programming language and uses ggplot2 28 to visualize the distribution of samples across balances of a sequential binary partition (SBP) matrix. Each balance is calculated by the formula:

$$ b_i = \sqrt{\frac{|i_p|\,|i_n|}{|i_p| + |i_n|}} \; \log \frac{g(i_p)}{g(i_n)} $$

where g(x) is the geometric mean of x, i_p is the sub-composition of positively-valenced components, and i_n is the sub-composition of negatively-valenced components. Here, |i_p| describes the norm, or length, of the sub-composition (that is, the number of components it contains).
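To make the calculation concrete, here is a minimal base R sketch that computes balances from an SBP matrix using the standard balance form (a scaling factor built from the two group sizes, multiplied by the log-ratio of the two geometric means). The function name balances_from_sbp and the toy data are hypothetical, and this is not the package's own implementation.

```r
# Sketch: compute balances from an SBP matrix (base R only).
# Rows of `x` are samples. Rows of `sbp` are components and columns are
# balances, with entries +1 (numerator), -1 (denominator), 0 (not involved).
balances_from_sbp <- function(x, sbp) {
  logx <- log(as.matrix(x))
  out <- sapply(seq_len(ncol(sbp)), function(i) {
    pos <- sbp[, i] == 1
    neg <- sbp[, i] == -1
    n_p <- sum(pos)  # number of components in the numerator group
    n_n <- sum(neg)  # number of components in the denominator group
    # The log-ratio of geometric means equals the difference of mean logs.
    sqrt(n_p * n_n / (n_p + n_n)) *
      (rowMeans(logx[, pos, drop = FALSE]) - rowMeans(logx[, neg, drop = FALSE]))
  })
  colnames(out) <- colnames(sbp)
  out
}

# Hypothetical toy data: 6 samples, 3 components.
set.seed(1)
x <- matrix(runif(18, 1, 10), nrow = 6,
            dimnames = list(NULL, c("a", "b", "c")))
sbp <- cbind(z1 = c(1, 1, -1),   # {a, b} vs. {c}
             z2 = c(1, -1, 0))   # {a} vs. {b}
b <- balances_from_sbp(x, sbp)

# Variance check: balance variances sum to the total (clr) variance.
clr <- log(x) - rowMeans(log(x))
total_var <- sum(apply(clr, 2, var))
balance_var <- apply(b, 2, var)
```

Comparing sum(balance_var) with total_var gives a quick check that an SBP is complete and correctly signed, since the two agree only when the balances form an orthonormal basis.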

[Table: columns housing, foodstuffs, alcohol, other, services; contents not recovered]
Optionally, users can color components or samples based on user-defined groupings. To do this, users must provide a vector of group labels for each component via the d.group argument (or for each sample via the n.group argument). The boxplot.split argument facets the box plots similar to the balance dendrogram 15 .
group <- c(rep("A", 10), rep("B", 10))
res <- balance(expenditures, y1, n.group = group, boxplot.split = TRUE)
Figure 1 compares the balance dendrogram to its alternative using the robCompositions data 17 .
Figure 1. This figure shows a balance dendrogram and its alternative, both prepared using the data from Table 1 and Table 2. On the left, the first branch of the balance dendrogram shows how the "services" and "other" components are contrasted against the remaining components. The box plot positioned at the branch shows the distribution of samples within this balance. The length of the trunk shows the proportion of variance explained by this balance. On the right, this same information gets captured by a two-panel figure. The top balance in the left panel shows how the "services" and "other" components are contrasted against the remaining components. The top balance in the right panel shows the distribution of samples within this balance. In the right panel, the line length shows the range of the sample distribution, while its thickness shows the proportion of variance explained. Note that the median of this first contrast sits slightly positive, meaning that most samples spend more on ["alcohol", "foodstuffs", "housing"].

Use cases
As a use case, a publicly available microbiome data set is analyzed using balances. These data measure the abundance of microbe taxa in the feces of diabetics and their non-diabetic relatives 30 , making it a true relative data set. Since these data contain many zeros that disrupt the log-ratio transformations, the zeros are first replaced through imputation by the zCompositions package. See the Supplementary Information for a demonstration of other pre-processing steps.
To identify balances for visualization, a sequential binary partition (SBP) matrix is made by hierarchically clustering components based on their proportionality measure φ_s (used here as a dissimilarity measure 27 ), thus joining together components that covary similarly across all samples. The ape 31 and philr 13 packages transform the tree object into an SBP ready for analysis and visualization.
# for compositional data with samples as rows
data.no0 <- zCompositions::cmultRepl(data, method = "CZM")
pr <- propr::propr(data.no0, metric = "phs")
h <- hclust(as.dist(pr@matrix))
phylo <- ape::as.phylo(h)
sbp <- philr::phylo2sbp(phylo)
# it is helpful to name the balances
colnames(sbp) <- paste("z", 1:ncol(sbp))
res <- balance::balance(data.no0, sbp, size.text = 4, size.pt = 1)
Supplementary Figure  However, unlike a balance dendrogram, components and samples are projected on a common scale that facilitates direct comparisons and accommodates high-dimensional data. Yet, the main advantage of the balance package is that, by stripping the branches from the tree, it becomes possible to visualize any subset of balances without disrupting the interpretation of the remaining balances. In Figure 2, we subset the visualization to include only the top 10 most explanatory balances, ranked by the proportion of variance explained. In Figure 3, we repeat the visualization of the top 10 most explanatory balances, with points colored by the user-defined groupings.
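The ranking step behind such a subset can be sketched in base R. This is only an illustration under stated assumptions: b stands in for a samples-by-balances matrix with made-up values, and the code that actually produced Figure 2 is not shown in this section.

```r
# Sketch: rank balances by proportion of variance explained, keep the top k.
# `b` is a hypothetical samples-by-balances matrix built deterministically
# so that each column's variance is proportional to the square of its scale.
scales <- c(z1 = 0.5, z2 = 3, z3 = 1, z4 = 2)
base <- seq(-1, 1, length.out = 11)
b <- sapply(scales, function(s) s * base)

# Proportion of the total variance carried by each balance.
balance_var <- apply(b, 2, var)
prop_var <- balance_var / sum(balance_var)

# Order balance names from most to least explanatory and keep the first k.
ranked <- names(sort(prop_var, decreasing = TRUE))
k <- 2
top_k <- ranked[seq_len(k)]
```

Because the balances are named, the resulting character vector could then be used to index the matching columns of an SBP matrix (for example sbp[, top_k], assuming the SBP columns carry the same names) before plotting.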

Data availability
All data used for this analysis were acquired from the supplement of Heintz-Buschart et al. 30 . The supplement of this manuscript contains code to pre-process these data and reproduce the analysis.

Software availability
Software and source code available from: https://github.com/tpq/balance
Archived source code at time of publication: https://doi.org/10.5281/zenodo.1326860 29
Software license: GPL-2

Author contributions
T.P.Q. designed the project, implemented the package, and wrote the manuscript.

Competing interests
No competing interests were disclosed.
I think that this new presentation of balance data is useful in the low-dimensional setting. Unfortunately, the interpretation of these split diagrams still appears difficult in high-dimensional space, as shown in the supplementary information example, and needs an inspection of each balance sub-space. While representing the proportion of variance using line thickness is innovative, it is also difficult to compare between different balances.
The code to do this is based on ggplot2, available and easy to use and adapt to each user's needs. Input is a compositional data frame and a binary partition, depending on the role of the component in the balance. Thus, the input must be generated outside of the present code, which only represents the data. This allows for flexibility, and the code is focused only on calculation and representation of pre-specified balances.
I have some minor suggestions to improve the functionality of the code which may facilitate data interpretation:
Selection: The author has added the possibility to plot multi-group boxplots. The author already mentions that this could be used for statistical testing, but it would be helpful to add an option to select/highlight those balances that have statistically significant differences among groups and/or plot some statistical testing results on the right-hand plot.
Ranking: It looks like balances are ranked on the proportion of variance explained. However, in the provided code/examples, it is unclear whether the default ranking of the balances is by decreasing proportion of variance explained, since this is not the case when the option weigh.var is set to TRUE. It would also be useful to rank the balances according to their discrimination power over a response variable.
With the standard balance dendrogram, when overlapping datasets, the variance for each of the subgroups could be represented (Pawlowsky-Glahn, Egozcue, Austrian Journal of Statistics, 2011). I think this feature is lost in this representation.
The manuscript is well written and easy to follow by the expert reader. I'd like to highlight some minor points.
The author states that these diagrams can accommodate high dimensional data, but it does that by focusing on sub-groups of the high dimensional data, and in the code, these sub-groups are selected prior to calling the function. Therefore, the proposed code/diagram can really accommodate subsets of high dimensional data.
Some details that caught my attention and that may be useful for future development:
The name of the main function is balance. It overlaps with compositions::balance and ape::balance. While this is not a hard problem, it may add confusion to the function namespace in this kind of analysis. I'd consider changing it at this stage.
With group-based boxplotting, data points are scattered over both (or more) boxplots; it would be clearer if data points were scattered over their own group boxplot.
In the Figure S1 example, some balances show zero values and near-zero variances, probably due to zero over-inflation and the zero-dominance (downstream imputed) balances. This is a kind of white noise in the diagram.
Correlation: While it is outside the scope of this work, for the sake of utility, it would be helpful, when there is a continuous response variable, to plot points within the scatter plot within a y sub-axis for each boxplot in such a way that correlation between each of the balances and the response variable is visible, and also to be able to highlight/select those balances.

Is the description of the software tool technically sound? Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow 1.
One of the issues presented is the representation of the box-plots on a common scale, which is not completely new. It has been previously used in at least two papers, namely Lovell et al. (2013) and Pawlowsky et al. (2015). Nevertheless, the additional features of visualizing the proportion of explained variance by the thickness of the segment covering the range and by the inclusion of the data as dots can be helpful in understanding the role of each balance. At the same time, in certain circumstances, like with high dimensional data observed in a large sample, it will probably still be difficult to visualize the mentioned features. In such a case, it might be interesting to represent the first principal balances, something already considered in the paper.
The most useful features of compositional dendrograms are (a) the visualization of the decomposition of the total variance into contributions of each balance for one or more populations in the sample; (b) the comparison of mean values between populations; (c) identifying groups of parts as they participate in balances defined by the partition. The proposed software solves point (b) efficiently by comparing box-plots in a homogenized scale. Point (c) can be supplemented by including a tool able to enumerate parts in the numerator and denominator of the balance. This is important in high dimensional compositions where the labels in the partition panel are not identifiable (example Figure 2, left panel). Point (a) is more deficiently covered. A partial solution is suggested in section 3) of the paper by colouring some segments in the partition panel according to the value of the variance. However, it seems useful to have a tool allowing alphanumerical output of ordered variances or cuts of the partition tree. For instance, if somebody is looking for linear associations of groups of taxa (Egozcue, Pawlowsky-Glahn and Gloor 2018), detecting balances whose variance is smaller than a certain threshold, it is useful to visualize those balances by colour in the partition panel, but also to get a list of the parts involved in such balances.
We expect that the presented software, modified as suggested, becomes a useful tool for the analysis of high dimensional compositional data.
Minor issues that need to be revised in the paper are the following:
b) The usual representative of the equivalence classes is bounded between 0 and a given constant k. This defines a subset of real space which is not a subspace. Moreover, this set has a Euclidean space structure given by the operations that define the "Aitchison geometry" (Pawlowsky-Glahn and Egozcue, 2001). Thus, it is not adequate to say that compositional data exist in a non-Euclidean space.
c) The description of the additive (alr) and centred (clr) log-ratio transformations as "simple" is misleading. The alr defines coordinates in an oblique basis in the above mentioned Aitchison geometry, while the clr leads to coordinates in a generating system and changes with subcompositions. Thus, results of clr components are not subcompositionally coherent. Therefore, interpretation of results is very difficult, as users tend to interpret results in terms of the component in the numerator only, not taking into account the role of the denominator. Furthermore, in many cases results obtained with the alr are not permutation invariant, something that needs to be checked for each method. One of the most striking cases is e.g. regression, where the equation itself is permutation invariant, but not so the goodness-of-fit criteria.
d) The suggested alternative analysis in terms of simple log-ratios is also not simple at all, as they lead to the most general models, i.e. general log-contrast. The exponents involved in such a log-contrast are in general different for each part of the composition.

Methods
The equation given for computing balances is not correctly described. The term |i_p| does not describe the norm or length of the sub-composition, but the number of parts in the sub-composition.

Use cases
a) The optional line width of the range of box-plots to illustrate the proportion of variance explained by a balance is not really informative in the case of high-dimensional data. In the low-dimensional case we think the dendrogram is more informative, as one can recognise easily if the balance that explains the largest proportion of variance corresponds to the first steps of the partition, involving thus a large number of parts or, on the contrary, involves only a small number of parts. This deficiency can be mitigated by colouring bars in the partition panel. For instance, plotting in red the lines corresponding to a given probability quantile of large variances and in blue for a probability quantile range of small variances.
b) Figure 2 shows two limitations of the proposed visualisation.
i) In the left panel it is clear which taxa are involved in each balance, but not which taxa are in the numerator, and which of those are in the denominator. Perhaps a good alternative would be to use different colours for each group, or to reorder the taxa in such a way that those in the numerator are always on the left hand side and a vertical bar indicates the dividing point.
ii) A numeration of the balances according to the larger (smaller) explained variance would be helpful in recognising rapidly which balance is the most (the least) informative in this sense.
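The colouring rule suggested in point a) could be prototyped along the following lines. This is a hedged base R sketch only: the variances are invented, the 25% and 75% cut-offs are arbitrary assumptions, and nothing here reflects an option that the package is known to provide.

```r
# Sketch: flag balances whose variance falls in the upper or lower tail.
# `balance_var` holds hypothetical per-balance variances.
balance_var <- c(z1 = 0.1, z2 = 4.0, z3 = 1.2, z4 = 0.8, z5 = 2.5)

# Assumed cut-offs: lower and upper quartiles of the variances.
cuts <- quantile(balance_var, probs = c(0.25, 0.75))

# Red for large variances, blue for small variances, grey otherwise.
colour <- ifelse(balance_var >= cuts[2], "red",
                 ifelse(balance_var <= cuts[1], "blue", "grey"))
```

The resulting named character vector could then be mapped onto the bars of the partition panel, and the same ordering of variances would also serve the numeration proposed in point b) ii).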
Summary
a) It is not always true that "log-ratio transformations offer a way to transform the data into an unbounded space where the analyst can apply conventional statistical method". For this to be true you need at least the transformation to be an isometry. For example, the alr is not an isometry, and thus conventional statistical methods should not be applied blindly.
b) Balances are a particular case of isometric log-ratio transformation. Another example is given by general log-contrasts obtained as coordinates in compositional principal component analysis.

Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool? Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article? Partly

Competing Interests: No competing interests were disclosed.
Referee Expertise: Statistics - compositional data analysis
We have read this submission. We believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.