DRIMSeq: a Dirichlet-multinomial framework for multivariate count outcomes in genomics

There are many instances in genomics data analyses where measurements are made on a multivariate response. For example, alternative splicing can lead to multiple expressed isoforms from the same primary transcript. There are situations where differences (e.g. between normal and disease state) in the relative ratio of expressed isoforms may have significant phenotypic consequences or lead to prognostic capabilities. Similarly, knowledge of single nucleotide polymorphisms (SNPs) that affect splicing, so-called splicing quantitative trait loci (sQTL) will help to characterize the effects of genetic variation on gene expression. RNA sequencing (RNA-seq) has provided an attractive toolbox to carefully unravel alternative splicing outcomes and recently, fast and accurate methods for transcript quantification have become available. We propose a statistical framework based on the Dirichlet-multinomial distribution that can discover changes in isoform usage between conditions and SNPs that affect relative expression of transcripts using these quantifications. The Dirichlet-multinomial model naturally accounts for the differential gene expression without losing information about overall gene abundance and by joint modeling of isoform expression, it has the capability to account for their correlated nature. The main challenge in this approach is to get robust estimates of model parameters with limited numbers of replicates. We approach this by sharing information and show that our method improves on existing approaches in terms of standard statistical performance metrics. The framework is applicable to other multivariate scenarios, such as Poly-A-seq or where beta-binomial models have been applied (e.g., differential DNA methylation). Our method is available as a Bioconductor R package called DRIMSeq.

This article is included in the Bioconductor channel. In version 2 of the manuscript, we have reworded sections in the Introduction to clarify the scope of existing methods, with respect to the term 'differential splicing'. We have added additional analyses for differential splicing analyses, to better understand how the null P-value distributions compare across different simulation scenarios and dispersion estimators. For the detected tuQTLs, we added an analysis with respect to enrichment of splicing-related features.

Introduction
With the development of digital high-throughput sequencing technologies, the analysis of count data in genomics has become an important theme motivating the investigation of new, more powerful and robust approaches that handle complex overdispersion patterns while accommodating the typical small numbers of experimental units.
The basic distribution for modeling univariate count responses is the Poisson distribution, which also approximates the binomial distribution. One important limitation of the Poisson distribution is that the mean is equal to the variance, which is not sufficient for modeling, for example, gene expression from RNA sequencing (RNA-seq) data where the variance is higher than the mean due to technical sources and biological variability [1][2][3][4][5] . A natural extension of the Poisson distribution that accounts for overdispersion is the negative-binomial distribution, which has been extensively studied in the small-sample situation and has become an essential tool in genomics applications [1][2][3] .
Analogously, the fundamental distribution for modeling multivariate count data is the multinomial distribution, which models proportions across multiple features. To account for overdispersion, the multinomial can be extended to the Dirichlet-multinomial (DM) distribution 6 . Because of its flexibility, the DM distribution has found applications in forensic genetics 7 , microbiome data analysis 8 , the analysis of single-cell data 9 and for identifying nucleosome positions 10 . Another extension of the multinomial is the Dirichlet negative multinomial distribution 11 , which allows modeling of correlated count data and was applied in the analysis of clinical trial recruitment 12 . Notably, the beta-binomial distribution, such as those used in differential methylation from bisulphite sequencing data 13-15 , represent a special case of the DM.
Genes may express diverse transcript isoforms (mRNA variants) as a consequence of alternative splicing or due to the differences in transcription start sites and polyadenylation sites 16 . Hence, gene expression can be viewed as a multivariate expression of transcripts or exons and such a representation allows the study of not only the overall gene expression, but also the expressed variant composition. Differences in the relative expression of isoforms can have significant phenotypic consequences and aberrations are associated with disease 17,18 . Thus, biologists are interested in using RNA-seq data to discover differences in transcript usage between conditions or to study the specific molecular mechanisms that mediate these changes, for example, alternative splice site usage. In general terms, we collect all these together under the term "differential splicing" (DS) 19 .
Alternative splicing is a process regulated by complex protein-RNA interactions that can be altered by genetic variation. Knowledge of single nucleotide polymorphisms (SNPs) that affect splicing, known as splicing quantitative trait loci (sQTL), can help to characterize this layer of regulation.
In this article, we propose the DM distribution to model relative usage of isoforms. The DM model treats transcript expression as a multivariate response and allows for flexible small-sample estimation of overdispersion. We address the challenge of obtaining robust estimates of the model parameters, especially dispersion, when only a small number of replicates is available by applying an empirical Bayes approach to share information, similar to those proven successful in negative-binomial frameworks 1,20 . In particular, weighted likelihood is used to moderate the gene-wise dispersion toward a common or trended value.
The Dirichlet-multinomial framework, implemented as a Bioconductor R package called DRIMSeq, is applicable to both differential transcript usage (DTU) analysis between conditions and transcript usage quantitative trait loci (tuQTL) analysis. It has been evaluated and compared to the current best methods in extensive simulations and in real RNA-seq data analysis using transcript and exon counts, highlighting that DRIMSeq performs best with transcript counts. Furthermore, the framework can be applied to other types of emerging multivariate genomic data, such as PolyA-seq where the collection of polyadenylated sites for a given gene are measured 21 and to settings where the beta-binomial is already applied (e.g., differential methylation, allele-specific differential gene expression).

Approaches to DS and sQTL analyses
RNA-seq has provided an attractive toolbox to unravel alternative splicing outcomes. There are various methods designed explicitly to detect DS based on samples from different experimental conditions 19,22,23 . Independently, a set of methods was developed for detecting genetic variation associated with changes in splicing (sQTLs). While sQTL detection represents a different application, it is essentially DS between groups defined by genotypes. In the following overview, we do not distinguish between applications but rather between the general concepts used to detect differences in splicing.
DS can be studied in three main ways: as differential transcript usage (DTU) or, in a more local context, as differential exon or exon junction usage (DEU) or as specific splicing events (e.g., exon skipping), and all have their advantages and disadvantages. A survey of the main methods can be found in it remains a complex undertaking to quantify isoform expression from short cDNA fragments since there is a high degree of overlap between transcripts in complex genes; this is a limitation of the technology, not the algorithms. In the case of incomplete transcript annotation, local approaches may be more robust and can detect differential changes due to transcripts that are not in the catalog 23,27 . Nevertheless, DS at the resolution of isoforms is the ultimate goal within the DRIMSeq framework, and with the emergence of longer reads (fragments), transcript quantifications will become more accurate and methods for multivariate transcript abundances will be needed.
Whether the differential analysis is done at the transcript or local level, modeling and testing independently each transcript 46,47 or exon ratio 48 ignores the correlated structure of these quantities (e.g., proportions must sum to 1 Variability among replicates is captured within Cuffdiff via the Jensen-Shannon divergence metric on probability distributions of isoform proportions as a measure of changes in isoform relative abundances between samples. sQTLseekeR tests for the association between genotype and transcript composition, using an approach similar to a multivariate analysis of variance (MANOVA) without assuming any probabilistic distribution and Hellinger distance as a dissimilarity measure between transcript ratios. Very recently, LeafCutter 52 gives intron usage quantifications that can be used for both DS analyses (also using the DM model) and sQTL analyses via a correlation-based approach with FastQTL 50 .
sQTLseekeR, Altrans, LeafCutter and other earlier methods for the sQTL analysis 35,46-48 employ feature ratios to account for the overall gene expression. A potential drawback of this approach is that feature ratios do not take into account whether they are based on high or low expression, while the latter have more uncertainty in them. DRIMSeq naturally builds this in via the multinomial model.

Dirichlet-multinomial model for relative transcript usage
In the application of the DM model to DS, we refer to features of a gene. These features can be transcripts, exons, exonic bins or other multivariate measurable units, which for DS, contain information about isoform usage and can be quantified with (estimated) counts.
Assume that a gene has q features with relative expression defined by a vector of proportions π = (π 1 ,…,π q ), and the feature counts Here, m is treated as an ancillary statistic since it depends on the sequencing depth and gene

Detecting DTU and tuQTLs with the Dirichlet-multinomial model
Within DRIMSeq, the DM method can be used to detect the differential usage of gene features between two or more conditions. For simplicity, suppose that features of a gene are transcripts and the comparison is done between two groups. The aim is to determine whether transcript ratios of a gene are different in these two conditions. Formally, we want to test the hypothesis H 0 : π 1 = π 2 against the alternative H 1 : π 1 ≠ π 2 . For the convenience of parameter estimation, we decide to use the DM parameterization with precision parameter γ + , which can take any non-negative value, instead of dispersion parameter θ, which is bounded to values between 0 and 1. Because our goal is to compare the proportions from two groups, γ + is a nuisance parameter that gets estimated in the first step (see the following Section). Let l(π 1 , π 2 , γ + ) be the joint log-likelihood function. Assuming γ + =γ ^+ , the maximum likelihood (ML) estimates of π 1 , π 2 are the solution of ( π π π π π π π π (5) which asymptotically follows the chi-squared distribution 2 -1 q χ with q − 1 degrees of freedom. In comparisons across c groups, the number of degrees of freedom is (c − 1) × (q − 1). After all genes are tested, p-values can be adjusted for multiple comparisons with the Benjamini-Hochberg method.
In a DTU analysis, groups are defined by the design of an experiment and are the same for each gene. In tuQTL analyses, the aim is to find nearby (bi-allelic) SNPs associated with transcript usage of a gene. Model fitting and testing is performed for each gene-SNP pair, and grouping of samples is defined by the genotype, typically translated into the number of minor alleles (0, 1 or 2). Thus, tuQTL analyses are similar to DTU analyses with the difference that multiple models are fitted and tested for each gene. Additional challenges to be handled in tuQTL analyses include a large number of tests per gene with highly variable allele frequencies (models) and linkage disequilibrium, which can be accounted for in the multiple testing corrections. As in other sQTL studies 35,49,50 , we apply a permutation approach to empirically assess the null distribution of associations and use it for the adjustment of nominal p-values (see Supplementary Note 2 in Supplementary File). For computational efficiency, SNPs within a given gene that exhibit the same genotypes are grouped into blocks. In this way, blocks define unique models to be fit, reducing computation and the degree of multiple testing correction.

Dispersion estimation with adjusted profile likelihood and moderation
Accurate parameter estimation is a challenge when only a small number of replicates is available. Following the edgeR strategy 1,2,53 , we propose multiple approaches for dispersion estimation, all based on the maximization and adjustment of the profile likelihood, since standard maximum likelihood (ML) is known to produce biased estimates as it tends to underestimate variance parameters by not expression, but not on the model parameters. The simplest way to model feature counts is with the multinomial distribution with probability function defined as: where the mean and the covariance matrix of Y are (Y) = mπ and (Y) = diag(π) -ππ T , respectively.
To account for overdispersion due to true biological variation between experimental units as well as technical variation, such as library preparation and errors in transcript quantification, we assume the feature proportions, Π, follow the (conjugate) Dirichlet distribution, with density function: where γ j , j = 1, …, q are the Dirichlet parameters and + = γ = γ ∑ 1 .
We can derive the marginal distribution of Y by multiplying densities (1) and (2) and integrating over π. Then, feature counts Y follow the DM distribution 6 with probability function defined as: is an additional factor when representing the Dirichlet-multinomial covariance to the ordinary multinomial covariance. c depends on concentration parameter γ + which controls the degree of overdispersion and is inversely proportional to variance of Y.
allowing for the fact that other unknown parameters are estimated from the same data 54,55 .
In the DM model parameterization of our choice, we are interested in estimating the precision (concentration) parameter, γ + (inverse proportional to dispersion θ). Hence, at this stage, proportions π 1 and π 2 can be considered nuisance parameters and the profile log-likelihood (PL) for γ + can be constructed by maximizing the log-likelihood function with respect to proportions π 1 and π 2 for fixed γ + : The profile likelihood is then treated as an ordinary likelihood function for estimation and inference about parameters of interest. Unfortunately, with large numbers of nuisance parameters, this approach can produce inefficient or even inconsistent estimates 54,55 . To correct for that, one can apply an adjustment proposed by Cox and Reid 56 and obtain an adjusted profile likelihood (APL): where det denotes determinant and I is the observed information matrix for π 1 and π 2 . The interpretation of the correction term in APL is that it penalizes values of γ + for which the information about π 1 and π 2 is relatively large. When data consists of many samples, one can use gene-wise dispersion estimates, i.e., the dispersion is estimated for each gene g = 1,…,G separately: These estimates become more unstable as the sample size decreases. At the other extreme, one can assume a common dispersion for all genes and use all genes to estimate it: Common dispersion estimates are more stable but the assumption of a single dispersion for all genes is rather strong, given that some genes are under tighter regulation than others 57,58 . Thus, moderated dispersion is a trade-off between gene-wise and common dispersion and estimates are calculated with an empirical Bayes approach, which uses a weighted combination of the common and individual likelihood: .
If a dispersion-mean trend is present (see Figure S16, Figure S17, Figure S28 and Figure S29 in Supplementary File), as commonly observed in gene-level differential expression analyses 1,3 , one can apply shrinkage towards this trend instead of to the common dispersion: , where C is a set of genes that have similar gene expression as gene g and W is a weight defining the strength of moderation (see Supplementary Note 1 for further details).

Estimation and inference: simulations from the Dirichlet-multinomial model
We first investigated the performance of the DM model and the approach for parameter estimation and inference in the case where only few replicates are available. We performed simulations that correspond to a two-group comparison with no DTU (i.e. null model) where feature counts were generated from the DM distribution with identical parameters in both groups. Simulations were repeated 50 times for 1000 genes. In these simulations, we can vary the overall expression (m), number of features (q), proportions (prop) and sample size in one condition (n). Proportions follow a uniform or decaying distribution or are estimated based on kallisto transcripts or HTSeq exon counts from Kim et al. and Brooks et al. data (more details on these datasets below). In the first case, all genes have the same (common) dispersion, and in the second one, each gene has different (genewise) dispersion. Simulations for evaluating the dispersion moderation are intended to better resemble a real dataset. For these instances (repeated 25 times for 5000 genes), genes have expression, dispersion and proportions that were estimated from the real data. See Supplementary Note 3 for the additional details. Figure 1A and Figure S1 confirm that using the Cox-Reid adjustment (CR) improves the estimation (in terms of median absolute error and extreme error values) of the concentration parameter γ + in comparison to raw profile likelihood (PL) estimates. Additionally, the median error of concentration estimates for Cox-Reid APL is always lower than for PL or maximum likelihood (ML) used in the dirmult package 7 ( Figure 1C, Figure S2). This translates directly into the inference performance where the CR approach leads to lower false positive (FP) rate than other approaches ( Figure 1B, Figure S3).
Accurate estimates of dispersion do not always lead to expected control of FP rate. Notably, using the true concentration parameters in genes with many features (with decaying proportions) results in higher than expected nominal FP rates ( Figure 1B, Figure S3 and Figure S6A). Meanwhile, for genes with uniform proportions, even with many features, the FP rate for true dispersion is controlled ( Figure S3 and Figure S6B). Also, the Cox-Reid adjustment tends to underestimate the concentration (overestimate dispersion) for genes with many features and decaying proportions, especially for very small sample size ( Figure 1C, Figure S2, Figure S5A, Figure S5E), which leads to accurate FP rate control not achieved even with the true dispersion ( Figure S6A).
As expected, common dispersion estimation is effective when all genes indeed have the same dispersion, though this cannot be generally assumed in most real RNA-seq datasets (see results of simulations in the following section). In contrast, pure gene-wise estimates of dispersion lead to relatively high estimation error in small sample sizes ( Figure 1A, Figure S1 and Figure S8). Thus, sharing information about concentration (dispersion) between genes by moderating the gene-wise APL is applied. This improves concentration estimation in terms of median error ( Figure 1C and Figure S8) and by shrinking extremely large values (on the boundary of the parameter space, see Figure S7) toward common or trended concentration. Therefore, moderated gene-wise estimates lead to better control of the nominal FP rate ( Figure 1B and Figure S10).
In these simulations, the overall best performance of the DM model is achieved when dispersion parameters are estimated with the Cox-Reid APL and the dispersion moderation is applied. This strategy leads to p-value distributions that in most of the cases are closer to the uniform distribution ( Figure 1D, Figure S4 and Figure S11).

Comparison on simulations that mimic real RNA-seq data
Next, we use the simulated data from Soneson et al. 23 , where RNA-seq reads were generated such that 1000 genes had isoform switches between two conditions of the two most abundant transcripts. For each condition three replicates were simulated resulting in 3 versus 3 comparisons. Altogether, we summarize results for three scenarios: i) Drosophila melanogaster with no differential gene expression; ii) Homo sapiens without differential gene expression; iii) Homo sapiens with differential gene expression.
The aim of these analyses is to compare the performance of DRIMSeq against DEXSeq, which emerged among the top performing methods for detection of DTU from RNA-seq data 23 . For DRIMSeq, we consider different dispersion estimates: common, gene-wise with no moderation and with moderation-to-common and to-trended dispersion. We use the exonic bin counts provided by HTSeq (same input to the DEXSeq pipeline), and transcript counts obtained with kallisto. Additionally, we use HTSeq and kallisto counts that are re-estimated after the removal of lowly expressed transcripts (less than 5% in all samples) from the gene annotation (pre-filtering) as proposed by Soneson et al. 23 and kallisto filtered counts that exclude the lowly expressed transcripts (also less than 5% in all samples). DRIMSeq returns a p-value per gene. To make results comparable, we used the module within DEXSeq that summarizes exon-level p-values to a gene-level adjusted p-value.
As expected, common dispersion estimates lead to worse performance (lower power and higher FDR) compared to gene-wise dispersions. DRIMSeq achieves the best performance with moderated gene-wise dispersion estimates, while the difference in performance between moderating to common or to trended dispersion is quite small, with moderated-to-trend dispersion estimates being slightly more conservative (Figure 2 and Figure S15).
As noted by Soneson et al. 23 , detecting DTU in human is harder than in fruit fly due to the more complex transcriptome of the first one; all methods have much smaller false discovery rate (FDR). Nevertheless, none of the methods manages to control the FDR at a given threshold in either of the simulations.
Annotation pre-filtering, suggested as a solution to mitigate high FDRs 23 , affects DEXSeq and DRIMSeq in a different way. For DEXSeq, it strongly reduces the FDR. For DRIMSeq, it increases power without a strong reduction of FDR. Moreover, the results for kallisto filtered and pre-filtered are almost identical ( Figure S15 and Figure S24), which means that the re-estimation step based on the reduced annotation is not necessary for kallisto when used with DRIMSeq or DEXSeq. Additionally, we have considered how other filtering approaches affect DTU detection.
From Figure S24, we can see that DS analysis based on transcript counts are more robust to different variations of filtering and indeed some filtering improves the inference. For exonic counts, filtering should be less stringent and the pre-filtering approach is the best performing strategy.
DRIMSeq performs well when coupled with transcript counts from kallisto. In the case when no filtering is applied to the data, it outperforms DEXSeq. When transcript counts are pre-filtered, both methods have very similar performance ( Figure S15). For both differential engines, the performance decreases substantially with increasing number of transcripts per gene, with DRIMSeq having slightly more power when genes have only a few transcripts ( Figure S17). DRIMSeq has poor performance for the exonic counts in the human simulation, where achieved FDRs of more than 50% are observed for an expected 5%; consequently, we recommend the use of DRIMSeq on transcript counts only. On the other hand, the concordance of DRIMSeq and DEXSeq top-ranked genes is quite high and similar even for exonic counts ( Figure S16).
The p-value distributions highlight a better fit of the DM model to transcript counts compared to exonic counts (it is more uniform with a sharp peak close to zero). Similarly, dispersion estimation gives better results for transcript counts ( Figure S19 and Figure S20). In particular, for exonic counts, a large number of genes have concentration parameter estimates at the boundary of the parameter space, unlike the situation for transcript counts ( Figure S19 and Figure S20).

DS analyses on real datasets
To compare the methods further, we consider two public RNA-seq data sets. The first is the pasilla dataset 59 (Brooks et al.). The aim was to identify genes regulated by pasilla, the Drosophila ortholog of mammalian splicing factors NOVA1 and NOVA2. In this experiment, libraries were prepared from seven biologically independent samples: four control samples and three samples in which pasilla was knocked down. Libraries were sequenced using a mixture of single-end and paired-end reads as well as different read lengths. The second data set is from matched human lung normal and adenocarcinoma samples from six Korean female nonsmoking patients 60 , using paired-end reads (Kim et al.).
Both datasets have a more complex design than those used in the simulations; in addition to the grouping variable of interest, there are additional covariates to adjust for (e.g., library layout for the fruit fly data, patient identifier for the paired human study). In order to account for such effects, one should rather use a regression approach, which currently is not supported by DRIMSeq, but can be applied within DEXSeq's GLM framework. To make the comparison fair, we fit multiple models. For the pasilla dataset, we compare four control samples versus three pasilla knock-down samples without taking into account the library layout (model full) as well as compare only the paired-end samples, which removes the covariate. To not diminish DEXSeq for its ability to fit more complex models, we run it using a model that does the four control versus three knock-down comparison with library layout as an additional covariate (model full 2). For the adenocarcinoma data, we do a two-group comparison of six normal versus six cancer samples (model full) and for DEXSeq, we fit an extra model that takes into account patient effects (model full 2). Additionally, we do so-called "mock" analyses where samples from the same condition are compared (model null), and the expectation is to detect no DS since it is a within-condition comparison (see Supplementary Note 5 for the exact definition of these null models).
In the full comparisons with transcript counts, DRIMSeq calls similar or fewer DS genes than DEXSeq, and a majority of them are contained within the DEXSeq calls ( Figure S26, Figure S27) showing high concordance between DRIMSeq and DEXSeq and slightly more conservative nature of DRIMSeq. Accounting for covariates in DEXSeq (model full 2) or performing the analysis on a subgroup without covariates (model full paired) results in more DS genes detected ( Figure S28, Figure S29 and Figure S30).
In the ''mock" analyses, as expected, both methods detect considerably fewer DS genes, except in two cases. First, for the pasilla data (model null 3), where the two versus two control samples were from single-end library in one group and from paired-end library in the second group, leading to a comparison between batches in which both of the methods found more DS genes than in the comparison of control versus knock-down showing that the "batch" effect is very strong. Second, in the adenocarcinoma data (model null normal 1), where the two groups of individuals (each consisting of three women) happened to be very distinct ( Figure S25). Therefore, we do not include these two cases when referring to the null models.
Overall, in the full comparisons, there are more DS genes detected based on differential transcript usage than differential exon usage ( Figure S26). For DEXSeq, this is also the case in the null comparisons, which shows that DEXSeq works better with exonic counts than with transcript counts. DRIMSeq, on the other hand, has better performance on transcript counts, for which it calls less DS genes in the null analysis than with exon counts. In particular, the p-value distributions under the null indicate that DM fits better to transcript counts than exon counts ( Figure S14, Figure S31 and Figure S32).
Method comparisons based on real data are very challenging as the truth is simply not known. In this sense the pasilla data is very precious, as the authors of this study have validated alternative usage of exons in 16 genes using RT-PCR. Of course, these validations represent an incomplete truth, and ideally, large-scale independent validation would be needed to comprehensively compare the DTU detection methods. In Figure 3, Figure S33, Figure S34 and  Figure S35 again it is shown that DRIMSeq is slightly more conservative than DEXSeq. DRIMSeq performs poorly on exon-level but returns strong performance on transcript-level quantification (e.g., kallisto) and even outperforms DEXSeq when the sample size is very small (model full paired).

tuQTL analyses
To demonstrate the application of DRIMSeq to tuQTL analysis, we use the data from the GEUVADIS project 46 where 465 RNA-seq samples from lymphoblastoid cell lines were sequenced, 422 of which were sequenced in the 1000 Genomes Project Phase 1. Here, we present the analysis of 91 samples corresponding to the CEU population and 89 samples from the YRI population. Expected transcript counts (obtained with Flux Capacitor) and genotype data were downloaded from the GEUVADIS project website. We choose to compare the performance of DRIMSeq with sQTLseekeR, because it is a very recent tool that performs well 35 , can be directly applied to transcript count data and models transcript usage as a multivariate outcome.
For both of the methods, we investigate only the bi-allelic SNPs with a minor allele present in at least five samples (minor allele frequency approximately equal to 5%) and at least two alleles present in a population. Given a gene, we keep the SNPs that are located within 5 Kb upstream or downstream of the gene. We use the same pre-filtered counts in DRIMSeq and sQTLseekeR to have the same baseline for the comparison of the statistical engines offered by these packages. We keep the protein coding genes that have at least 10 counts in 70 or more samples and at least two transcripts left after the transcript filtering, which keeps the one that has at least 10 counts and proportion of at least 5% in 5 or more samples. The numbers of tested and associated genes and tuQTLs are indicated in Figure 4, Figure S38 and Figure S39.
In Figure 4A and Figure S40 we can see that the concordance between DRIMSeq and sQTLseekeR is quite high and reaches 75%. Nevertheless, there is considerable difference between the number and type of genes that are uniquely identified by each method. sQTLseekeR finds more genes with alternative splicing associated to genetic variation ( Figure S38 and Figure S39), but these genes have fewer transcripts expressed and lower overall expression in comparison to genes detected by DRIMSeq ( Figure 4C, Figure 4D, Figure S40C and Figure S40D). To further investigate characteristics of detected tuQTLs, we measured enrichment of splicingrelated features as used in a previous comparison 35 . This includes the location of tuQTLs within exons, within splice sites, in the surrounding of GWAS SNPs and distance to the closest exon. tuQTLs detected by DRIMSeq show higher enrichment for all splicing related features (Table 1 and Figure 4B), than sQTLseekeR tuQTLs, suggesting that by accounting for the overall gene expression, one can detect more meaningful associations.

Discussion
We have created a statistical framework called DRIMSeq based on the Dirichlet-multinomial distribution to model alternative usage of transcript isoforms from RNA-seq data. We have shown that this framework can be used for detecting differential isoform usage between experimental conditions as well as for identifying tuQTLs. In principle, the framework is suitable for differential analysis of any type of multinomial-like responses. From our simulations and real data analyses towards DS and sQTL analyses, DRIMSeq seems better suited to model transcript counts rather than exonic counts.
Overall, there are many tradeoffs to be made in DS analyses. For example, deriving transcript abundances from RNA-seq data is more difficult (e.g., complicated overlapping genes at medium to low expression levels) than directly counting exon inclusion levels of specific events. On the other hand, local splicing events may be not able to capture biologically interesting splice changes (e.g., switching between two different transcripts) but have ultimately more ability to detect DS in case when the transcript catalog is incomplete. Despite these tradeoffs and given the results observed here, DRIMSeq finds its place as a method to make downstream calculations on transcript quantifications. With emerging technologies that sequence longer DNA fragments (either truly or synthetically), we may see in the near future more direct counting of full-length transcripts, making transcript-level quantification more robust and accurate. Even with current standard RNA-seq data, ultrafast and lightweight methods make transcript counting more accessible and users will want to make comparative analyses directly from these estimates.
In principle, existing DS methods that allow multiple group comparisons could be adapted to the sQTL framework and vice versa; DRIMSeq is one of few tools that bridge these two applications. In particular, parameter estimation with DRIMSeq is suited for a situation where only a few replicates are available per group (common in DS analysis) as well as analyses over larger samples sizes (typical in sQTL analysis). For small sample sizes, accurate dispersion estimation is especially challenging. Thus, we incorporate estimation techniques analogous to those used in negative binomial frameworks, such as Cox-Reid APL; perhaps not surprisingly, raw profile likelihood or standard maximum likelihood approaches do not perform as well in our tests of estimation performance. In addition, as with many successful genomics modeling frameworks, sharing information across genes leads to more stable and accurate estimation and therefore better inference (e.g., better control of nominal FP rates).
In comparison to other available methods, DRIMSeq seems to be more conservative than both DEXSeq (using transcript counts) and sQTLseekeR, identifying fewer DTU genes and tuQTLs, respectively. On the other hand, DEXSeq is known to be somewhat liberal 23 . Moreover, the sQTL associations detected by DRIMSeq have more enrichment in splicing-related features than sQTLseekeR tuQTLs, which could be due to the fact that DRIMSeq accounts for the higher uncertainty of lowly expressed genes by using transcript counts instead of transcript ratios.
Our developed DM framework is general enough that it can be applied to other genomic data with multivariate count outcomes. For example, PolyA-seq data quantifies the usage of multiple RNA polyadenylation sites. During polyadenylation, poly(A) tails can be added to different sites and thus more than one transcript can be produced from a single gene (alternative polyadenylation); comparisons between groups of replicates can be conducted with DRIMSeq. As mentioned, the DM distribution is a multivariate generalization of the beta-binomial distribution, as the binomial and beta distributions are univariate versions of the multinomial and Dirichlet distributions, respectively. Although untested here, the DRIMSeq framework could be applied to analyses where the beta-binomial distribution are used with the advantage of naturally accommodating small-sample datasets. Interesting betabinomial-based analyses include differential methylation using bisulphite sequencing data, where counts of methylated and unmethylated cytosines (a bivariate outcome) at specific genomic loci are compared, or allele-specific gene expression, where the expression of two alleles (again, a bivariate outcome) are compared across experimental groups.
One particularly important future enhancement is a regression framework, which would allow direct analysis of more complex experimental designs. For example, covariates such as batch, sample pairing or other factors could be adjusted for in the model. In the tuQTL analysis, it would allow studying samples from the pooled populations, with the subpopulation as a covariate, allowing larger sample sizes and increased power to detect interesting changes. Another potential limitation is that DRIMSeq treats transcript estimates as fixed, even though they have different uncertainty, depending on the read coverage and complexity of the set of transcripts within a gene. Although untested here, propagation of this uncertainty could be achieved by incorporating observational weights that are inversely proportional to estimated uncertainties or, in case of fast quantification methods like kallisto, by making effective use of bootstrap samples. At this stage, there is no consensus on how these approaches will perform and ultimately may require considerable additional computation.

Software availability
The Dirichlet-multinomial framework described in this paper is implemented within an R package called DRIMSeq. In addition to the user friendly workflow for the DTU and tuQTL analyses, it provides plotting functions that generate diagnostic figures such as the dispersion versus mean gene expression figures and histograms of p-values. User can also generate figures of the observed proportions and the DM estimated ratios for the genes of interest to visually investigate their individual splicing patterns.
The release version of DRIMSeq is available on Bioconductor http://bioconductor.org/packages/DRIMSeq, and the latest development version can be found on GitHub https://github.com/markrobinsonuzh/DRIMSeq.

Data availability
Data for simulations that mimic real RNA-seq was obtained from Soneson et al. 23 , where all the details on data generation and accessibility are available.
Differential splicing analyses were performed on the publicly available pasilla dataset, which was downloaded from the NCBI's Gene Expression Omnibus (GEO) under the accession number GSE18508, and adenocarcinoma dataset under the accession number GSE37764.
Data for the tuQTL analyses was downloaded from the GEUVADIS project website.
All the details about data availability and preprocessing are described in the Supplementary Materials.
Archived source code as at the time of publication DRIMSeq analyses for this paper were done with version 0.3.3 available on Zenodo https://zenodo.org/record/53084 61 and Bioconductor release 3.2. Source code used for the analyses in this paper is available on Zenodo https://zenodo.org/record/167305 62 .
Author contributions MN drafted the manuscript, designed the analyses, analyzed the data and implemented the DRIMSeq R package. MDR drafted the manuscript and designed the overall study. All authors read and approved the final manuscript and have agreed to the content.

Competing interests
No competing interests were disclosed. The authors say that it is not worth made a comparison with Cufdiff because in the study by , where both authors of DRIMSeq were involved, Cudiff was very conservative in detecting al. differential isoform/transcript usage (DTU). In that paper the authors assess DTU by switching the two most abundant isoforms and show that Cufdiff has a low true positive rate (TPR) at small magnitudes of the difference in relative abundance between the two most abundant isoforms per gene. However, in Supplementary Figure 10 of that paper, the authors show that at larger magnitudes of that difference, the TPR of Cufdiff improves substantially while correctly controlling the false discovery rate (FDR).

Grant information
In this paper the authors assess DTU following the same strategy of switching the two most abundant isoforms and I think it would be again very interesting to see how Cufdiff and DRIMSeq compare at different magnitudes of the change in isoform usage. The authors also argue that Frazee (2014) et al. find that Cuffdiff is very conservative. However, as far as I understand that paper, Frazee and co-workers are not evaluating DTU but differential transcript expression (DTE), and therefore, in my view, the experiments conducted on that paper do not warrant the conclusion that Cufdiff is overly conservative for DTU.  3.

4.
inclusion across conditions, which is a consequence of differential isoform usage, while transcript-quantification values and DRIMSeq can be used to directly investigate differential isoform usage.
A potentially interesting outcome of this comparison in the paper could be some sort of guidelines about when is it more sensible to investigate differential exon inclusion or differential isoform usage, depending on factors such as the biological question at hand, sequencing depth or number of biological replicates. However, this is apparently beyond the scope of this paper and the experimental results are in principle geared towards convincing the reader that DRIMSeq improves on existing approaches to discover changes in isoform usage, as suggested in the abstract. In my view, the experimental results do not address this question and I would suggest the authors to compare DRIMSeq with methods that also work with transcript-quantification values and assess differential isoform usage such as, for instance, Cuffdiff or sleuth .
The experimental results on searching for sQTLs compare favourably DRIMSeq with an existing tool for that purpose, sQTLseekR . Evaluating performance in this context is challenging and the idea of assessing enrichment with respect to splicing-related features is a good one. However, the (two) presented features in Table 1 could be made more precise. It is unclear that a SNP close to a GWAS hit should be necessarily related to splicing and it is also unclear why one should expect splicing-related enrichment more than a few hundred nucleotides away from the intervening exon. While it is technically interesting to see a method being used to address two completely different research questions, in my view, mixing both types of analyses makes the article less focused. I would argue that both questions deserve separate papers, and that would allow the authors to investigate in depth critical aspects of both types of analysis that are currently not addressed in the current article.
In summary, this article provides an interesting new methodology for the analysis of differential isoform usage from RNA-seq data, it is well-written and the implemented software runs smoothly and is well documented. However, in my view, the current experimental results of the article are not that informative for the reader to learn what advantages DRIMSeq provides over other tools for differential isoform usage analysis, and to decide whether he/she should be doing a differential isoform usage, or a differential exon inclusion analysis, if this were a goal of the comparison with DEXSeq.
Minor comments: I would replace the term "edgeR ideology" in page 5 by "edgeR strategy". In page 9 it is described that the distributions of raw p-values shown in Supplementary Figures S28 and S29 fit "better" when derived from transcript quantification values than from exonic-bin count values, but in fact in both cases the distributions are non-uniform for p-values distributed under the null hypothesis. This can be easily shown with the data from vignette of the DRIMSeq package when skipping the step that reduces the transcript set to analyze to speed up the building time of the vignette. This is not openly discussed in the paper but I would argue that it is quite critical to know under what technical assumptions the proposed hypothesis test leads to uniform raw p-values under the null, as this has a direct consequence on the control of the probability of the type-I error.
The sQTL analysis described in pages 9, 10 and 11 uses transcript-quantification values from FluxCapacitor. If the entire first part of the paper shows the performance metrics of DRIMSeq using kallisto, in my view, it would make more sense to use kallisto for this analysis as well.
With regard to the implementation in the R/Bioconductor software package DRIMSeq, the authors have implemented a specialized S4 object class called 'dmDSdata' to act as a container for counts and information about samples. Since the package forms part of the Bioconductor project, I think it would better for both, the end-user and the developer authors, that the package re-uses the 3 4 5 F1000Research to exon counts and to transcript quantifications. However, from our comparisons, which were performed at the gene-level, the performance of and is different on these DEXSeq DRIMSeq different types of counts.
performs better on exon counts and on transcript DEXSeq DRIMSeq counts.
We have not used [3] in our comparisons here because in the study by Soneson [1], it Cuffdiff et al. performed poorly compared to . In particular, was very conservative having low DEXSeq Cuffdiff false discovery rate (FDR) at the cost of very low power for detecting DTU. The conservative nature of for differential transcript expression, was also pointed out by Frazee [4]. We Cuffdiff et al. decided to compare only to the top performing method, . The other tool DRIMSeq DEXSeq proposed by the Reviewer, sleuth [5], is meant for differential transcript expression analyses, not DTU.
The scope of this paper was not to justify exon or transcript level analysis, for that one could refer to the comparison paper by Hooper [6], but to propose a methodologically-sound tool for differential isoform usage analysis or detect transcript usage QTLs based on transcript quantifications. We propose to use since it outperformed in this type of analysis DRIMSeq DEXSeq and there are no other tools for differential transcript usage that were intended for transcript level quantifications from the latest generation of fast quantification tools, such as [2] or kallisto Salmon [7].
Importantly, returns p-values per feature (exon or transcript), which can be also DEXSeq summarized to the gene level.
performs gene-level tests and returns p-values per gene DRIMSeq only. When the interest is in detecting specific exons or isoforms that change, one should use because currently does not provide any post hoc analysis (although in many DEXSeq DRIMSeq cases, the relevant information can be deduced from looking at the relative transcript expression from 's plots). We have not investigated the differences in performance due to DRIMSeq sequencing depth or number of biological replicates, but we believe that the requirements would be basically the same in these terms for both of the methods. What matters is the completeness of annotation. Detecting DTU based on exon counts is generally more robust than that based on transcript quantifications when the annotation is incomplete, which was investigated in detail by Soneson [1]. et al.
To compare the performance of and , we use the splicing-related features DRIMSeq sQTLseekeR that were also used in the paper [8] to compare against other methods. sQTLseekeR sQTLseekeR The Reviewer suggested to consider other splicing-related features, such as exonic splicing enhancers (ESEs), exonic splicing silencers (ESSs) and splice sites. We have added the frequency of tuQTL overlapping with the splice sites to Table 1. However, we have not performed analyses on ESEs and ESSs since Lalonde [9] concluded from their study that "ESE et al. predictions themselves are a poor indicator of the effect of SNPs on splicing patterns".
By addressing differential splicing and sQTLs in one paper, our aim was to show that methods used for these analyses are based on statistical approaches that in the end tackle ultimately the same question: differential splicing between conditions. Both analyses employ the same methods for gene feature quantification and potentially one main differential engine could be used with slight analysis-specific adjustments, such as information sharing between genes for small sample size data or using genotypes as grouping factor, which is done in . We believe we have DRIMSeq addressed in sufficient depth aspects of both of these analyses providing comparisons on simulated and real data.
Addressing the minor comments:

F1000Research
Addressing the minor comments: We have replaced the term "edgeR ideology" in page 5 by "edgeR strategy".
As suggested, we have investigated in more depth, based on simulations from the DM model, the p-value distributions under the null hypothesis of no differential DRIMSeq transcript usage (Figures 1, S4, S6, S11, S14). Overall, using the Cox-Reid adjusted profile likelihood and the dispersion moderation leads to p-value distributions that in most cases are closer to the uniform distribution ( Figures 1D, S4 and S11). The better fit of the DM model to transcript counts in comparison to exon counts can be seen in Figure S14, where the p-value distributions are more uniform for simulations that mimic counts than for kallisto simulations that mimic counts. HTseq Yes, using counts would be more consistent with the rest of our manuscript. kallisto Nevertheless, we decided to to use the Flux Capacitor counts because they were already available on the GEUVADIS project website and have been used extensively in other projects, for example, in the sQTLseekeR paper. Moreover, we think that using other counts should not affect the comparison between DRIMSeq and sQTLseekeR.
We had already considered the SummarizedExperiment class while developing the DRIMSeq package. However, it does not provide features and functionality that we need for storing the count data and DRIMSeq results. In particular, the dimensions of Assays in SummarizedExperiment must be the same. That is not the case for us for two reasons. Firstly, each gene has multiple transcripts and, for example, the table with proportion estimates per transcript is larger than a table with dispersion estimates which are available per gene. Second, in the QTL analysis, table with transcript counts has different dimensions than table with genotypes. Additionally, we use matrices instead of data frames to store our data because the former occupies less space. Specifically, we have created a class called MatrixList, which is adjusted to store data where each gene has multiple features quantified and allows a quick access to these counts in per gene basis. We have not implemented the dim() method on dmDSdata or dmSQTLdata because we want to keep consistency between them and, for example, dmSQTLdata contains transcript counts and genotypes which have different dimensions. Thus we decided to make the dim() methods available for the counts and genotypes slots in these classes but not for the classes themselves.
We have addressed all the other minor comments which include: defining the abbreviations of profile likelihood (PL) and adjusted profile likelihood (APL), adding the sample size information about the simulations from Soneson [1], et al. in order to remove the misleading suggestion that DEXSeq fits GLMs only in the complex designs, we have changed the names of the models used in real data analysis from "model full glm" to "model full 2" and paraphrased the corresponding manuscript sections, we have included results for the panels with missing data in the Supplementary Figures S15, S16 and S24, we have included the citation to SGSeq [2] -the Bioconductor package for analyzing splice events from RNA-seq data, in the Supplementary Materials, we have prepared a section explaining abbreviations used in the subsequent Supplementary Figures.