Interpretation of mRNA splicing mutations in genetic disease: review of the literature and guidelines for information-theoretical analysis

The interpretation of genomic variants has become one of the paramount challenges in the post-genome sequencing era. In this review we summarize nearly 20 years of research on the applications of information theory (IT) to interpret coding and non-coding mutations that alter mRNA splicing in rare and common diseases. We compile and summarize the spectrum of published variants analyzed by IT, to provide a broad perspective of the distribution of deleterious natural and cryptic splice site variants detected, as well as those affecting splicing regulatory sequences. Results for natural splice site mutations can be interrogated dynamically with Splicing Mutation Calculator, a companion software program that computes changes in information content for any splice site substitution, linked to corresponding publications containing these mutations. The accuracy of IT-based analysis was assessed in the context of experimentally validated mutations. Because splice site information quantifies binding affinity, IT-based analyses can discern the differences between variants that account for the observed reduced (leaky) versus abolished mRNA splicing. We extend this principle by comparing predicted mutations in natural, cryptic, and regulatory splice sites with observed deleterious phenotypic and benign effects. Our analysis of 1727 variants revealed a number of general principles useful for ensuring portability of these analyses and accurate input and interpretation of mutations. We offer guidelines for optimal use of IT software for interpretation of mRNA splicing mutations.


Introduction
Pre-mRNA splicing is a necessary step in the production of a functional protein product. It consists of the recognition of intron/exon boundaries, and the subsequent excision of the introns. It is important to distinguish between alternate splicing isoforms and mutant splice forms. The former consists of using different combinations of splice sites for the same gene. It is estimated to occur in over 60% of human genes, some of which will have multiple alternate isoforms 1,2 . For example, NF1 is reported to produce 46 splice variants 3 . The cell regulates this naturally occurring process through the availability of tissue-specific splice factors. Alternative splicing is not generated by changes in the unspliced RNA sequence, whereas mutations that produce non-constitutive splice forms are the result of dysregulation of natural splice site recognition. Mutations can have various consequences to RNA processing, such as exon skipping, cryptic splicing, intron inclusion, leaky splicing, or less frequently, introduction of pseudo-exons into the processed mRNA. A broad range of molecular phenotypes are possible depending on the type and severity of the mutation, making it imperative to understand the consequences of splicing mutations. For the purposes of this review, we consider sequence changes in genes that affect transcript structure or abundance to be mutations, regardless of their allele frequencies. Although spliceosomal recognition and RNA binding factors are operative in mutation-derived and normal alternative mRNA splicing events, this review is focused on aberrant sequence changes that alter constitutive splicing, and often result in clinically abnormal phenotypes.
The process of U1/U2-based mRNA splicing involves the recognition of a number of key sequence components 4,5 , with exons defined by both intronic and exonic features 4,6 . The exonic and intronic sequences flanking the 5' end of an intron is termed the donor site and the 3' end, the acceptor site. In typical mRNA splicing, the natural donor and acceptor splice sites span intervals of 10 and 28 bases in length, respectively. It is a common misconception that these sequences (especially the dinucleotides immediately intronic to the exon) are invariant. Although highly conserved, these sequences vary at different splice junctions within a gene as well as between genes. The particular combination of nucleotides at each position within the same splice site determines its overall strength, which dictates the likelihood of recognition by the U1 and U2 spliceosomes.
In addition, binding sites for splicing regulatory elements have been shown to reside over a range of distances from the corresponding natural splice sites 7 ; the impact of these sites appears to be related to their binding affinities to the cognate RNA binding proteins and to their distance from the proximate intron/exon boundary 8 . Recognition sites for these regulatory proteins can reside either within introns or exons. Those within exons are commonly referred to as exonic splice enhancers or silencers (ESE or ESS, respectively), whereas the corresponding designations for intronic elements are ISE or ISS. Sequence variants affecting these protein-binding sites (or mutations in the binding proteins themselves) have been documented as contributing to aberrant splicing and pathogenic phenotypes. We focus on variants occurring in cis with target genes, as opposed to those in the splicing complex (in trans), leading to abnormal splicing. The efficiency and specificity of splicing depends on the combination of natural splice site strengths and the binding of splicing regulatory proteins that orchestrate exon recognition 9 .
Mutations that affect pre-mRNA splicing account for at least 15% of disease-causing mutations 10 with up to 50% of all mutations described in some genes 11,12 . Interpreting the effects that these variants have on splicing is not straightforward because natural and regulatory splice sites exhibit considerable sequence variation. Furthermore, performing in vitro experiments to verify the consequences of each variant is costly and time consuming, and may not be practical. In silico prediction methods have become essential resources for analyzing these variants. Software programs for splicing analysis use a wide variety of bioinformatic approaches. Several splice site prediction tools compare the predicted mutant sequence to a consensus sequence, based on a set of functional acceptor or donor splice sites 13 . A drawback of this approach is that low-frequency nucleotides present in functional splice sites are not represented, which can lead to misinterpretation and false-positive mutation predictions. One example of this was illustrated by Rogan and Schneider (1995), in which the variant, IVS12-6T>C in MSH2, described by Fishel et al. (1993) was predicted to be benign, despite being located 6 nt from the natural acceptor splice junction 14,15 . The consensus sequence fails to indicate that C and T at this position are nearly equally probable, which reclassified this transition as a polymorphism rather than a pathogenic variant. This conclusion is supported by evidence that ~10% of normal individuals without predisposition to non-polyposis colon cancer harbour this alternate allele 16 .
Over the last 20 years, we and others have developed an information theory (IT)-based approach for prediction of splicing mutations, and their impact on mRNA structure and abundance. The effects of these mutations is founded on the formal relationship between IT and the second law of thermodynamics, in that the change in information ascribed to a sequence variant within a splice site is directly related to thermodynamic entropy and free energy of binding 17,18 . A weight matrix consisting of the Shannon information (product of the probability of each nucleotide and -log 2 of its probability) at each position of the splice site is constructed. The individual information for a splice site (R i , in bits) is defined as the dot product of this weight matrix and the unitary vector of a particular splice site sequence. The magnitude of the information content of a nucleotide within a given site is an indication of its level of conservation relative to a set of functional sites. This method retains all of the sequence variability inherent in each model of donor and acceptor splice sites. By contrast, each base in the consensus sequence has the maximum R i value, which is actually rare in the human genome, and is generally not representative of the preponderance of natural splice sites. Prior to the introduction of IT-based approaches, consensus sequence-based methods were widely used 13 . Also, the use of neural networks, trained on sequences experimentally determined to be "bound" and "unbound", was another early approach used to predict splice sites 19 . However, these unbound set of sequences are known to harbour some contaminating functional sites 20,21 , which can limit the sensitivity and specificity of these networks 22 .
There are instances when IT does not accurately predict the consequences of a splice variant. This can often be attributed to instances involving multiple sites or multiple regulatory factors, which are not components of current splicing models. In addition, splicing regulatory proteins can share overlapping and degenerate binding sites, and may exert conflicting effects (for example, serine-arginine [SR] vs. hnRNP proteins), making in silico prediction less reliable and accurate in these cases 23 . Finally, functional cryptic splicing motifs occurring deep within the introns can be challenging to identify, because they tend to be less well conserved than natural splice sites 24,25 .
Nevertheless, a number of authors have recommended IT methods for analysis of splice site variants (N = 29; Supplementary Table 1). In fact, this approach has been described as equivalent to using a general reference textbook as a diagnostic tool, which complemented by functional assays, may provide a complete molecular diagnosis 26 . Most of the applications of IT for splicing mutation analysis have involved predominantly rare diseases, as well as some low frequency variants associated with more common genetic conditions. This is because IT has been used to assess how well computed changes in binding affinity conform to levels of expression and/or patient phenotypes.
Many IT studies have focused on sequence variants in individual disorders or genes. Our synopsis of the broader implications of this work sets the stage for this compilation of peer-reviewed variants with accompanying IT analyses. We cover all publications retrieved through PubMed and Google Scholar that cite the use of IT (N = 367; Supplementary Bibliography) before September 2014. These items include primary research articles, review articles, presentations, and theses. Of all references, 216 publications reported variants or other results or analyses pertinent to this review (Supplementary Table 2). In the remaining studies, analyses were either not performed, insufficient information was provided to reproduce the reported result, or authors described unrelated applications of IT-based analysis. We summarize the spectrum of variants analyzed to obtain a global perspective of splicing mutations resulting in genetic disease. We also highlight common errors that can occur in variant analysis and interpretation, and offer guidelines for optimal use of our software programs for interpretation of splicing mutations.

Information theory and splice site analysis
IT was first introduced by Claude Shannon in 1948 and is now used in a variety of disciplines to express the average number of bits (i.e. the information content) needed to communicate symbols in a message 27 . Bits are the basic unit used in computing and can have one of two values (typically the answer to a yes/no, true/false, or +/-problem). In nucleic acid molecular biology, the symbols in the message comprise a group of related, aligned sequences, with the average number of bits in the set corresponding to the amount of information in the message. This is determined from the information content at each position in the sequence, summed over all positions 28 . The average information is depicted graphically by a sequence logo, which stacks the individual nucleotides at each position ranked by frequency, and where the height of the stack is the position-specific contribution to the average information 29 . If the set of sequences are functional binding sites recognized by the same factor, the individual information in each site (i.e. R i value) is related to thermodynamic entropy, and thus, to the free energy of binding 18 .
The information content of a nucleic acid binding site is related to the affinity of its interaction with proteins and other macromolecular complexes, such as the case during mRNA splicing 18 . Information theory-based position weight matrices (PWM; R i [b,l] -also referred to as a ribl -where b and l correspond to the nucleotide and position in the splice site) can be determined for set of known binding sites, in this case, for the purpose of calculating individual and average sequence information 28 . Figure 1 shows an example of sequence logos for the canonical acceptor (or 3', recognized by the U2 spliceosome) and donor (or 5', recognized by the U1 spliceosome) splice sites, computed from the majority of constitutive sites at annotated splice junctions in the human genome 30 . The information contained within the natural splice donor site is distributed between the last codon of each exon and the adjacent 6 nucleotides of intronic sequence, whereas the acceptor sites are almost entirely intronic, extending 26 nucleotides upstream from the exon boundary.
The distributions of R i values for these sets are approximately Gaussian, with a couple of important exceptions, namely the distribution has defined upper and lower bounds 18 . The upper limit corresponds to the consensus sequence, as it is not possible to have stronger binding than an exact match to this sequence. The theoretical lower limit corresponds to R i = 0 bits. An R i value less than zero implies that energy would be required (ΔG > 0 kcal/mol) for a stable binding complex to form, i.e. that the event would not occur spontaneously without an exogenous source of energy. The minimum strength site is zero bits, the equilibrium state (ΔG = 0). Assuming the contacts at each position in the same binding site form independently, this approach is accurate and quantitative. Altering a nucleotide with high information (implying high prevalence and conservation at that position) will have a greater impact on binding, than if a less-well conserved base were altered. The change in information due to a mutation in a site (ΔR i ) is the difference between R i,final and R i,initial values, where R i,final is the information of the sequence containing the variant, and R i,initial the information of the reference (wild-type) sequence. The minimum fold change in binding affinity resulting from the mutation is an exponential function based on ΔR i , or ≥ 2 ΔRi (Ref. 18 ).

Software resources
Delila package/system Information analysis was originally performed using the Delila sequence analysis system, which included a language to process nucleic acid sequences, and a library of sequence tools to retrieve and process various types of sequence data 31,32 . Tools to measure information content of nucleic acid sequences were subsequently added to Delila 28 . Initially, models of information content of bacteriophage T7 RNA polymerase binding sites and other bacterial control systems were studied, and mRNA splice sites were subsequently developed 28,33 . Later, tools to display binding sites as sequence logos of average information, and sequence walkers showing individual information were incorporated into Delila 20,29 . The Automated Splice Site Analysis (ASSA) server introduced in 2004, and its successor, Automated Splice Site and Exon Definition Analysis server (http://splice.uwo.ca; ASSEDA), have been freely available throughout the last decade, and have been used for IT-based calculations on nucleic acid sequences for the preceding 20 years 34,35 . Both ASSA and ASSEDA still use the Delila program suite to retrieve sequences, calculate information content, and create sequence walker representations of individual binding sites.

ASSA/ASSEDA
To simplify mutation analysis, we built a web interface for variant analysis using Delila software as the processing backbone 34 . Our  [-25, +2] and 10 nucleotides for donor [-3, +6] from the first nucleotide of the splice junction (position 0). Nucleotide height represents its frequency at that position. The horizontal bar atop each stack indicates the standard deviation at that position. This figure was modified from Rogan et al. (2003) to include splice sites in genes on both strands of the annotated human reference genome 30 . B) The distribution of deleterious single-nucleotide variants reported at the natural acceptor (left) and donor (right) splice sites. The variants used to populate this graph (Supplementary Table 8) were included only if they were reported to negatively affect splicing (N = 419 for acceptors, 599 for donors). The image was aligned to the sequence logo (A) to illustrate potential correlation of number of splicing variants at a position to the information content at that position. aim was to standardize and facilitate IT-based mutation analysis by using Human Genome Variation Society (HGVS)-approved variant nomenclature (which has since become the worldwide standard), employing server-based retrieval/processing, and reporting results as concise predictions in both tabular and sequence walker display formats. Initially, ASSA results described mutations in relation to genome annotations from the first finished genome release (hg15) 34 . While many publications cited this version of ASSA for novel splicing mutation analysis, continued improvements have introduced more accurate reference sequences, annotations, and models (for both constitutive and regulatory splice sites) based on more comprehensive sets of binding sites. The ASSA server contained the original donor and acceptor information position weight matrices derived by manual curation of GenBank entries 33 , murine donor and acceptor weight matrices, a subset of splicing enhancer elements (SF2/ASF, SC35 and SRp40), and the lariat branch point recognition sequence 33 . ASSA reported the strengths of all potential sites predicted within the window selected by the user, highlighted those with the largest changes in R i , and computed the minimum fold change in binding affinity for each mutation or polymorphism. Tabular results were colour-coded. Unaltered sites above and below the R i,min (described in Minimum splice site information content and exceptions) were highlighted grey and white, respectively. Preexisting sites abolished by the variant (where R i,final < R i,min ) were marked in red, while leaky natural sites (R i,final ≥ R i,min ) were highlighted in blue. Cryptic sites that were created, strengthened, or weakened were highlighted in pink, green and teal, respectively. The server parsed any mutation type described precisely by the HGVS notation, including substitutions, insertions, deletions, and combinations of these changes 36 . Recapitulating variants described in articles before these guidelines were widely adopted proved to be time-consuming and error-prone 22 . Multiple binding factors had to be analyzed simultaneously; however, results were reported independently. The analysis did not consider other factors relevant to splice site recognition, such as the resulting exon size, or potential formation of cryptically spliced exons.
ASSEDA, the successor software to ASSA, provides a new isoform-oriented type of mutation interpretation, updates the coordinate system to HG19 (GRCh37), adds current gene and single nucleotide polymorphism (SNP) annotations (dbSNP135), and provides additional ribls for other splicing regulatory sites (SRp55, TIA1, ELAVL1, hnRNP A1, hnRNP H, and PTB). All models, except those for SRp55 and hnRNP H, have been built using sequences from publicly available CLIP-seq data, and are based on a larger number of binding site sequences. They have been tested by comparing predictions to validated binding sites from published primary literature, and to any splice-altering variants found within them 35 . ASSEDA introduces in silico exon definition analysis by computing the total splicing information across an exon 35 . Total exon information (R i,total ) is the sum of the corresponding donor and acceptor R i values, and corrected for the gap surprisal term, which is based on the length of the potential exon formed using those sites (from RefSeq) 37 . The gap surprisal function is based on the genomewide distribution of constitutive exon lengths, also known as selfinformation. This term ensures that exons are computationally defined using donor and acceptor splice sites in close proximity 37,38 .
Exons of uncommon length lead to large negative gap surprisal terms, which reduces R i,total . When applied to predicted exons that activate a cryptic splice site, comparison of R i,total values can more accurately predict cryptic site use than the strength of this site alone. The gap surprisal term decreases the predicted R i,total value of particularly long internal exons (eg. the 3.4 kb long exon 11 of BRCA1; R i,total = 1.4 bits), which tends to compensate for this effect with strong splice sites and other sequence elements that enhance natural splice site recognition and suppress internal cryptic splice sites.
The exon definition paradigm extends to the assessment of the impact of mutations in ESE/ISS elements. ASSEDA calculates R i,total by adding the R i value of a regulatory splicing element to the contributions of constitutive splice sites, and applying a second gap surprisal term based on the frequency of distance from the splicing element to the nearest natural site. Currently, the effect of only a single splicing factor can be evaluated by the software, although the approach itself is generalizable to multiple regulatory binding sites. If a variant causes changes in the R i values of multiple sites, such as the simultaneous creation of both splicing enhancer and repressor elements, there will be less confidence in ASSEDA's predictions.
Two distinct sets of IT-based models for donors and acceptors are available on ASSEDA. The manually curated ribls were originally determined from 1799 donor and 1744 acceptor sites 33 . We subsequently derived a set of ribl matrices from genome-wide exon annotations 30 . These models were automatically curated using the criteria that enforced R i > 0 for correctly annotated sites. The resultant models consisted of 108,079 acceptor and 111,772 donor splice sites, however these were not formally implemented on the ASSA server until 2011 30 . These genome-wide models are used in the calculation of R i,total values. The ΔR i values for a single nucleotide splicing variant are similar for both sets of models. Variants having opposite predicted effects between the respective donor or acceptor ribls have not been reported. In general, the genome-wide models report slightly lower information contents, however the frequencies of nucleotides at the 5' end of the acceptor site differ significantly. This results in differences in the weights in the -4 to -20 nt region between the manually-curated and the genome-wide acceptor ribl matrix, which can significantly lower R i values based on the genome-wide model. In the genome, thymine is more prevalent than cytosine at these positions and has a higher positive contribution to the overall R i . This can account for up to a 1.97 bit difference between the models. Guanine nucleotides within this sequence window significantly lower the R i values computed from the genome-wide acceptor ribl, as well. While these differences contribute only a 0.1-0.4 bit difference to the R i per nucleotide, the cumulative effect of multiple differences within this window can lead to significant differences between the acceptor R i values.

Shannon Pipeline and Veridical
High-throughput DNA sequencing is generating a deluge of novel variants in patients with genetic diseases, most of which currently have unknown significance (VUS). For example, 20% of the patients with Pelizaeus-Merzbacher disease possess VUS, among which are single or compound heterozygous, rare pathogenic mutations 39 . Many solutions have been proposed, however prediction of pathogenicity by bioinformatic analyses is often inaccurate 40 . The Shannon Human Splicing Mutation Pipeline software predicts mutations at genome scale to predict which variants may alter mRNA splicing and is based on the same principles and IT models used in ASSA and ASSEDA 41 . However, this software processes ~5 million substitutions and/or indels in 10-15 minutes. While initially only available for the CLC-Bio Genomics platform, this software is now offered as a web service (http://shannonpipeline.cytognomix. com ). Variants are batched in standard variant call format (VCF). The pipeline reports any genic variant that affects a known natural site or a cryptic site where R i,initial or R i,final are ≥ 0 bits and ΔR i ≥ 1.0 bits, however more stringent criteria for selecting variants with significant information changes can be applied.
In Shirley et al. (2013), all variants from the complete genomes of three cancer cell lines (A431, U2OS, U251; N = 816,275) were analyzed 41 . Variants that were common (≥ 1%) were removed. Variants that weakened natural sites, or strengthened cryptic sites to levels comparable to or exceeding the strength to the nearest natural site, were flagged. Variants that strengthen a natural site could have an effect on the splicing profile of a gene (i.e. reduce the frequency of exon skipping for the corresponding exon), but are less likely to cause a deleterious phenotype. The overall fraction of mutations flagged, after filtering out distant cryptic sites and small ΔR i values, averaged 0.016%, illustrating how the software can be used for prioritizing variants. Some of the prioritized variants occurred in genes with known defective functional and biochemical pathways in these cancer cell types, i.e. cytokine signalling (in A431), DNA replication and cell cycle (in U2OS). Natural splice mutations were confirmed by expression data to a greater extent than either leaky or cryptic splice site variants.
In a complete cancer cell line genome, the number of cryptic sites with altered R i values greatly exceeds the number of affected natural splice sites. Many of these are weak decoys, which can occur throughout genes. Using the principle that novel cryptic sites that are likely to be activated must compete with the natural splice site for spliceosomal recognition, the relevant cryptic sites are restricted to those with R i values comparable to or greater than the corresponding strength of the adjacent natural site of the same polarity 22 . Additionally, the proximity of potential cryptic sites to the natural site should be considered in assessing whether an exon could be formed with the natural splice site of opposite polarity. Cryptic sites that are considerably weaker than the nearest natural site of the same type, or cryptic sites that would lead to unusually large exons, diminish the likelihood of cryptic site activation. Benaglio et al. (2014) used the Shannon Pipeline to screen 303 sequenced patients and flagged five variants that each strengthened or created a different cryptic site 42 . While comparable in strength to the natural site, these were all distant (>400 nucleotides away) and thus, less likely to be recognized. The authors also stated that the ΔR i values for three of these sites were discordant with results obtained with NNSplice, a neural network-based splicing prediction program. In fact, both the Shannon Pipeline and NNSplice demonstrated strengthening of these decoy cryptic splice sites.
Shirley et al. (2013) evaluated the predictions of the Shannon Pipeline by manually inspecting RNAseq data for each variant with significant information changes in each cell line 41 . However, manual review is unfeasible for many large datasets, especially from tumors, because of the large numbers of potential somatic mutations affecting splicing in each genome. Veridical, an in silico method for validation of DNA sequencing variants that alter mRNA splicing, has been developed to provide high throughput, statistically-robust unbiased evaluation based on RNAseq data 43 . The method has been implemented as software for analysis of potential splicing variants from large datasets and catalogues their effects. Veridical takes Shannon Pipeline output from predicted genomic variants with effects on splicing and performs a case-control analysis of corresponding expressed transcripts that cover the same genomic region, taken from normal tissues. Upon Yeo-Johnson transformation of the expressed read count distribution, parametric statistics are used to compare normal and abnormal mRNA species (exon skipping, intron inclusion, and cryptic site use). Veridical is designed to be used with large data sets, as the statistical analysis gains power with increasing numbers of control samples. A recent study of 442 breast cancer tumors from the Cancer Genome Atlas Project revealed 5,206 putative splicing mutations using the Shannon Pipeline. Veridical was then used to confirm exon skipping, leaky or cryptic splicing of 988 of these variants 44 .

Natural sites
The early splice site recognition literature often oversimplified the composition of the U1/U2-type 5' donor and 3' acceptor sites by presenting only consensus sequences and truncating the positions in each site 13,45,46 . However, the conserved tracts extend well beyond the canonical GT and AG dinucleotides adjacent to intron/exon junctions. Furthermore, a small, albeit significant, proportion of natural donor sites (~800, or 0.7%) contain cytosine at position +2 in the genome. This is reflected by a corresponding small decrease in average information at this position ( Figure 1). Sequences adjacent to these positions are more variable, but are nevertheless essential for the accurate recognition by the spliceosome. Specifically, the donor site is defined by the three terminal nucleotides of each exon and the first seven bases of the downstream intron. Conversely, acceptor sites are represented by the first two bases of the exon and the last 26 bases of the upstream intron. Because ASSA and ASSEDA use an integer-based coordinate system, there is a zero coordinate at the first intronic base of each splice site (Figure 1), which is not used in the conventional numbering system. The coordinate ranges for the donor and acceptor site positions are therefore [-3, +6] and [-25, +2], respectively. Individual information analysis computes the R i values over these intervals for normal and variant-containing splice sites. As discussed below, information content present in intronic intervals justifies sequencing and analyses of sequences well beyond the locations of the splice junctions themselves.
Certain variants within donor and acceptor sites are tolerated and may even have benign effects, while others have a deleterious impact on spliceosomal recognition. IT accounts for all of these possible outcomes. Unusual donor sites (i.e. with cytosine at position +2) are detected by information analysis, but could be falsely called deleterious by consensus sequence-based methods. Although the terminal position of exons contributes significantly to donor splice sites with a preference for G, a significant proportion of sites naturally possess A or U at this position, or less frequently, C.
Of the published IT-based variant analyses, single nucleotide variants (SNVs) that were reported to affect a natural splice site (multi-nucleotide and insertion/deletion variants are listed separately in Supplementary Table 3) were compiled and reanalyzed. After reducing this set to only those variants occurring within the intervals covered by the splice site information weight matrices described above, 1152 SNVs were reported to affect the strengths of either natural donor or acceptor sites. A variant was considered deleterious if it was predicted to affect splicing (either leaky expression or exon skipping), or if it was experimentally shown to reduce or abolish splicing of the corresponding exon. In instances where prediction and validation did not concur, the latter were used to determine the effect of the variant. Variants predicted to have a neutral effect but demonstrated to be deleterious in the validation study were classified as damaging. In total, 1010 deleterious natural splice site variants were analyzed (Supplementary Table 4).
Sequence conservation has long been considered a surrogate measure of evolutionary constraint and, by inference, functional significance. The average information quantitates the relative conservation at each of the positions within a binding site. We compiled the mutation spectra for all mutations that significantly affected the strengths of donor and acceptor splice sites and compared these with the average information contents at each position. The panels in figure 1b respectively indicate, at each position of the natural acceptor and donor sites, the frequencies of variants deemed deleterious by information analysis. Interestingly, when the sequence logo is overlaid with the histogram of the corresponding mutation spectra, the relative frequencies of deleterious mutations and the average information are comparable. Indeed, these frequencies and the information contents across each type of site are strongly correlated (r=0.95 for acceptors and 0.89 for donors). Our interpretation is that the susceptibility to deleterious mutation at a position is related to its overall conservation within the splice site, which reflects the contribution of that ribonucleotide to the stability of the interaction with the corresponding spliceosome. Nevertheless, there is an unstated bias in ascertainment in these mutation spectra. Variants occurring at sites with low information and/or that are benign are underrepresented, as they are less likely to be associated with genetic disease, and were less likely to be reported. Also, the distribution is dependent on the region sequenced by the authors of the reviewed publications; in early work, the full sequence interval containing the entire splice site was sometimes not included or unavailable for analysis.
An interactive website was created to summarize this set of SNVs. This software application renders interpretations of variant effects in a more practical, useful way than the corresponding table of supplemental data (Supplemental Table 10). The "Splicing Mutation Calculator" (SMC; http://splicemc.cytognomix.com ) is a web service that amalgamates all published results for the same type of substitution in a natural splice site, regardless of genic context. Variants that create cryptic splice sites were not included, because we consider these cases to be sequence-specific as opposed to positional. With this program, users have the option of exploring mutation data (at present, only SNVs can be analyzed) linked to the original literature citations. SMC processes and provides literature support for the variants that occur within the defined regions spanned by natural splice sites. The user first selects the type of site (donor or acceptor), position (based on ASSEDA's integer-based system), wild-type or reference nucleotide, and the alternate substitution at that position ( Figure 2a). The software tool outputs the ΔR i and the number of variants that have been reported and analyzed to date using IT (Figure 2b). SMC provisionally classifies the reported variants based on the degree to which these predicted  Table 10).
effects are expected to decrease spliceosomal affinity, and consequently splicing. The following criteria are empirically based on affinity changes and a summary of published phenotypes associated with these changes: "Deleterious" (if the site is weakened by more than 7.0 bits), "Probably Deleterious" (if the site is weakened such that -4.0 bits ≥ ΔR i ≥ -7.0 bits), "Leaky" (the site is weakened such that -1.0 bits ≥ ΔR i ≥ -4.0 bits), or "Benign, probable polymorphism" (if the site is weakened by less than 1.0 bits). In this first release of SMC, we have omitted "benign" variants, which are likely polymorphisms; these will be catalogued and included in a later version. It is important to appreciate that the ΔR i is a constant for a specific nucleotide change at a specific position, though the absolute strength of the splice site depends on the sequence context of the mutation. This context varies between mutations, and R i,initial is not the same for each case, which can result in different R i,final values for different mutations.
Besides published sources, the software also can predict effects of mutations by computing ΔR i values directly. Particular substitutions that have not been reported in Supplemental Table 10 can nonetheless be provisionally interpreted. The ΔR i value is computed and reported from the ribl. While SMC enables rapid exploration of results for validated and novel mutations, it is, however, not a replacement for ASSEDA or the Shannon Pipeline, since it does not consider the sequence context, which can also influence the interpretation of deleterious, leaky, or benign variants.

Minimum splice site information content and exceptions
The minimum theoretical information content of a binding site, R i,min , is zero bits 18 . Comparison of the R i, values of a series of inactivated and minimally active splice sites revealed the minimum strength of functional splice sites (R i,min ) to be at least 2.4 bits for the original donor and acceptor models of Stephens and Schneider (1992) (based on 103 mutations with functional validation, including 57 natural and 46 cryptic site activating mutations) 22 . This value was redefined based on information models from a genome-wide set of donor and acceptor models (Figure 1a) to be 1.6 bits using the identical set of mutations 30 . It is likely that the differences between these values are not significant and are attributable to the increased precision of the ribl using the ~50-fold larger set of sites. Weakened natural sites, with significantly reduced R i values that remain above these thresholds, are considered to be leaky (lower affinity binding), whereas those below this threshold are found to completely abolish natural splice site recognition, resulting in either exon skipping or activation of neighbouring cryptic splice sites. However, these outcomes are not mutually exclusive, since leaky splice site mutations may also result in exon skipping and/or activate neighboring cryptic sites. Natural splice sites below these thresholds are extremely rare, and their recognition is likely enhanced through the binding of specific RNA binding proteins that promote exon definition (eg. XPC exon 4 acceptor and MYBP3 exon 12 acceptor 47,48 ).
Leaky natural sites have R i values exceeding the R i,min threshold, which, in theory, retain some capacity to be recognized by the spliceosome. There were 84 variants predicted to cause leaky splicing, of which 19 were shown experimentally to lead to exon skipping without any detectable residual natural splicing (Supplementary  Table 2 5,360,379,409), and three are donor mutations at position +4 (ΔR i ~ -2.6 bits; #120, 365, 573). The R i,final values of these 19 inactivated natural sites range from 2.7 to 8.8 bits, which suggests the possibility that the variant may also simultaneously affect other adjacent or overlapping sites that preclude recognition of the mutated natural site. Additionally, weakening of 11 of these variants activates a neighbouring cryptic splice site, in which no residual natural splicing was detected. However, changes in splice site preference due to small changes in binding affinity within exons are probably related to the processive nature of donor splice site selection 49 .
Leaky splicing mutations are readily detected when the expressed transcript contains the causative variant or a neighbouring polymorphism. However, there are a number of practical limitations on the methods for experimental validation of leaky splicing mutations. RT-PCR alone would only be considered reliable for confirmation of homozygous mutations (and in one case, a compound heterozygote where two separate variants abolished natural splicing of the same exon), unless combined with a secondary quantitative methodology 50 . Similarly, it is difficult to assess leaky splicing of heterozygotes using RNAseq data, as reduced levels of wild-type splicing are challenging to determine without adequate read coverage and controls for comparison. However, leaky splicing can be assessed by comparing the frequency of the causative allele to the normal allele in the same cell line when the variant is present within the sequenced reads 41 . These are special cases however, as the variant itself must either be expressed within an exon or, if intronic, must lead to an activation of a cryptic site further into the corresponding intron.
We previously suggested that weaker splice sites are more susceptible to mutational inactivation relative to stronger sites 22 . In the present study, all experimentally verified variants affecting natural sites (where leaky and abolished splicing could be differentiated) were analyzed (N = 98). Variants predicted to abolish splicing (R i,final < R i,min and/or ΔR i < 7.0 bits) were filtered out, as large changes in binding affinity will essentially abolish splicing, despite remaining binding strength and regardless of initial R i value. Supplementary Figure 1 illustrates the frequency of inactivation by these variants relative to initial R i value. Variants occurring at weak splice sites (R i,initial < 4 bits) abolish splicing in 5 of 6 cases (where ΔR i < 7 bits), but are not represented as they all weaken the site below R i,min . The remaining variant slightly weakens a site where R i,initial is -0.1 bits (where ΔR i = 0.5 bits), and its recognition may be supported by SR elements 47 . Moderate strength splice sites (5-11.0) bits are inactivated in 25-60% of cases, and mutations at strong splice sites (R i,initial ≥ 12 bits) tend to be leaky (Supplementary Figure 1b).
Mutations that abolish natural sites (without cryptic splice site activation) are expected to result in a complete loss of normal splicing. However, of the 94 variants that reduced natural splice site strength below R i,min , 11 were reported to have residual normal splicing activity (Supplementary Table 2

Activation of cryptic splicing
It has been estimated that 1.6% of disease causing missense mutations can affect splicing and recent predictions suggest that approximately 7% of exonic variants in the general population may disrupt splicing, which includes cryptic splicing 57,58 . The genome is replete with pseudo (or decoy) splice sites with varying degrees of similarity to natural sites that are not recognized in constitutive splicing 59 . However, mutations that alter the strengths of either these decoys or the natural splice site of the same polarity may shift the balance of isoforms towards non-constitutive splice isoforms that predominate over or eliminate normal mRNAs ( Figure 4). Mutations can create a cryptic splicing event by creating or strengthening a site in either intronic or exonic regions ( Figure 4, Type 1), weaken the natural site while simultaneously altering an overlapping decoy site (Figure 4, Type 2), or exclusively weaken the natural site, leading to the activation of a pre-existing decoy site (Figure 4, Type 3). Although the contributions of cryptic splicing to genetic disease have long been recognized, IT analysis correctly predicts most, but not all, cases ( Figure 4). The challenges in identifying potential cryptic sites or determining activation are attributable to our incomplete understanding of the requirements for activation 60-62 , which include exon the G of the +1 position of the donor site (Supplementary Table 2: #185/750 and 1326), which is essentially invariant in functional splice sites. This suggests potential problems in IT or experimental analysis of these mutations. Surprisingly, the majority of these variants occur at the +2 position of a donor splice site and are T>G mutations, which are predicted to abolish splicing activity 41 . However, the analysis of RNAseq data for these variants showed no splicing defects (Supplementary Table 7: #1315, 1321, 1325, 1361, 1380 and 1407). One explanation is that resultant aberrantly spliced transcripts were subjected to nonsense-mediated decay (NMD) and degraded. Another possibility is that the coverage of these splice junctions is insufficient to distinguish expression of a single allele from that same allele plus the leaky splice junction. The remaining variants differ in the position within the splice site and decrease natural site strengths to between 0.9 to 2.2 bits 22,51 .
Theoretically, a site lacking the canonical G at +1 (donor) or -1 (acceptor) position of a natural site may exceed R i,min . Ozaltin et al.
(2011) and Di Leo et al. (2009) each assessed mutations at positions +1 or -1, which weaken natural splice sites to R i > R i,min , and note that these sites are predicted to be leaky 53,54 . However, this is not the sole criterion for interpreting splice site mutations using ITbased methods. The overall change in binding affinity must also be considered, as both mutated sites were predicted to have only 0.4-0.5% of the binding affinity of the corresponding natural splice sites 53,54 .

Branch-point mutations
Although branch-point site (BPS) recognition occurs independently and post-exon definition, mutations in this sequence have also been described, due to its proximity to the natural acceptor site. Following the recognition of and binding to the 5'ss (upstream donor site) by the U1 snRNP, the U2 is recruited to the 3'ss (downstream acceptor) and recognizes the BPS, resulting in the formation of the pre-spliceosome 55 . Association of U2 with the BPS is essential, as it is the first energy-requiring step, allowing for the tri-snRNP complex of U4/U6 × U5 to be recruited to the BPS, which produces a catalytically active spliceosome 56 . The BPS typically contains a conserved adenosine and a downstream polypyrimidine tract. It is located within 40 nt of the natural 3'ss, however there are reported cases where it can be up to 400 nt away.
Recognition of the BPS is thus a crucial step in proper splicing, and sequence variants can disrupt this event, impede lariat formation, and intron excision. The complete list of BPS variants analyzed using the ASSA and ASSEDA server can be found in Supplementary Table 5. The variants range in distance from 0-76 nt from the natural acceptor junction, and either weaken, abolish or strengthen the BPS. When validation assays were performed, the prediction by the server was correct in 9/11 cases. We deemed the two other cases to be partially discordant (NM_004628:c.413-24A>G and NM_005902:IVS8-55A>G). ASSEDA predicted these variants to abolish the BPS, but leaky and normal splicing was observed, respectively. The predictions are partially concordant with experimental findings because ASSEDA also predicted the existence of nearby alternative BPS, which if used, could account for the observed phenotype.
Although IT-based prediction of a variant effects on BPS has been accurate, the number of validated sites used to compute the ribl is substantially smaller (N = 20), and it is not as reliable as those used to determine R i values of natural acceptor and donor sites. Furthermore, these motifs are short and relatively frequent in unspliced mRNA. One possible explanation for the rarity of BPS mutations is that compensatory, alternative BPS sequences can be recognized and used. Furthermore, the weak constraint on the precision of the distance between the BPS and the 3' (acceptor) splice site (Figure 3) further enables activation of these alternative sites. These factors increase the chance that a variant will be falsely predicted to affect a BPS. For example, variants within donor splice site sequences are routinely predicted to alter strength of false BPS. This error is easily avoidable if the potential recognition sequence is filtered for the genomic context of the variant, as well as its proximity to acceptor splice sites.  length, processivity of donor site recognition, and involvement of splicing regulatory factors. A database of aberrant 3' and 5' splice sites has been compiled 62 .
Another bioinformatic method for cryptic site recognition relies on a training set composed of cryptic sites that are known to be used 63 . There are a number of drawbacks to this approach: the training set is itself not representative of all cryptic sites; and sites that are altered but unused cannot be discriminated from those that are activated (since the latter group also depends on the strength of the corresponding natural splice site). IT-based methods rank cryptic and cognate natural site strength in a way that predicts whether the site will be activated, as well as the abundance of each pair of splice isoforms. Furthermore, the structures of the prospective isoforms are presented by ASSEDA with relative quantitation of each, both prior to and post-mutation.
During our review, we noted 203 variants with experimental support for cryptic splicing (Supplementary Tables 6-8). Of these, 38 variants resulted in Type 1 cryptic splicing. From those, site activation (existence of the site and strength ≥ 2.4 bits 22 ) was correctly predicted by ASSEDA in 34 cases (89.5%). We identified 56 variants resulting in Type 2 splicing, 38 of which (67.7%) were accurately predicted, while the remaining 119 variants resulted in Type 3 cryptic splicing and 99 (90.8%) of the alternate splice sites matched predictions.
Prediction of Type 3 cryptic splicing was more accurate than Types 1 or 2. The criteria for concordance with experimental data were that ASSEDA predicted both the cryptic site and that the variant weakened the natural site. However, the strength of a site is not the sole determinant of whether or not a site is activated. Unlike natural sites, novel cryptic sites are not under selection to maintain binding to the spliceosome, and their genomic context is less constrained than natural splice sites. The presence of cooperative splicing enhancer or repressor elements adjacent to cryptic sites, which could influence cryptic splice site activation, is not yet predictable. Additionally, many of the reported activated cryptic sites have been confirmed using non-quantitative approaches, and these may not constitute the predominant splice forms relative to constitutive exons with stronger natural sites. Finally, certain isoforms may not be detected; as aberrant transcripts are often subject to degradation and the tools used to evaluate functional splicing consequences do not always have sufficient resolution to distinguish small differences in isoform structure. All of these factors can affect the concordance of predicted cryptic site activation with experimental validation.
We also separated each sub-group of cryptic splice variants by location (intronic vs. exonic) and computed the average difference in strength between pairs of natural (post-mutation) and the activated cryptic sites. For intronic Type 1 variants, activated cryptic sites were 0.86 ± 5.28 bits stronger than the corresponding natural site (N = 12). There were eight Type 1 variants (4 at acceptors and 4 at donors) that were missed, because the R i,final value of the natural site exceeded the strength of the corresponding cryptic site by ≥ 1.0 bits (variants with ΔR i < 1.0 bits are not reliably detected experimentally). We hypothesize that these cases could be explained by concomitant changes in surrounding regulatory binding site sequences. Exonic Type 1 variants were often slightly weaker than their cognate natural sites (-1.1 ± 3.8 bits; N = 26). Nearly all of these involved ectopic donor site activation (12 of 13), consistent with a processive mechanism for donor site recognition, which searches downstream from the acceptor splice site to the first donor site of sufficient strength to form an exon 35 . The opposite pattern was observed with intronic Type 2 cases, in which 20 of 21 exceptions occurred at acceptor sites. On average, the activated cryptic site exceeded the strength of the cognate natural site (1.3 ± 4.6 bits; N = 57). Activated, exonic Type 2 acceptor cryptic sites tended to be weaker than their natural site counterparts (-2.2 ± 3.3 bits; N = 4). This result may be attributable a low sample size, with 2 of these mutations exhibiting natural sites that were stronger (≥ 1.0 bits) than the corresponding cryptic site (1 donor and 1 acceptor). Finally, Type 3 activated intronic cryptic sites exhibited the greatest difference between the strengths of cryptic sites and cognate natural splice sites (6.3 ± 4.9 bits; N = 104). This category contained the fewest number of exceptional cryptic sites, with R i values less than those of natural sites (5 acceptors and 3 donors). This is consistent with the idea that the intronic cryptic sites are generally not under selection for adjacent functional regulatory binding sites, and, in order to be activated, are required to be substantially stronger than the natural site. Although R i,final values were stronger (2.1 ± 1.9 bits; N = 20) than the natural site, exonic Type 3 cryptic splice sites did not show as great a difference in strength with a single exceptional case (of an acceptor). Despite these exceptions, activated cryptic splice sites are generally stronger than the corresponding natural splice sites 22 .

Combinatorial effects
While functional natural splice sites and an intact BPS are integral for accurate and efficient splicing, other genetic elements have been shown to make essential contributions to exon definition 64 . Introns will often contain more than one potential splice site recognition sequence, but nevertheless, the correct natural site is consistently selected 59 . Differences among the strengths of potential sites, as determined by IT analysis, are a major, but not the sole, determinant of splice site utilization. The implication is that additional sequences within the gene are necessary to ensure specificity and precision of exon recognition. Studies of facultatively expressed alternative exon structures have revealed cis-acting sequence elements that function to enhance or repress exon recognition. These sequences cooperate with factors that recognize natural splice sites, whose sequences and relative strengths can vary considerably. Depending on their context, these elements have been referred to as exonic splicing enhancers (ESEs), exonic splicing silencers (ESSs), intronic splicing enhancers (ISEs) or intronic splicing silencers (ISSs). In general, these elements serve as binding sites for trans-acting elements, which will either promote or impede the spliceosomal recognition of a splice site. The majority of enhancer elements will act through the recruitment of SR proteins and associate components of the U1 and U2 spliceosomes 65,66 . Silencers are often of the hnRNP class, which act through a diversity of mechanisms including steric hindrance, the formation of dysfunctional complexes, or blocking processiveness 67-69 . To add to the complexity of splicing regulation, it has recently been shown that SR protein function is dependent on context, i.e. whether the corresponding binding site is intronic or exonic 70,71 .
To improve accuracy of exon definition, the strengths of regulatory elements (i.e. their R i values) have been incorporated into splicing mutation prediction. The significance of regulatory elements in disease has been demonstrated in many cases. For example, in the NF1 gene, ESE disruption is the primary cause of exon skipping 72 . Many other genes, including APC, SMN, BEST1, PDHA1 73-76 have been proven to harbour variants that disrupt ESEs and have a confirmed impact on mRNA splicing.
Adding to the complexity, the recognition sequences for these RNA binding factors, while well defined, tend to be short, and can vary to the degree that the same sequence may contain overlapping elements of binding sites for multiple factors. However, this does not necessarily imply that such a sequence is bound with similar affinity by each factor or that it contributes to exon definition. At the same time, these sequences tend to be evolutionarily conserved and may be required for proper splicing 77,78 .
ASSEDA optionally incorporates PWMs for regulatory binding sites for mutation analysis (Table 1) in addition to the default donor and acceptor sites. The program selects the most proximate predicted ESE/ISS to the natural splice site when calculating R i,total . The molecular phenotype, which dictates the splice isoforms that are predicted and their relative abundance, accounts for both the potential effect on the natural site and the most relevant splicing regulatory site. For these regulatory binding sites, a second gap surprisal term specific to the ESE/ISS of interest is applied to the R i,total calculation 35 . The gap surprisal functions for SF2/ASF and SC35 have been previously described 35 , where the most common distance of the ESE/ISS is within 10nt of the natural site. The gap surprisal penalty gradually increases with distance from the natural site. Gap surprisal distributions for ELAVL1, TIA1 and SRp55 show a similar pattern, while hnRNPA1 and PTB binding sites are strongly clustered around splice junctions. It should be feasible to include the contributions of multiple splicing regulatory binding sites of the same or different RNA binding proteins in determining R i,total ; however this capability had not yet been implemented. Currently, if multiple sites of the same type are altered, the strongest (before or after mutation) is chosen by ASSEDA software.
Although the disruption of splicing regulatory sequences can cause aberrant splicing, the interpretation of variants affecting these sites is not as straightforward. Due to their degenerate nature, short sequence, and a lack of understanding of the context of their use, altered regulatory sites should be functionally validated before being deemed pathogenic 7 . Using variants from a number of different studies, ASSEDA accurately predicted experimentally determined changes in binding at a splicing regulatory site 75% of the time (N = 12) 35 . However, there were instances where regulatory sequences had been analyzed by IT, and considered to contribute to disease, but the results were not reproducible. For example, Kölsch et al. (2009) described SNPs associated with Alzheimer's Disease, one of which strengthened and created SRp40 and SRp55 sites, respectively, but were reported by authors to be abolished 94 . This study did not report any evidence to support the significance of these predictions.
Functional validation of the effects of these mutations could contribute to understanding the roles of these factors in regulating constitutive splicing. Similarly, there is still little understanding on how multiple regulatory binding sites within the same region function as a unit. Using a pull-down assay, Olsen et al. (2014) demonstrated how different variants affect the binding of multiple regulatory proteins. One mutation was predicted to create and strengthen multiple hnRNPA1 sites and slightly strengthen an SF2/ASF (SRSF1) site. The pull-down studies showed up-regulation of hnRNPA1 binding and a decrease in SF2/ASF binding. However, SF2/ASF binding increased when a mutation disrupting hnRNPA1 affinity was introduced, suggesting that the strong hnRNPA1 sites outcompete the weaker SF2/ASF site.
In some instances, alterations in regulatory splice site recognition sequence and natural splice strength occurred concomitantly, with both predicted to have similar effects on splicing. Alteration of a regulatory sequence can sometimes provide a plausible explanation for discordant in silico prediction and experimental validation. As an example, Smaoui et al. (2004) analyzed a donor site mutation (NM_001040667:c.1327+4A>G) in HSF4 in a family with congenital cataracts 50 . This variant was predicted to cause leaky splicing (R i,final = 5.4 bits; ΔR i = -2.6 bits; 67.5% residual binding), however RT-PCR showed complete exon skipping. Our further analysis showed that it is predicted to also create an overlapping hnRNPA1 site (R i,final = 4.2 bits; ΔR i = 17.1 bits). Another case involved a mutation in the XPC gene (NM_004628:c.2033+2T>G) that created a novel intronic cryptic site 4 nt downstream of a natural donor site 95 . However, a weaker site 68 nt downstream from the natural site was activated. A possible explanation could be that activation of the cryptic site is influenced by a neighbouring hnRNPA1 site that is itself strengthened (R i,final = 5.2 bits; ΔR i = 2.2 bits) and an SRp55 site that is significantly weakened (R i,final = 1.9 bits; ΔR i = -4.0 bits).
The effects of changes in regulatory binding site strengths may ascribe potential functions to previous VUS. For example,  23 . The natural donor site was unaltered by the mutation. As indicated earlier, the mutation was confirmed experimentally to increase hnRNPH and hnRNPA1 and decrease SRp40 and SF2/ASF binding.

Validation of results
A number of early mutation studies did not perform expression analysis and relied solely on the ASSEDA or ASSA server to interpret potential mutations. This is not recommended, as there are limitations to any in silico predictive method, which impacts accuracy and precision of the prediction. Assuming that the impact of the mutation on expression can be detected, experimental validation of IT-based mutation analysis can reveal its limitations. We describe the various validation methods that were employed in the articles where expression data were available. Below, advantages and disadvantages of these approaches are explored, as well as how lower sensitivity validation can result in misinterpretation. Finally, we determine the accuracy of IT-based prediction, and point out some instructive, discordant cases.

Validation methods
The two most widely used methods for validating mutant mRNA splicing isoforms have been RT-PCR analysis of patient mRNA, and transfection of minigene constructs expressing the mutated exon into cell lines, followed by RT-PCR. These assays were, in some cases, accompanied by other techniques such as direct sequencing of cDNA, Western blotting, luciferase expression assays or immunostaining. A number of studies used quantitative RT-PCR or real-time PCR to estimate isoform abundance. RNA or cDNA sequencing and exon expression microarrays were also used in several studies to support in silico predictions. Certain functional assays that we reviewed were unique to a single study, including: allelic instability, exon trapping, immunoprecipitation of splicing factors, and flow cytometry 23,97-99 . Other indirect methods of justifying the association between a splice site variant and disease included fundoscopy, loss of heterozygosity, blood protein levels, and segregation with disease 100-103 . Because a variant may result in aberrant splicing but might not be accompanied by a detectable phenotypic change, we excluded the results of indirect assays of phenotype. Indirect measures of phenotype can support disease association, but do not inform about accuracy of splicing prediction.
Endpoint RT-PCR and minigene assays probe the specific variant in question, but do not reveal relative abundance of each isoform, whereas qPCR does. Neither method resolves mRNA sequence at the nucleotide level, which can fail to confirm predicted splicing mutations, especially in instances where a small number of nucleotides are retained at the constitutive splice junction 104 . The resultant frameshifted mRNAs can cause premature truncation of the transcript (PTC), instability, and NMD, leaving no evidence of the mutated isoform (unless the cells had been treated with an NMD inhibitor). A disadvantage is that in cases where the protein is not degraded, but still impaired or dysfunctional, the result will be incorrectly categorized as benign.

Regulatory sequence variants
A number of assays have been developed to confirm direct effects of variants on splice site recognition, however fewer methods are available to measure effects of mutations at binding sites of splicing regulatory proteins 113 . The most reliable approach is to associate a change in splicing with a change in regulatory protein binding. A combination of electrophoretic mobility shift assay and RT-PCR were used to confirm that a predicted change in an SF2/ASF binding site caused exon skipping in the CFTR gene 114 . Others performed RNA affinity purification in combination with Western blotting 23 .
Another approach tests multiple variants at the same position through minigene assays. Anczuków et al. (2008) observed that two variants at the same position in BRCA1 (c.3600G>T and c.3600G>C) predicted different effects on regulatory sequences, as well as different observed effects on splicing 115 . The G>T variant predicted abolition of a SRp40 site and weakening of an SF2/ASF site by both ASSA and ESEfinder, and showed a significant reduction in the relative amount of normal transcript. The G>C variant, which did not elicit a change in splicing, was not predicted by ASSA to have a significant effect on either site (although ESEfinder predicted weakening of the SRp40 site below its default threshold). The difference in splicing efficiency could be due to the loss of binding by one or both of these regulatory proteins. This assay associates predicted changes to regulatory protein binding site strength to changes in splicing. A direct binding assay would lend key support for such predictions.

Accuracy of IT-based prediction
We previously evaluated the accuracy of IT-based prediction using a set of validated splicing mutations (85.2%; N = 61) 25 . Other studies have also evaluated the accuracy of ASSA/ASSEDA while evaluating differences between multiple predictive programs and have shown varying levels of concordance (68.8%, N = 16; 90.1%, N = 22; 100%, N = 24) 51,104,116 . With a comprehensive list of all published variants analyzed using IT-based methods (Supplementary Bibliography), we perform a meta-analysis of all of these variants to minimize bias in interpretation and impact of ascertainment of specific phenotypes from individual studies. The list of variants is more extensive than any previous study examining accuracy of ITbased methods. The variants are not restricted to a single or even group of diseases, but rather cover over 150 different conditions (see Supplementary Table 2).
In total, 905 variants were reported in 122 different publications to have been validated for their effects on splicing (1,727 total variants analyzed from 216 papers -Supplementary Table 9). In all cases, the authors performed information analysis; however, the validation experiments were sometimes contained in the original reports and in other cases, later studies. In a minority of mutations, the validation results were either uninformative (N = 36) or the methods did not directly imply an effect on splicing (N = 2); these mutations were therefore excluded in determining the accuracy of predictions (shaded in grey in Supplementary Table 9).
More specifically, in order for experimental results and predictions to be considered concordant, one or more of the following criteria had to be met: a. A variant predicted to abolish a splice site did abolish splicing, with no residual splicing observed. Exceptions to this were assays in which both the mutant and wild-type alleles were expressed in the same cell line or patient sample, and could not be discriminated from one another (i.e. RT-PCR); b. A variant predicted to be leaky exhibited residual normal splicing, with the exception of cases where a much stronger cryptic splice site was activated; c. A variant that strengthened the natural site and showed normal or increased levels of the wild-type isoform, consistent with it having a benign phenotype and/or polymorphic; d. A variant predicted to activate a pre-existing splice site, while also reducing the natural splice site strength, was demonstrated experimentally to result in cryptic splicing, regardless of whether it was predicted it to be the predominant isoform; e. A variant predicted to affect a splicing regulatory protein-binding site was consistent with validation experiments explicitly assessing binding affinity and associated splicing alterations.
Cumulatively, 87.9% of variants documented by expression studies (762 of 867) that satisfied these criteria were accurately predicted by ASSEDA. A minority of papers reported variants to be "partially concordant" (3.1%; 27/867), meaning that while the cryptic site observed was predicted, it was not the most likely splice isoform relative to other expressed cryptic exons. Because this method of scoring met our criteria (see point d above), we included these in our determination.

Predicted mutations discordant with validation results
Limitations of both the predictive model and the validation data/ methods were the primary reasons for discordance. Where information analysis predicted a neutral change or no effect, but validation showed aberrant splicing, we hypothesize that there are either unrecognized splicing regulatory protein binding sites that are weakened or abolished, or that there are underlying mechanisms that are not currently addressed by current information models 23,35,50-52,54,96,98,99,114,117-127 . The validation methods used can also contribute to discordant results. We note that 41 discordant results originated from one of our own studies 41 . This study used RNAseq data to validate predictions, a genome-wide approach that should be used with caution when inferring changes resulting from potential splicing mutations. Until this study was published, ITbased mutation analysis was based on single or candidate disease gene studies. RNAseq reveals all changes in transcript levels for all genes, which although potentially relevant to splicing, may not necessarily contribute to the phenotype in question. This leads to the possibility, especially in cancer phenotypes, of bystander effects (global splicing dysregulation, natural alternative splicing) that are not directly attributable to the predicted mutations. Furthermore, because the sequence reads at splice junctions are short and often limited in number, a relevant splicing aberration may result from a given variant, but it was not detectable. Finally, the predictions of IT can pick up variants that should alter splicing for example, of rare recessive alleles, that that may not have any disease relevance.

Misinterpretation of variant effects
While preparing this review, several variants misinterpreted with ITbased tools were noted. These variants have been re-analyzed to disseminate the correct findings and to avoid making similar errors in the analysis of newly discovered variants. Supplementary Table 2 contains these results. The most common problems result from unfounded emphases on altered or pre-existing cryptic sites that are determined to be significantly weaker relative to the cognate natural site 109,128-132 , and from selectively reporting a single change in the R i value when, in fact, multiple significant changes can be detected 48,128,133-136 . An example of the first type of error is exemplified by a variant in CGI-58 (ABDH5), where the natural splice site is 9.1 bits (or ≥ 549-fold) stronger than the reported cryptic site 129 . Henneman et al. (2008) selectively reported the effect of a mutation that weakens a natural donor splice site in APOA5, however only a change in the information content of an SC35 binding site was indicated.
Other common problems include incorrect declaration of small  144,146 . Strong, pre-existing cryptic sites outside of the default sequence analysis window (54 nt circumscribing the mutation) have also been missed because the window was not expanded to include these sites 147 . Although the predicted isoform structure generated by ASSEDA will, by default, display skipping for mutated natural sites with ΔR i ≥ -7.0 bits (or ≥ 128-fold) 116 , smaller decreases in natural site strength of an internal exon can sometimes partially induce exon skipping. This value is adjustable, and it may be advisable to explore different thresholds depending on the particular susceptibility of a splice junction to exon skipping. Sharma et al. (2014) used the default threshold from ASSEDA to interpret CFTR mutations c.2988G>A (9.6 to 6.6 bits, natural donor site of exon 18) and c.2657+5G>A (9.1 to 5.6 bits, natural donor site of exon 16), but exon skipping was documented. IT analysis was not discordant for these variants, which significantly weaken the corresponding splice sites by ≥8and 11-fold, respectively, and has been shown in other genes to lead to exon skipping, leaky splicing, or both of these outcomes. Aissat et al. (2013) tabulated variants that were predicted to affect strengths of ESE binding sites, but did not comprehensively report all findings even though predictions by ASSA and ESEfinder were concordant (eg. CFTR: c.1694A>G). Alternate mutation entry methods, which do not use contextual gene name annotations, such as entry by rsID, report predicted binding changes on both strands. A report indicating abolition of SRp40 binding sites on the antisense strand was confused with binding sites for CYP46A1, which is transcribed from the sense strand 148 .
Other problems include inadvertent mislabelling of splice site type or location 149-151 , interchange of the terms information content and change in information (R i and ΔR i ) 122 , and unclear variant interpretation (i.e. "run on into the intron") 152  Similarly, variants that strengthen natural splice sites [186][187][188] are also likely to be neutral, though these variants can increase retention of exons that are otherwise frequently alternatively spliced 189,190 . However, binding site variants with minimal splicing information changes may still alter mRNA processing by disrupting mRNA secondary structure 191 .

Polymorphisms and splicing
Early studies suggested that common polymorphic sequence variations at splice sites corresponded to small ΔR i values, consistent with these changes having little impact on mRNA abundance 22 . More recently, it has been appreciated that certain rare SNPs have significant genetic loads, can actively alter mRNA splicing profiles, and lead to non-obvious splicing phenotypes 58, 189 . Nevertheless, it is not uncommon for reports to solely analyze novel variants and ignore known SNPs 136,156,158,192 , or limit results only to those that occur in the vicinity of natural splice sites 184 . We find that 56.4% of common SNPs (with population frequencies ≥ 1% in Supplementary  Table 2: #1291, 1296, 1431, 1435, and 1436). These include rs10190751 in CFLAR, which modulates the production of two short isoforms, and is associated with an increased risk of lymphoma 189,193 , rs3892097, which alters exon inclusion in CYP2D6 30 and leads to a non-functional protein and altered drug metabolism 194 , and rs1805377 in XRCC4 34 , which has been associated with oral cancer susceptibility 195 and increased risk of gliomas 196 . There is also experimental support for common SNPs that have been predicted to affect splicing 22,98,107,110,114,118,149,164,189,197 . For example, experimental evidence for increased exon inclusion has been described for three of six SNPs that increase strength of natural splice sites 189,190 . Numerous common SNPs, which were either deemed neutral or predicted to affect splicing, have not been confirmed experimentally 14,22,25,34,52,94,99,131,133,144,145,148,151,165,166,168,179,183,185,188,[198][199][200][201][202][203][204][205][206][207][208] .
Polymorphisms with significant information changes should be investigated, as they may not be completely benign and can have a significant impact on mRNA splicing.

Inference of variant pathogenicity by IT analysis
Recently, American College of Medical Genetics and Genomics recommendations for reporting incidental findings in sequencing have suggested that bioinformatic predictions are not sufficient to declare clinical significance 209 . Preceding the publication of these guidelines, numerous peer-reviewed articles suggested variants analyzed by IT to be causative/pathogenic/disease-causing, without confirmation of the predicted splicing effect 101,135,137,150,160,167,178,205,[210][211][212][213][214][215][216][217][218]  These authors recommended that the outcome of 3 programs was sufficient for analysis, unless all three predictions were discordant from one another (2 for false positive predictions). Despite the obvious appeal of consensus between different analytical methods, a major caveat in using polling strategies for mutation assessment is that these approaches are prone to both systematic and sampling errors 40 .
We summarize results of 36 publications that used both IT-based prediction tools and one or more alternate prediction tool (14 for 5' and 3' splicing, six for splicing regulatory proteins) to assess mutations 23 The analysis performed by the authors allowed us to compare the similarity of predictions to those programs and IT in Table 2a and  Table 2b. Those most commonly used for 5' and 3' splice sites (NNsplice, MES, NG2, HSF and SSF) were highly concordant for natural sites (85.4% for donor and 77.6% for acceptor sites; Table 2a). Discordance of acceptor predictions may be due to methodologies that do not analyze the complete acceptor site (HSF analyzes only 14 intronic nucleotides upstream of acceptor splice sites) 236 . Some programs (SSF, HSF) exhibit greater concordance with IT for cryptic splice site prediction (96% for donor and 76.9% for acceptor sites). The level of discordance between IT and other commonly used software programs (59.5% for donor and 60% for acceptor sites) may be attributable to the empirically-derived scoring thresholds and the validation sets used to predict mutated splice sites. Models that are typically built (or trained) using known natural splice sites may be less sensitive for differentiating true cryptic splice forms from decoys in the genome, which tend to be weaker than natural splice sites. Tools are highly consistent when analyzing variants expected to be neutral with respect to splicing (100%; N = 71). Colombo et al. (2013) compared nine programs to evaluate accuracy in predicting mRNA splicing effects and reported that ASSA, along with HSF, demonstrated 100% informativeness and specificity.
ASSEDA has also been used to analyze RNA binding proteins that enhance or silence exon recognition (Table 2b). ESEfinder was used for 42.2% of these mutations in one or more regulatory binding  Table 2b). The discordance with ESEfinder may be associated with differences in the respective analytic methods, as several of the models (SF2/ASF, SC35, SRp40) used by ASSA and ESEfinder were created from the same source of experimental data 87, 239 . While the majority of the discordant results were cited in a single study 114 (5/6 variants), the small size of the dataset (ranging from 28-34 sites) may artificially exacerbate differences between these results. In multiple instances, ASSA has been used to analyze SR proteins, but other programs were used to analyze 5' and 3' splice site mutations 23,99,115 . This was surprising, since the donor and acceptor R i values are generated by default by ASSA and ASSEDA. The advantage of performing both constitutive and regulatory splice site analysis with IT is that all results are reported on the same scale, and the strengths of all interactions, and effects of mutations are directly comparable to one another.

Other applications of information theory-based splice site analysis
The use of IT to analyze splicing is not limited to sequence variant analysis. The natural and alternative splicing of several genes have been characterized using this method 107, 200,240 . Khan et al. (2002) scanned all natural sites in the XPC gene and found a weak acceptor (-0.1 bits), and with RT-PCR found that this exon (exon 4) was skipped to a greater extent than another (exon 7), which possessed a strong acceptor, illustrating a relationship between the information content of a natural splice site and its level of alternative splicing. IT has also been used in genetic engineering in the design and alteration of binding sites, and has been used in the design of constructs for transgenic animal models [241][242][243] . Thus, IT-based splice site analysis can be adapted for other important molecular genetic applications.

Guidelines for information theory-based splicing mutation analyses
Our comprehensive review of the use of IT in splicing mutation analysis has led us to propose general recommendations, which we formulate as guidelines. Adoption of these guidelines should ensure the accurate and comprehensive results from IT analyses of VUS and other pathogenic variants that alter mRNA splicing.

Report gene isoform and genomic coordinates
When analyzing a variant with ASSEDA, the user is prompted to select an mRNA isoform (GenBank or RefSeq accession) from the gene in question. When entering the same variant (in either IVS or c. notation) for different isoforms, either the variant will parse one but not the other representation, or the variant syntax will be processed for both. In the first situation, the user is prompted to verify the position and substitution, which may elicit the realization that the incorrect isoform had been selected. However, in the case where the variant can still be parsed (despite being incorrectly entered for the isoform selected), an incorrect nucleotide may coincidentally have the same sequence, and the user may not necessarily realize that the intended position is not being analyzed. We were unable to reproduce results for several variants, because the mRNA or gene isoform was not reported. This issue could be resolved by comparing the genomic sequence in papers where the context of the mutation was included 50 We note that ASSA/ASSEDA cannot account for genes with redacted exons, where the exon numbering or sequence in the original mRNA accession has not been corrected. A well-known example is BRCA1, for which the constitutive isoform lacks the exon designated as number 4. IVS notation beyond this point in this gene must be reduced by one intron. Alternatively, one of the HGVS-approved methods can be used to input variants, or the variant can be designated with the genomic coordinate (g.) format. Review of ASSA/ ASSEDA output (coordinates and/or the sequence walker 20 ) is a prudent approach to confirm that the correct region has been analyzed. , which may not have properly reported predicted cryptic sites, as they did not report any cryptic sites predicted by ASSA beyond the default window size (54 nt) from the mutation. 2 -predictions made using the SSF-like algorithm in the Alamut splicing prediction module were combined with the SSF category (SSF is no longer supported). 3  To eliminate ambiguity, we recommend that reported variants be accompanied by the accession number used in its analysis (consistent with HGVS notation 36 ) and the genomic coordinates with the corresponding reference genome build. The table of results from ASSEDA or Shannon pipeline output could also be included as supplementary published material. This will ensure that reported results can be reproduced and compared to other experimental or in silico results.

Report R i values
The results generated by IT software provide R i,initial , R i,final , and ΔR i for donor and acceptor sites by default, and for all other ribl matrices selected. Reporting these values along with the interpretation improves the clarity of said interpretation. Several publications have not reported R i , and instead only the interpretation of these values 125,138,146,212,227,251,252 . This presumes that the analysis was performed correctly, and accurately interpreted. In one instance, our reanalysis differed from the published interpretation 138 . Other publications provide R i values, but were incorrectly reported, which resulted in misinterpretations 48,122,253 . Simply reporting ΔR i itself does not provide sufficient information about the context of the mutation or possible cryptic splice sites, which is necessary to fully appreciate the resultant effect on splicing 136,245,254 . We recommend R i values be provided for each variant analyzed. We also suggest that the specific donor and acceptor ribl used for variant analysis be indicated, because of the differences obtained using the genomewide and original PWMs in IT analysis 30,33 . The distinction can also be significant, when the R i,final value of a mutated splice site approaches R i,min .
Consider impact of missense and synonymous mutations on mRNA splicing Missense and synonymous mutations can alter natural splicing, create cryptic sites, and alter crucial ESE and ESS binding sites 255 . IT tools have been employed to analyze exonic variants that strengthened or create exonic cryptic sites, which were also confirmed experimentally 25,39,41,43,98,105,116,124,130,149,151,178,256,257 . Similarly, IT tools can predict potential effects on strengths of SR and hnRNP protein recognition sites 23,117 . There is no justification for cataloguing intronic and exonic variants, but only assessing splicing effects for the intronic variants or those within natural splice sites 119,132,175,186,208,210,214,215,248,258,259 . We recommend that IT-based analysis should evaluate all variants within a gene for potential splicing mutations.

Experimentally validate variants
Many studies have reported only coding changes and the results of IT (or other in silico) analyses without experimental validation. Our review indicated that IT-based splicing predictions are highly concordant with validation results (87.9%) Nevertheless, the discordant mutations support the need for robust post-prediction validation, since even a single discordant result can lead to misdiagnosis. We do not detect any consistent pattern amongst the discordant predictions to provide guidance as to which IT analyses will be erroneous. Experimental verification will mitigate incorrect interpretations of IT predictions and has been recommended by others 26 .
Report the sequence window used in the analysis ASSA/ASSEDA allows the user to alter size of sequence window range surrounding the mutation. The default window range has been set to maximize the speed of analysis, which is to some degree dictated by the number of matrices and the length of the sequence analyzed. Arbitrary abbreviation of the sequence analysis window can result in the failure to detect activated intronic or exonic cryptic sites, which can in some instances significantly lengthen (eg. 231 and 313 nucleotide extensions, respectively 166,171 ) or shorten the corresponding natural exon. Therefore, we suggest expanding this window if one wishes to assess the possibility that long range, preexisting cryptic splice sites may be activated.
We note that unequivocal prediction of cryptic splice site use in large exons (> 1000 nt) can be challenging due to the reliance of these gene regions on splicing enhancers, silencers, and other regulatory elements to prevent ectopic splice site use and ensure fidelity of splicing 260  There are several other stronger candidate cryptic splice sites that occur between the natural and cryptic splice site, but there is no evidence that any are used in the individual carrying this mutation.
Designate genic rearrangements (insertions, deletions, duplications) with genomic coordinates Complex insertions and deletions in IVS or c. notation may occasionally be parsed to the wrong coordinates within a gene. Indels will parse properly when genomic coordinates are used. If IVS or c. notation is used, it is suggested that users confirm that the expected alteration of the mutation is correct by reviewing the sequence walker display generated by ASSEDA for all insertions, deletions and duplications. All data from the extensive review of the literature presented in the article are reported as Supplementary tables 1 through 10. The following data are provided: 1) articles referring to information theory as a tool for splice site mutation analysis; 2) complete list of reviewed variants; 3) indels, duplications and multinucleotide variants; 4) deleterious natural site variants; 5) branch point variants; 6-7) Types 1-3 cryptic splice site variants; 9) validated variants; 10) splicing mutation calculator data.

GNU GPLv3
Author contributions PKR conceived, coordinated, and directed this study. NGC compiled the literature and determined which articles were eligible for inclusion. NGC and EJM summarized the articles. NGC, EJM and PKR wrote the manuscript, which has been approved by all authors.

Competing interests
PKR is the inventor of US Patent 5,867,402 and other patents pending, which underlie the prediction and validation of mutations. He is one of the founders of Cytognomix, Inc. which is developing software based on this technology for complete genome or exome splicing mutation analysis.