BOREALIS: an R/Bioconductor package to detect outlier methylation from bisulfite sequencing data [version 1; peer review: awaiting peer review]

Background: Rare genetic disease studies have benefited from the era of high throughput sequencing. DNA sequencing results in genetic diagnosis of 18-40% of previously unsolved cases, while the incorporation of RNA-Seq analysis has more recently been shown to generate significant numbers of previously unattainable diagnoses. While DNA methylation remains less explored, multiple inborn diseases resulting from disorders of genomic imprinting are well characterized and a growing body of literature suggests the causative or correlative role of aberrant methylation in diverse rare inherited conditions. Complex pictures of methylation patterning are also emerging, including the association of regional, multiple specific-site or even single-site methylation, with disease. The systematic application of genomic-wide methylation-based sequencing for undiagnosed cases of rare diseases is a logical progression from current testing paradigms. Similar to the rationale previously exploited in RNA-based rare disease studies, we can assume that disease-associated or causative methylation aberrations in an individual will demonstrate significant differences from other individuals with unrelated phenotypes. Thus, aberrantly methylated sites will be outliers from a heterogeneous cohort of individuals. Methods: Based on this rationale, we present BOREALIS: B isulfite-seq O utlie


Introduction
Multiple inborn diseases resulting from disorders of genomic methylation are well characterized. A growing body of literature reports associations between DNA methylation and conditions including Parkinson's Disease (Chuang, et al., 2017) and methylation-based studies have also suggested the causative or correlative role of aberrant methylation in diverse rare inherited conditions (Guastafierro, et al., 2017;Sharp, et al., 2017;Sobreira, et al., 2017).
DNA sequencing results in the diagnosis of up to 40% of genetic disease cases previously unsolved using standard clinical testing (Sawyer, et al., 2016). RNA-Seq has been investigated and has shown benefit in complementing DNA testing (Cummings, et al., 2017). With the growing evidence linking methylation to disease, the application of methylation-based sequencing for undiagnosed cases of rare inherited disease is a logical progression from current testing paradigms. Expanded methylation profiling will offer the ability to detect diagnostic signals unique to the epigenome, undetectable in DNA and RNA due to lack of measurable manifestation in those materials, or due to shortcomings in current technologies or analytical approaches.
An ideal method to detect deviant methylation should offer the ability to profile at single CpG sites while enabling flexibility to consolidate calls across regions. In the context of genetic disease, we can assume that disease-associated methylation aberrations in an individual will show significant differences from individuals with unrelated phenotypes. A similar rationale was successfully used by us and others in outlier-based RNA analysis (Jenkinson, et al., 2020). However, existing solutions for the detection of differentially methylated CpG sites from bisulfite sequencing focus on traditional group vs group or multi-group experimental designs (Wreczycka, et al., 2017) and are therefore not suited to genetic disease or other outlier-based analyses ( Figure 1).

Methods
At a given CpG site, we assume we have data in the form of methylated counts x i and total read counts n i for individuals i = 1, …, I in a cohort of size I. If every individual in the population had the exact same probability p of methylation at this site, i.e., p 1 = p 2 = … = p i = p where p i is the (true-but-unknown) probability of methylation for the i th individual at this site, Figure 1. Conceptual differences between traditional case vs control analysis and the BOREALIS approach. While traditional approaches to differential methylation analysis are based on group case vs. control analysis, BOREALIS utilizes a one vs. many outlier approach whereby individual(s) are compared to a cohort. This approach is especially useful when multiple similar cases are difficult to identify, as is the case in rare genetic disease studies. By comparing every affected individual to a cohort of heterogeneous individuals, outlier methylation can be identified at individual CpG sites for all members of the cohort, without the requirement for multiple similar cases. then the methylated counts x i would be binomially distributed with parameters p and n i . However, we expect varying degrees of sample-to-sample variability in the probability of methylation at a given site even in a healthy cohort. Therefore, it is more biologically accurate to assume that p i for i=1, …, I have been sampled from a distribution over the unit interval. A common and mathematically convenient choice for this distribution is a beta distribution with parameters α and β. Thus the observed number of methylated reads x i for the i th individual can be viewed as being generated from a two-step process whereby the probability of methylation is selected p i~B eta (α,β) and given this probability the number of methylated reads is binomially distributed x i |p i~B inomial (p i ,n i ). Viewed in this way, we say the methylated reads are beta-binomially distributed x i~B eta-Binomial (n i ,α,β) with parameters α and β, and popular packages for differential DNA-methylation detection such as Dispersion Shrinkage for Sequencing (DSS) (Feng and Wu, 2019) use this same distribution for methylated counts.
What BOREALIS does differently from traditional tools such as DSS (Park and Wu, 2016) is that it builds its statistical model explicitly for the purpose of outlier detection compared to a cohort, which requires alternative statistical framing Figure 2. BOREALIS power analysis and single site methylation profile output by BOREALIS. Graphical summarization of Monte Carlo simulations of an outlier sample in a cohort of varying sizes and depths of sequencing coverage. Ten thousand simulations were performed for each set of experimental conditions whereby random sampling of sequencing depth for each sample and the proportion methylated reads was performed. BOREALIS built its beta-binomial model for each simulated cohort and an outlier sample was simulated with BOREALIS used to compute a p-value. Power estimation is based on the simulations correctly rejecting the null hypothesis at a level p < = 0.05. The parameters mu (mean methylation fraction at a given site), sigma (variability in methylation fraction at a given site) and muAb (deviation from the mean methylation level in the outlier sample) are fixed for the purposes of the simulation shown. and considerations as compared to group-versus-group analyses (Jenkinson, et al., 2020). Specifically, at each CpG site, BOREALIS takes the input data {(x i ,n i ): i = 1, …, I} and estimates the population-level parameters α and β using gamlss library (https://www.gamlss.com) (Feng and Wu, 2019). We implement Laplace Smoothing on the counts (i.e. we use as counts e x i = x i +1 and e n i = n i +2) as a regularization step to help deal with any samples with low counts. From these estimated α and β parameters, we (for the i th sample) compute the left-sided p-value by looking at the probability that a value of x i or fewer methylated reads were generated from a Beta-Binomial (n i ,α,β), and likewise a right-sided p-value would evaluate the probability that a value of x i or greater methylated reads came from this distribution. We implement this probability calculation using the pBB function of gamlss. The two-sided p-value is computed as two times the lesser of these one-sided p-values.
To validate performance of the BOREALIS method, we performed Monte Carlo simulations of an outlier sample in cohorts of varying sizes sequenced at varying depths of coverage. Namely, after selecting a cohort size I and an average depth of coverage D we conducted 10,000 Monte Carlo simulations wherein each sample i in the cohort has sequencing depth d i drawn from a Poisson distribution with mean D. The number of methylated reads in cohort sample i is drawn from a Beta-Binomial distribution with mean 0.8 and dispersion 0.1, and then these simulated cohort data are fit using the BOREALIS model. An outlier sample is then simulated with sequencing depth d drawn from a Poisson distribution with mean D and number of methylated reads given by a Binomial distribution with mean 0.3. BOREALIS is then used to compute a p-value thresholded at level 0.05, and the power is given by the fraction of the 10,000 simulations that correctly reject the null hypothesis. Full code to replicate the power analysis is provided as described in the Data Availability section.

Results
The results of the power analysis are plotted in Figure 2 demonstrating that BOREALIS can accurately detect outlier methylation events in modest cohort sizes and sequencing depths. BOREALIS can successfully identify outlier methylation with high statistical power utilizing a wide range of sample numbers (3-100+) and read depths (< 10 -100s). BOREALIS is also tolerant of multiple identical outliers provided sufficient sequencing depth and cohort size as shown in Figure 3. The method supports multithreaded computation, as well as splitting across chromosomes to facilitate parallelism across compute nodes in a cluster environment. To provide users with the ability to visually review the methylation distributions underlying any call made by BOREALIS, we provide a built-in plotting function whose outputs are illustrated in Figure 4. Graphical summarization of Monte Carlo simulations of (A) two outlier samples and (B) three outlier samples in a cohort of varying sizes and depths of sequencing coverage. While rare disease studies generally aim to identify single outlier individuals in a cohort, this demonstrates the ability of the BOREALIS approach to identify multiple, identical outliers in the presence of sufficient read coverage and cohort size. One thousand simulations were performed for each set of experimental conditions whereby random sampling of sequencing depth for each sample and the proportion methylated reads was performed. BOREALIS built its beta-binomial model for each simulated cohort and an outlier sample was simulated with BOREALIS used to compute a p-value. BOREALIS package vignette outline BOREALIS is packaged with a vignette that will enable new users to become quickly familiar with program outputs and potential downstream use-cases. These include topics including: 1) Running the core BOREALIS method on a cohort 2) Post-processing 3) Generating summary metrics 4) Annotating program outputs with user-defined genomic features 5) Generating visual outputs for single-site data 6) Summarizing single-site data across genomic features The vignette is available in HTML format within Bioconductor online (https://bioconductor.org/packages/release/bioc/ vignettes/borealis/inst/doc/borealis.html) or packaged with the Bioconductor package itself. This vignette provides a hands-on introduction to the package and will be beneficial for new users, prior to them developing their own specific workflows.

Conclusions
BOREALIS is a novel R/Bioconductor package that addresses an unmet need in bisulfite sequencing-based genetic disease studies. The method is suited for single or multi-site downstream analysis and can successfully identify outlier methylation with high statistical power, providing a new avenue of exploration in the quest for increased diagnostic rates in genetic disease patients. It is readily available and easily implemented, enabling seamless integration with other common pipelines and tools.  (Oliver, et al., 2022b) The project contains the following underlying data: • README.txt (Text file with instructions to regenerate power analysis data and graphs)

Author contributions
• plotFig2.R (R code to generate the graph for Figure 2 using the csv input) • plotFig3.R (R code to generate the graph for Figure 3A and 3B using the csv input) Data is under a Creative Commons Attribution 4.0 International license.