ALL Metrics
-
Views
-
Downloads
Get PDF
Get XML
Cite
Export
Track
Method Article

BOREALIS: an R/Bioconductor package to detect outlier methylation from bisulfite sequencing data

[version 1; peer review: 2 approved with reservations]
* Equal contributors
PUBLISHED 20 Dec 2022
Author details Author details
OPEN PEER REVIEW
REVIEWER STATUS

This article is included in the Bioinformatics gateway.

This article is included in the Bioconductor gateway.

Abstract

Background: Rare genetic disease studies have benefited from the era of high throughput sequencing. DNA sequencing results in genetic diagnosis of 18-40% of previously unsolved cases, while the incorporation of RNA-Seq analysis has more recently been shown to generate significant numbers of previously unattainable diagnoses.  While DNA methylation remains less explored, multiple inborn diseases resulting from disorders of genomic imprinting are well characterized and a growing body of literature suggests the causative or correlative role of aberrant methylation in diverse rare inherited conditions.  Complex pictures of methylation patterning are also emerging, including the association of regional, multiple specific-site or even single-site methylation, with disease. The systematic application of genomic-wide methylation-based sequencing for undiagnosed cases of rare diseases is a logical progression from current testing paradigms.  Similar to the rationale previously exploited in RNA-based rare disease studies, we can assume that disease-associated or causative methylation aberrations in an individual will demonstrate significant differences from other individuals with unrelated phenotypes.  Thus, aberrantly methylated sites will be outliers from a heterogeneous cohort of individuals.
Methods: Based on this rationale, we present BOREALIS: Bisulfite-seq OutlieR MEthylation At SingLeSIte ReSolution.  BOREALIS uses a beta binomial model to identify outlier methylation at single CpG site resolution from bisulfite sequencing data.
Results: Utilizing power analyses, we demonstrate that BOREALIS can identify outlier CpG methylation within a cohort of samples.  Furthermore, we show that BOREALIS is tolerant to the inclusion of multiple identical outliers with sufficient cohort size and sequencing depth.
Conclusions: The method demonstrates improved performance versus standard statistical testing and is suited for single or multi-site downstream analysis.

Keywords

methylation, outlier, rare disease, diagnostic odyssey

Introduction

Multiple inborn diseases resulting from disorders of genomic methylation are well characterized. A growing body of literature reports associations between DNA methylation and conditions including Parkinson’s Disease (Chuang, et al., 2017) and methylation-based studies have also suggested the causative or correlative role of aberrant methylation in diverse rare inherited conditions (Guastafierro, et al., 2017; Sharp, et al., 2017; Sobreira, et al., 2017).

Methods for profiling DNA methylation traditionally focused on wide genomic regions, particularly CpG islands. While microarray methylation profiling continues to be utilized, next-generation sequencing has enabled analysis at read count-level resolution and greatly increased numbers of genomic CpG sites (Wu, et al., 2015). High-resolution methodologies have enabled complex pictures of methylation to emerge, including association of multiple specific-site or even single-site methylation with developmental processes or disease (Bui, et al., 2012; Choi, et al., 2018; Claus, et al., 2012; Fürst, et al., 2012; Hashimoto, et al., 2013; Nile, et al., 2008; Pogribny, et al., 2000; Scantamburlo, et al., 2017; Sohn, et al., 2010; Takahashi, et al., 2017).

DNA sequencing results in the diagnosis of up to 40% of genetic disease cases previously unsolved using standard clinical testing (Sawyer, et al., 2016). RNA-Seq has been investigated and has shown benefit in complementing DNA testing (Cummings, et al., 2017). With the growing evidence linking methylation to disease, the application of methylation-based sequencing for undiagnosed cases of rare inherited disease is a logical progression from current testing paradigms. Expanded methylation profiling will offer the ability to detect diagnostic signals unique to the epigenome, undetectable in DNA and RNA due to lack of measurable manifestation in those materials, or due to shortcomings in current technologies or analytical approaches.

An ideal method to detect deviant methylation should offer the ability to profile at single CpG sites while enabling flexibility to consolidate calls across regions. In the context of genetic disease, we can assume that disease-associated methylation aberrations in an individual will show significant differences from individuals with unrelated phenotypes. A similar rationale was successfully used by us and others in outlier-based RNA analysis (Jenkinson, et al., 2020). However, existing solutions for the detection of differentially methylated CpG sites from bisulfite sequencing focus on traditional group vs group or multi-group experimental designs (Wreczycka, et al., 2017) and are therefore not suited to genetic disease or other outlier-based analyses (Figure 1).

37dcad26-615e-42b1-a456-a8e7444d24ea_figure1.gif

Figure 1. Conceptual differences between traditional case vs control analysis and the BOREALIS approach.

While traditional approaches to differential methylation analysis are based on group case vs. control analysis, BOREALIS utilizes a one vs. many outlier approach whereby individual(s) are compared to a cohort. This approach is especially useful when multiple similar cases are difficult to identify, as is the case in rare genetic disease studies. By comparing every affected individual to a cohort of heterogeneous individuals, outlier methylation can be identified at individual CpG sites for all members of the cohort, without the requirement for multiple similar cases.

Methods

At a given CpG site, we assume we have data in the form of methylated counts xi and total read counts ni for individuals i = 1, …, I in a cohort of size I. If every individual in the population had the exact same probability p of methylation at this site, i.e., p1 = p2 = … = pi = p where pi is the (true-but-unknown) probability of methylation for the ith individual at this site, then the methylated counts xi would be binomially distributed with parameters p and ni. However, we expect varying degrees of sample-to-sample variability in the probability of methylation at a given site even in a healthy cohort. Therefore, it is more biologically accurate to assume that pi for i=1, …, I have been sampled from a distribution over the unit interval. A common and mathematically convenient choice for this distribution is a beta distribution with parameters α and β. Thus the observed number of methylated reads xi for the ith individual can be viewed as being generated from a two-step process whereby the probability of methylation is selected pi ~ Beta (α,β) and given this probability the number of methylated reads is binomially distributed xi|pi ~ Binomial (pi,ni). Viewed in this way, we say the methylated reads are beta-binomially distributed xi ~ Beta-Binomial (ni,α,β) with parameters α and β, and popular packages for differential DNA-methylation detection such as Dispersion Shrinkage for Sequencing (DSS) (Feng and Wu, 2019) use this same distribution for methylated counts.

What BOREALIS does differently from traditional tools such as DSS (Park and Wu, 2016) is that it builds its statistical model explicitly for the purpose of outlier detection compared to a cohort, which requires alternative statistical framing and considerations as compared to group-versus-group analyses (Jenkinson, et al., 2020). Specifically, at each CpG site, BOREALIS takes the input data {(xi,ni): i = 1, …, I} and estimates the population-level parameters α and β using gamlss library (https://www.gamlss.com) (Feng and Wu, 2019). We implement Laplace Smoothing on the counts (i.e. we use as counts xi~= xi+1 and ni~= ni+2) as a regularization step to help deal with any samples with low counts. From these estimated α and β parameters, we (for the ith sample) compute the left-sided p-value by looking at the probability that a value of xi or fewer methylated reads were generated from a Beta-Binomial (ni,α,β), and likewise a right-sided p-value would evaluate the probability that a value of xi or greater methylated reads came from this distribution. We implement this probability calculation using the pBB function of gamlss. The two-sided p-value is computed as two times the lesser of these one-sided p-values.

To validate performance of the BOREALIS method, we performed Monte Carlo simulations of an outlier sample in cohorts of varying sizes sequenced at varying depths of coverage. Namely, after selecting a cohort size I and an average depth of coverage D we conducted 10,000 Monte Carlo simulations wherein each sample i in the cohort has sequencing depth di drawn from a Poisson distribution with mean D. The number of methylated reads in cohort sample i is drawn from a Beta-Binomial distribution with mean 0.8 and dispersion 0.1, and then these simulated cohort data are fit using the BOREALIS model. An outlier sample is then simulated with sequencing depth d drawn from a Poisson distribution with mean D and number of methylated reads given by a Binomial distribution with mean 0.3. BOREALIS is then used to compute a p-value thresholded at level 0.05, and the power is given by the fraction of the 10,000 simulations that correctly reject the null hypothesis. Full code to replicate the power analysis is provided as described in the Data Availability section.

Results

The results of the power analysis are plotted in Figure 2 demonstrating that BOREALIS can accurately detect outlier methylation events in modest cohort sizes and sequencing depths. BOREALIS can successfully identify outlier methylation with high statistical power utilizing a wide range of sample numbers (3-100+) and read depths (< 10 – 100s). BOREALIS is also tolerant of multiple identical outliers provided sufficient sequencing depth and cohort size as shown in Figure 3. The method supports multithreaded computation, as well as splitting across chromosomes to facilitate parallelism across compute nodes in a cluster environment. To provide users with the ability to visually review the methylation distributions underlying any call made by BOREALIS, we provide a built-in plotting function whose outputs are illustrated in Figure 4.

37dcad26-615e-42b1-a456-a8e7444d24ea_figure2.gif

Figure 2. BOREALIS power analysis and single site methylation profile output by BOREALIS.

Graphical summarization of Monte Carlo simulations of an outlier sample in a cohort of varying sizes and depths of sequencing coverage. Ten thousand simulations were performed for each set of experimental conditions whereby random sampling of sequencing depth for each sample and the proportion methylated reads was performed. BOREALIS built its beta-binomial model for each simulated cohort and an outlier sample was simulated with BOREALIS used to compute a p-value. Power estimation is based on the simulations correctly rejecting the null hypothesis at a level p < = 0.05. The parameters mu (mean methylation fraction at a given site), sigma (variability in methylation fraction at a given site) and muAb (deviation from the mean methylation level in the outlier sample) are fixed for the purposes of the simulation shown.

37dcad26-615e-42b1-a456-a8e7444d24ea_figure3.gif

Figure 3. BOREALIS power analysis containing multiple individuals with identical outlier methylation events.

Graphical summarization of Monte Carlo simulations of (A) two outlier samples and (B) three outlier samples in a cohort of varying sizes and depths of sequencing coverage. While rare disease studies generally aim to identify single outlier individuals in a cohort, this demonstrates the ability of the BOREALIS approach to identify multiple, identical outliers in the presence of sufficient read coverage and cohort size. One thousand simulations were performed for each set of experimental conditions whereby random sampling of sequencing depth for each sample and the proportion methylated reads

was performed. BOREALIS built its beta-binomial model for each simulated cohort and an outlier sample was simulated with BOREALIS used to compute a p-value.

37dcad26-615e-42b1-a456-a8e7444d24ea_figure4.gif

Figure 4. BOREALIS package functionality visually represents the methylation profile at any given CpG site, as compared to the cohort.

Here a single site within the LTB4R gene promoter is shown for Patient 72, from the BOREALIS Bioconductor package’s included test data. Full details on how to perform BOREALIS analysis and generate similar figures are detailed in the BOREALIS vignette, included with the package.

BOREALIS package vignette outline

BOREALIS is packaged with a vignette that will enable new users to become quickly familiar with program outputs and potential downstream use-cases. These include topics including:

  • 1) Running the core BOREALIS method on a cohort

  • 2) Post-processing

  • 3) Generating summary metrics

  • 4) Annotating program outputs with user-defined genomic features

  • 5) Generating visual outputs for single-site data

  • 6) Summarizing single-site data across genomic features

The vignette is available in HTML format within Bioconductor online (https://bioconductor.org/packages/release/bioc/vignettes/borealis/inst/doc/borealis.html) or packaged with the Bioconductor package itself. This vignette provides a hands-on introduction to the package and will be beneficial for new users, prior to them developing their own specific workflows.

Conclusions

BOREALIS is a novel R/Bioconductor package that addresses an unmet need in bisulfite sequencing-based genetic disease studies. The method is suited for single or multi-site downstream analysis and can successfully identify outlier methylation with high statistical power, providing a new avenue of exploration in the quest for increased diagnostic rates in genetic disease patients. It is readily available and easily implemented, enabling seamless integration with other common pipelines and tools.

Author contributions

GRO’s contributions include Data Curation, Formal Analysis, Investigation, Methodology, Software, Supervision, Validation, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing. WGJ was involved with Conceptualization, Data Curation, Formal Analysis, Methodology, Project Administration, Software, Supervision, Validation, Visualization, Writing – Review & Editing. RJO and LESR participated in study design, helped with analysis and edited the manuscript. EWK participated in study design, secured funding and edited the manuscript.

Software availability

Comments on this article Comments (0)

Version 1
VERSION 1 PUBLISHED 20 Dec 2022
Comment
Author details Author details
Competing interests
Grant information
Copyright
Download
 
Export To
metrics
Views Downloads
F1000Research - -
PubMed Central
Data from PMC are received and updated monthly.
- -
Citations
CITE
how to cite this article
Oliver GR, Jenkinson WG, Olson RJ et al. BOREALIS: an R/Bioconductor package to detect outlier methylation from bisulfite sequencing data [version 1; peer review: 2 approved with reservations]. F1000Research 2022, 11:1538 (https://doi.org/10.12688/f1000research.128354.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.
track
receive updates on this article
Track an article to receive email alerts on any updates to this article.

Open Peer Review

Current Reviewer Status: ?
Key to Reviewer Statuses VIEW
ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions
Version 1
VERSION 1
PUBLISHED 20 Dec 2022
Views
6
Cite
Reviewer Report 24 Oct 2023
Vicente Yepez, Technical University of Munich, Munich, Germany 
Approved with Reservations
VIEWS 6
Oliver et al., developed an R package to detect outlier methylation from bisulfite sequencing data. They use a beta binomial model for this. This can be applied in rare disease diagnostics.

I think the method is novel ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Yepez V. Reviewer Report For: BOREALIS: an R/Bioconductor package to detect outlier methylation from bisulfite sequencing data [version 1; peer review: 2 approved with reservations]. F1000Research 2022, 11:1538 (https://doi.org/10.5256/f1000research.140933.r213346)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Views
15
Cite
Reviewer Report 23 Aug 2023
Philipp Jurmeister, Institute of Pathology, Ludwig Maximilians University Munich, Munich, Bavaria, Germany 
Maximilian Leitheiser, Institute of Pathology, Ludwig Maximilians University Munich, Munich, Bavaria, Germany 
Approved with Reservations
VIEWS 15
Oliver et al. present a novel method for the detection of differentially methylated CpG sites in bisulfite sequencing read count data, together with an implementation thereof in the R software package 'borealis'. Its statistical model is specifically designed to detect ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Jurmeister P and Leitheiser M. Reviewer Report For: BOREALIS: an R/Bioconductor package to detect outlier methylation from bisulfite sequencing data [version 1; peer review: 2 approved with reservations]. F1000Research 2022, 11:1538 (https://doi.org/10.5256/f1000research.140933.r189477)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.

Comments on this article Comments (0)

Version 1
VERSION 1 PUBLISHED 20 Dec 2022
Comment
Alongside their report, reviewers assign a status to the article:
Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations - A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions
Sign In
If you've forgotten your password, please enter your email address below and we'll send you instructions on how to reset your password.

The email address should be the one you originally registered with F1000.

Email address not valid, please try again

You registered with F1000 via Google, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Google account password, please click here.

You registered with F1000 via Facebook, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Facebook account password, please click here.

Code not correct, please try again
Email us for further assistance.
Server error, please try again.