BOREALIS: an R/Bioconductor package to detect outlier methylation from bisulfite sequencing data

Gavin R. Oliver; W. Garrett Jenkinson; Rory J. Olson; Laura E. Schultz-Rogers; Eric W. Klee

doi:10.12688/f1000research.128354.1

Method Article

[version 1; peer review: 2 approved with reservations]

Gavin R. Oliver^1,2^*, W. Garrett Jenkinson^1,2^*, Rory J. Olson^1,2^, Laura E. Schultz-Rogers^1,2^, Eric W. Klee^1,2^

^* Equal contributors

PUBLISHED 20 Dec 2022

Author details

¹ Center for Individualized Medicine, Mayo Clinic, Rochester, MN, 55905, USA
² Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, 55905, USA

Gavin R. Oliver
Roles: Data Curation, Formal Analysis, Investigation, Methodology, Software, Supervision, Validation, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

W. Garrett Jenkinson
Roles: Conceptualization, Data Curation, Formal Analysis, Methodology, Project Administration, Software, Supervision, Validation, Visualization, Writing – Review & Editing

Rory J. Olson
Roles: Formal Analysis, Writing – Review & Editing

Laura E. Schultz-Rogers
Roles: Formal Analysis, Writing – Review & Editing

Eric W. Klee
Roles: Funding Acquisition, Investigation, Supervision, Writing – Review & Editing

This article is included in the Bioinformatics gateway.

This article is included in the Bioconductor gateway.

Abstract

Background: Rare genetic disease studies have benefited from the era of high throughput sequencing. DNA sequencing results in genetic diagnosis of 18-40% of previously unsolved cases, while the incorporation of RNA-Seq analysis has more recently been shown to generate significant numbers of previously unattainable diagnoses. While DNA methylation remains less explored, multiple inborn diseases resulting from disorders of genomic imprinting are well characterized and a growing body of literature suggests the causative or correlative role of aberrant methylation in diverse rare inherited conditions. Complex pictures of methylation patterning are also emerging, including the association of regional, multiple specific-site or even single-site methylation with disease. The systematic application of genome-wide methylation-based sequencing for undiagnosed cases of rare diseases is a logical progression from current testing paradigms. Similar to the rationale previously exploited in RNA-based rare disease studies, we can assume that disease-associated or causative methylation aberrations in an individual will demonstrate significant differences from other individuals with unrelated phenotypes. Thus, aberrantly methylated sites will be outliers from a heterogeneous cohort of individuals.
Methods: Based on this rationale, we present BOREALIS: Bisulfite-seq OutlieR MEthylation At SingLe-SIte ReSolution. BOREALIS uses a beta-binomial model to identify outlier methylation at single CpG site resolution from bisulfite sequencing data.
Results: Utilizing power analyses, we demonstrate that BOREALIS can identify outlier CpG methylation within a cohort of samples. Furthermore, we show that BOREALIS is tolerant to the inclusion of multiple identical outliers with sufficient cohort size and sequencing depth.
Conclusions: The method demonstrates improved performance versus standard statistical testing and is suited for single or multi-site downstream analysis.

Keywords

methylation, outlier, rare disease, diagnostic odyssey

Corresponding author: Eric W. Klee

Competing interests: No competing interests were disclosed.

Grant information: This work has been supported by the Mayo Clinic Center for Individualized Medicine, Intramural non-grant funding.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2022 Oliver GR et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Oliver GR, Jenkinson WG, Olson RJ et al. BOREALIS: an R/Bioconductor package to detect outlier methylation from bisulfite sequencing data [version 1; peer review: 2 approved with reservations]. F1000Research 2022, 11:1538 (https://doi.org/10.12688/f1000research.128354.1)
First published: 20 Dec 2022, 11:1538 (https://doi.org/10.12688/f1000research.128354.1)
Latest published: 20 Dec 2022, 11:1538 (https://doi.org/10.12688/f1000research.128354.1)

Introduction

Multiple inborn diseases resulting from disorders of genomic methylation are well characterized. A growing body of literature reports associations between DNA methylation and conditions including Parkinson’s disease (Chuang, et al., 2017), and methylation-based studies have also suggested the causative or correlative role of aberrant methylation in diverse rare inherited conditions (Guastafierro, et al., 2017; Sharp, et al., 2017; Sobreira, et al., 2017).

Methods for profiling DNA methylation traditionally focused on wide genomic regions, particularly CpG islands. While microarray methylation profiling continues to be utilized, next-generation sequencing has enabled analysis at read count-level resolution and greatly increased numbers of genomic CpG sites (Wu, et al., 2015). High-resolution methodologies have enabled complex pictures of methylation to emerge, including association of multiple specific-site or even single-site methylation with developmental processes or disease (Bui, et al., 2012; Choi, et al., 2018; Claus, et al., 2012; Fürst, et al., 2012; Hashimoto, et al., 2013; Nile, et al., 2008; Pogribny, et al., 2000; Scantamburlo, et al., 2017; Sohn, et al., 2010; Takahashi, et al., 2017).

DNA sequencing results in the diagnosis of up to 40% of genetic disease cases previously unsolved using standard clinical testing (Sawyer, et al., 2016). RNA-Seq has been investigated and has shown benefit in complementing DNA testing (Cummings, et al., 2017). With the growing evidence linking methylation to disease, the application of methylation-based sequencing for undiagnosed cases of rare inherited disease is a logical progression from current testing paradigms. Expanded methylation profiling will offer the ability to detect diagnostic signals unique to the epigenome, undetectable in DNA and RNA due to lack of measurable manifestation in those materials, or due to shortcomings in current technologies or analytical approaches.

An ideal method to detect deviant methylation should offer the ability to profile at single CpG sites while enabling flexibility to consolidate calls across regions. In the context of genetic disease, we can assume that disease-associated methylation aberrations in an individual will show significant differences from individuals with unrelated phenotypes. A similar rationale was successfully used by us and others in outlier-based RNA analysis (Jenkinson, et al., 2020). However, existing solutions for the detection of differentially methylated CpG sites from bisulfite sequencing focus on traditional group vs group or multi-group experimental designs (Wreczycka, et al., 2017) and are therefore not suited to genetic disease or other outlier-based analyses (Figure 1).

Figure 1. Conceptual differences between traditional case vs control analysis and the BOREALIS approach.

While traditional approaches to differential methylation analysis are based on group case vs. control analysis, BOREALIS utilizes a one vs. many outlier approach whereby individual(s) are compared to a cohort. This approach is especially useful when multiple similar cases are difficult to identify, as is the case in rare genetic disease studies. By comparing every affected individual to a cohort of heterogeneous individuals, outlier methylation can be identified at individual CpG sites for all members of the cohort, without the requirement for multiple similar cases.

Methods

At a given CpG site, we assume we have data in the form of methylated read counts x_i and total read counts n_i for individuals i = 1, …, I in a cohort of size I. If every individual in the population had exactly the same probability p of methylation at this site, i.e., p_1 = p_2 = … = p_I = p, where p_i is the (true but unknown) probability of methylation for the i^th individual at this site, then the methylated counts x_i would be binomially distributed with parameters n_i and p. However, we expect some sample-to-sample variability in the probability of methylation at a given site, even in a healthy cohort. It is therefore more biologically accurate to assume that the p_i for i = 1, …, I have been sampled from a distribution over the unit interval. A common and mathematically convenient choice for this distribution is a beta distribution with parameters α and β. The observed number of methylated reads x_i for the i^th individual can then be viewed as generated by a two-step process: first the probability of methylation is drawn, p_i ~ Beta(α, β), and then, given this probability, the number of methylated reads is binomially distributed, x_i | p_i ~ Binomial(n_i, p_i). Viewed in this way, the methylated reads are beta-binomially distributed, x_i ~ Beta-Binomial(n_i, α, β). Popular packages for differential DNA methylation detection, such as Dispersion Shrinkage for Sequencing (DSS) (Feng and Wu, 2019), use this same distribution for methylated counts.
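For reference, integrating the unknown p_i out of this two-step process gives the beta-binomial probability mass function in closed form. This is the standard textbook result rather than a formula stated in the article; B(·,·) denotes the Beta function and k is an assumed symbol for a possible methylated read count.

$$
P(x_i = k \mid n_i, \alpha, \beta)
= \int_0^1 \binom{n_i}{k} p^{k} (1-p)^{n_i-k} \, \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha,\beta)} \, dp
= \binom{n_i}{k} \frac{B(k+\alpha,\; n_i-k+\beta)}{B(\alpha,\beta)},
\qquad k = 0, 1, \ldots, n_i .
$$

Under this distribution the expected methylated count is $E[x_i] = n_i \, \alpha / (\alpha + \beta)$, so α/(α+β) plays the role of the cohort-level mean methylation fraction at the site.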

BOREALIS differs from traditional tools such as DSS (Park and Wu, 2016) in that it builds its statistical model explicitly for outlier detection against a cohort, which requires a different statistical framing from group-versus-group analyses (Jenkinson, et al., 2020). Specifically, at each CpG site, BOREALIS takes the input data {(x_i, n_i): i = 1, …, I} and estimates the population-level parameters α and β using the gamlss library (https://www.gamlss.com) (Feng and Wu, 2019). We apply Laplace smoothing to the counts (i.e., we use $\tilde{x}_i$ = x_i + 1 and $\tilde{n}_i$ = n_i + 2) as a regularization step to help deal with samples that have low counts. From the estimated α and β, we compute for the i^th sample a left-sided p-value, the probability that x_i or fewer methylated reads would be generated from a Beta-Binomial(n_i, α, β) distribution, and likewise a right-sided p-value, the probability that x_i or more methylated reads would come from this distribution. We implement this probability calculation using the pBB function of gamlss. The two-sided p-value is computed as twice the smaller of the two one-sided p-values.
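The p-value definitions above can be illustrated with a minimal sketch. It is written in Python with the standard library only, purely for illustration; BOREALIS itself is an R package and computes these probabilities with gamlss. The function names are hypothetical, α and β are taken as already estimated (the gamlss fit is not reproduced), and the cap of the two-sided value at 1 is an added assumption. The text does not say whether the smoothed counts also enter the p-value, so the sketch applies the tail probabilities to the raw counts as worded and shows the smoothing as a separate helper.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_pmf(k, n, alpha, beta):
    """P(X = k) for X ~ Beta-Binomial(n, alpha, beta)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))


def laplace_smooth(x, n):
    """Smoothed counts (x + 1, n + 2), the regularization described in the text."""
    return x + 1, n + 2


def outlier_p_values(x, n, alpha, beta):
    """Left-, right- and two-sided p-values for x methylated reads out of n."""
    pmf = [betabinom_pmf(k, n, alpha, beta) for k in range(n + 1)]
    left = min(1.0, sum(pmf[: x + 1]))  # P(X <= x)
    right = min(1.0, sum(pmf[x:]))      # P(X >= x)
    # Assumption: cap at 1, because doubling a discrete tail can exceed 1.
    two_sided = min(1.0, 2.0 * min(left, right))
    return left, right, two_sided


# Hypothetical example: 3 methylated reads out of 30 at a site where the
# cohort fit gave alpha = 8, beta = 2 (mean methylation fraction 0.8).
left, right, two_sided = outlier_p_values(3, 30, alpha=8.0, beta=2.0)
print(f"left={left:.3g} right={right:.3g} two-sided={two_sided:.3g}")
```

Because the two one-sided tails both include the observed count, they sum to slightly more than 1; doubling the smaller tail is the simple convention the text describes.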

To validate the performance of the BOREALIS method, we performed Monte Carlo simulations of an outlier sample in cohorts of varying sizes sequenced at varying depths of coverage. Namely, after selecting a cohort size I and an average depth of coverage D, we conducted 10,000 Monte Carlo simulations in which each sample i in the cohort has sequencing depth d_i drawn from a Poisson distribution with mean D. The number of methylated reads in cohort sample i is drawn from a Beta-Binomial distribution with mean 0.8 and dispersion 0.1, and these simulated cohort data are then fit using the BOREALIS model. An outlier sample is then simulated with sequencing depth d drawn from a Poisson distribution with mean D and number of methylated reads drawn from a Binomial distribution with methylation probability 0.3. BOREALIS is then used to compute a p-value, which is tested at the 0.05 level; the power is the fraction of the 10,000 simulations that correctly reject the null hypothesis. Full code to replicate the power analysis is provided as described in the Data Availability section.
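The outlier half of this simulation can be sketched as follows. This is not the authors' power analysis code (that is the R code referenced in the Data Availability section) but a simplified, idealized Python illustration using only the standard library. It makes two loudly stated assumptions: the cohort simulation and model fit are skipped and the true cohort parameters are plugged in directly, and the mean/dispersion pair (μ, σ) is mapped to shape parameters as α = μ/σ, β = (1 − μ)/σ, which is our reading of the gamlss beta-binomial parameterization. All function names are hypothetical.

```python
import random
from math import exp, lgamma


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_cdf(x, n, alpha, beta):
    """P(X <= x) for X ~ Beta-Binomial(n, alpha, beta); returns 0 when x < 0."""
    total = 0.0
    for k in range(0, x + 1):
        log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        total += exp(log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))
    return min(1.0, total)


def sample_poisson(rng, mean):
    """Knuth's multiplication method; adequate for modest means."""
    threshold, k, prod = exp(-mean), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def estimate_power(mean_depth, n_sims=2000, mu=0.8, sigma=0.1,
                   p_outlier=0.3, level=0.05, seed=1):
    """Fraction of simulated outliers rejected at the given level."""
    rng = random.Random(seed)
    # Assumed mapping from (mean, dispersion) to shape parameters; the true
    # cohort parameters stand in for a fitted model (idealized setting).
    alpha, beta = mu / sigma, (1.0 - mu) / sigma
    rejections = 0
    for _ in range(n_sims):
        depth = sample_poisson(rng, mean_depth)
        # Outlier methylated reads ~ Binomial(depth, p_outlier).
        x = sum(rng.random() < p_outlier for _ in range(depth))
        left = betabinom_cdf(x, depth, alpha, beta)             # P(X <= x)
        right = 1.0 - betabinom_cdf(x - 1, depth, alpha, beta)  # P(X >= x)
        if 2.0 * min(left, right) <= level:
            rejections += 1
    return rejections / n_sims


if __name__ == "__main__":
    for depth in (5, 10, 30):
        print(depth, estimate_power(depth, n_sims=1000))
```

Since the sketch ignores estimation error in α and β, its output is best read as what a very large cohort might approach, not as a reproduction of the published power curves, which also vary the cohort size I.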

Results

The results of the power analysis are plotted in Figure 2, demonstrating that BOREALIS can accurately detect outlier methylation events at modest cohort sizes and sequencing depths. BOREALIS can identify outlier methylation with high statistical power across a wide range of cohort sizes (from 3 to more than 100 samples) and read depths (from fewer than 10 to hundreds of reads). BOREALIS is also tolerant of multiple identical outliers, provided sequencing depth and cohort size are sufficient, as shown in Figure 3. The method supports multithreaded computation, as well as splitting across chromosomes to facilitate parallelism across compute nodes in a cluster environment. To allow users to visually review the methylation distributions underlying any call made by BOREALIS, we provide a built-in plotting function whose outputs are illustrated in Figure 4.

Figure 2. BOREALIS power analysis and single site methylation profile output by BOREALIS.

Graphical summarization of Monte Carlo simulations of an outlier sample in a cohort of varying sizes and depths of sequencing coverage. Ten thousand simulations were performed for each set of experimental conditions, with random sampling of the sequencing depth and the proportion of methylated reads for each sample. BOREALIS built its beta-binomial model for each simulated cohort, and an outlier sample was then simulated and BOREALIS used to compute its p-value. Power estimation is based on the fraction of simulations correctly rejecting the null hypothesis at a level of p ≤ 0.05. The parameters mu (mean methylation fraction at a given site), sigma (variability in methylation fraction at a given site) and muAb (deviation from the mean methylation level in the outlier sample) are fixed for the purposes of the simulation shown.

Figure 3. BOREALIS power analysis containing multiple individuals with identical outlier methylation events.

Graphical summarization of Monte Carlo simulations of (A) two outlier samples and (B) three outlier samples in a cohort of varying sizes and depths of sequencing coverage. While rare disease studies generally aim to identify single outlier individuals in a cohort, this demonstrates the ability of the BOREALIS approach to identify multiple, identical outliers in the presence of sufficient read coverage and cohort size. One thousand simulations were performed for each set of experimental conditions, with random sampling of the sequencing depth and the proportion of methylated reads for each sample. BOREALIS built its beta-binomial model for each simulated cohort, and an outlier sample was then simulated and BOREALIS used to compute its p-value.

Figure 4. BOREALIS package functionality visually represents the methylation profile at any given CpG site, as compared to the cohort.

Here a single site within the LTB4R gene promoter is shown for Patient 72, from the BOREALIS Bioconductor package’s included test data. Full details on how to perform BOREALIS analysis and generate similar figures are detailed in the BOREALIS vignette, included with the package.

BOREALIS package vignette outline

BOREALIS is packaged with a vignette that enables new users to quickly become familiar with program outputs and potential downstream use cases. Topics covered include:

1) Running the core BOREALIS method on a cohort
2) Post-processing
3) Generating summary metrics
4) Annotating program outputs with user-defined genomic features
5) Generating visual outputs for single-site data
6) Summarizing single-site data across genomic features

The vignette is available in HTML format within Bioconductor online (https://bioconductor.org/packages/release/bioc/vignettes/borealis/inst/doc/borealis.html) or packaged with the Bioconductor package itself. This vignette provides a hands-on introduction to the package and will be beneficial for new users, prior to them developing their own specific workflows.

Conclusions

BOREALIS is a novel R/Bioconductor package that addresses an unmet need in bisulfite sequencing-based genetic disease studies. The method is suited for single or multi-site downstream analysis and can successfully identify outlier methylation with high statistical power, providing a new avenue of exploration in the quest for increased diagnostic rates in genetic disease patients. It is readily available and easily implemented, enabling seamless integration with other common pipelines and tools.

Author contributions

GRO’s contributions include Data Curation, Formal Analysis, Investigation, Methodology, Software, Supervision, Validation, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing. WGJ was involved with Conceptualization, Data Curation, Formal Analysis, Methodology, Project Administration, Software, Supervision, Validation, Visualization, Writing – Review & Editing. RJO and LESR participated in study design, helped with analysis and edited the manuscript. EWK participated in study design, secured funding and edited the manuscript.

Software availability

• Software available from: https://bioconductor.org/packages/release/bioc/html/borealis.html
• Source code available from: https://github.com/GarrettJenkinson/borealis
• Archived source code at time of publication: Zenodo (DOI: https://doi.org/10.5281/zenodo.7342710) (Oliver, et al., 2022a)
• License: GNU General Public License v3.0 (GPL-3)

Data availability

Zenodo. BOREALIS Power Analysis Code and Data. (DOI: https://doi.org/10.5281/zenodo.7343136) (Oliver, et al., 2022b)

The project contains the following underlying data:

• Fig 2.csv (Power analysis data for Figure 2 from the manuscript)
• Fig 2.pdf (Figure 2 power analysis graph in PDF format)
• Fig 2_power_analysis.R (R code to regenerate the power analysis and csv output for Figure 2)
• Fig 3A.csv (Power analysis data for Figure 3A from the manuscript)
• Fig 3A.pdf (Figure 3A power analysis graph in PDF format)
• Fig 3A_power_analysis.R (R code to regenerate the power analysis and csv output for Figure 3A)
• Fig 3B.csv (Power analysis data for Figure 3B from the manuscript)
• Fig 3B.pdf (Figure 3B power analysis graph in PDF format)
• Fig 3B_power_analysis.R (R code to regenerate the power analysis and csv output for Figure 3B)
• README.txt (Text file with instructions to regenerate power analysis data and graphs)
• plotFig2.R (R code to generate the graph for Figure 2 using the csv input)
• plotFig3.R (R code to generate the graph for Figure 3A and 3B using the csv input)

Data are available under a Creative Commons Attribution 4.0 International license.

Acknowledgements

An earlier version of this article can be found in “Detection of outlier methylation from bisulfite sequencing data with novel Bioconductor package BOREALIS” (doi: https://doi.org/10.1101/2022.05.19.492700).

References

Bui C, et al.: cAMP response element-binding (CREB) recruitment following a specific CpG demethylation leads to the elevated expression of the matrix metalloproteinase 13 in human articular chondrocytes and osteoarthritis. FASEB J. 2012; 26(7): 3000–3011.
Choi NY, et al.: Novel imprinted single CpG sites found by global DNA methylation analysis in human parthenogenetic induced pluripotent stem cells. Epigenetics. 2018; 13(4): 343–351.
Chuang Y-H, et al.: Parkinson’s disease is associated with DNA methylation levels in human blood and saliva. Genome Med. 2017; 9(1): 76.
Claus R, et al.: Quantitative DNA Methylation Analysis Identifies a Single CpG Dinucleotide Important for ZAP-70 Expression and Predictive of Prognosis in Chronic Lymphocytic Leukemia. J. Clin. Oncol. 2012; 30(20): 2483–2491.
Cummings BB, et al.: Improving genetic diagnosis in Mendelian disease with transcriptome sequencing. Sci. Transl. Med. 2017; 9(386).
Feng H, Wu H: Differential methylation analysis for bisulfite sequencing using DSS. Quant Biol. 2019; 7(4): 327–334.
Fürst RW, et al.: A differentially methylated single CpG-site is correlated with estrogen receptor alpha transcription. J. Steroid Biochem. Mol. Biol. 2012; 130(1-2): 96–104.
Guastafierro T, et al.: Genome-wide DNA methylation analysis in blood cells from patients with Werner syndrome. Clin. Epigenetics. 2017; 9(1): 92.
Hashimoto K, et al.: Regulated transcription of human matrix metalloproteinase 13 (MMP13) and interleukin-1β (IL1B) genes in chondrocytes depends on methylation of specific proximal promoter CpG sites. J. Biol. Chem. 2013; 288(14): 10061–10072.
Jenkinson G, et al.: LeafCutterMD: an algorithm for outlier splicing detection in rare diseases. Bioinformatics. 2020; 36(17): 4609–4615.
Nile CJ, et al.: Methylation status of a single CpG site in the IL6 promoter is related to IL6 messenger RNA levels and rheumatoid arthritis. Arthritis & Rheumatism. 2008; 58(9): 2686–2693.
Oliver GR, Jenkinson WG, Klee EW: BOREALIS: an R/Bioconductor package to detect outlier methylation from bisulfite sequencing data (3.15). Zenodo. 2022a.
Oliver GR, Jenkinson WG, Klee EW: BOREALIS Power Analysis Code and Data (1.0). [Data set]. Zenodo. 2022b.
Park Y, Wu H: Differential methylation analysis for BS-seq data under general experimental design. Bioinformatics. 2016; 32(10): 1446–1453.
Pogribny IP, et al.: Single-site methylation within the p53 promoter region reduces gene expression in a reporter gene construct: possible in vivo relevance during tumorigenesis. Cancer Res. 2000; 60(3): 588–594.
Sawyer SL, et al.: Utility of whole-exome sequencing for those near the end of the diagnostic odyssey: time to address gaps in care. Clin. Genet. 2016; 89(3): 275–284.
Scantamburlo G, et al.: Interleukin-4 Induces CpG Site-Specific Demethylation of the Pendrin Promoter in Primary Human Bronchial Epithelial Cells. Cell. Physiol. Biochem. 2017; 41(4): 1491–1502.
Sharp GC, et al.: Distinct DNA methylation profiles in subtypes of orofacial cleft. Clin. Epigenetics. 2017; 9(1): 63.
Sobreira N, et al.: Patients with a Kabuki syndrome phenotype demonstrate DNA methylation abnormalities. Eur. J. Hum. Genet. 2017; 25(12): 1335–1344.
Sohn BH, et al.: Functional switching of TGF-beta1 signaling in liver cancer via epigenetic modulation of a single CpG site in TTP promoter. Gastroenterology. 2010; 138(5): 1898–1908.e12.
Takahashi A, et al.: DNA methylation of the RUNX2 P1 promoter mediates MMP13 transcription in chondrocytes. Sci. Rep. 2017; 7(1): 7771.
Wreczycka K, et al.: Strategies for analyzing bisulfite sequencing data. J. Biotechnol. 2017; 261: 105–115.
Wu H, et al.: Detection of differentially methylated regions from whole-genome bisulfite sequencing data without replicates. Nucleic Acids Res. 2015; 43(21): e141–e141.

Open Peer Review

Reviewer Report 24 Oct 2023

Vicente Yepez, Technical University of Munich, Munich, Germany

Approved with Reservations

https://doi.org/10.5256/f1000research.140933.r213346

Oliver et al., developed an R package to detect outlier methylation from bisulfite sequencing data. They use a beta binomial model for this. This can be applied in rare disease diagnostics.

I think the method is novel and sound, but I have some concerns:

Major:

It is unclear to me what is the difference between this article and the one described in: https://www.biorxiv.org/content/10.1101/2022.05.19.492700v1.
It is unclear where the data came from.
It is unclear if no other method to detect outlier methylation from bisulfite sequencing data exists or this is the first study doing that.
It should be described how CpG sites are assigned to genes.
The Results section should describe the following:

- the cohort including number of samples and whether they were affected. Details should be included in Methods.

- how the methylation data was counted and # of features detected. Details should be included in Methods.

- what were the results after applying BOREALIS to the data. How many outliers/sample were obtained.

- A quantile-quantile plot of the p-values must be included to verify that they are calibrated.

- Is there any multiple testing performed? It should, as I guess there are thousands of sites tested.

Minor:

It could be stated in the Abstract the minimal number of the "sufficient cohort size".
To strengthen it, consider adding another citation to the sentence: "DNA sequencing results in the diagnosis of up to 40% of genetic disease cases previously unsolved using standard clinical testing (Sawyer, et al., 2016)".
It should be described how exactly was RNA-seq investigated in the sentence: "RNA-Seq has been investigated and has shown benefit in complementing DNA testing (Cummings, et al., 2017)". Also, consider adding other citations from other studies to it.
OUTRIDER by Brechtmann et al., AJHG 2018¹, should be cited here as it was presumably the 1st method for outliers in RNA-seq data: "A similar rationale was successfully used by us and others in outlier-based RNA analysis (Jenkinson, et al., 2020)".
Other parameters should be tested in the MC simulations besides mean 0.8 and dispersion 0.1.

Is the rationale for developing the new method (or application) clearly explained?
Yes

Is the description of the method technically sound?
Yes

Are sufficient details provided to allow replication of the method development and its use by others?
Yes

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?
No

Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
Partly

References

1. Brechtmann F, Mertes C, Matusevičiūtė A, Yépez VA, et al.: OUTRIDER: A Statistical Method for Detecting Aberrantly Expressed Genes in RNA Sequencing Data. Am J Hum Genet. 2018; 103(6): 907–917.

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: RNA-seq based diagnostics of rare diseases, bioinformatics, statistical modeling

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Reviewer Report 23 Aug 2023

Philipp Jurmeister, Institute of Pathology, Ludwig Maximilians University Munich, Munich, Bavaria, Germany

Maximilian Leitheiser, Institute of Pathology, Ludwig Maximilians University Munich, Munich, Bavaria, Germany

Approved with Reservations

https://doi.org/10.5256/f1000research.140933.r189477

Oliver et al. present a novel method for the detection of differentially methylated CpG sites in bisulfite sequencing read count data, together with an implementation thereof in the R software package 'borealis'. Its statistical model is specifically designed to detect a single outlier sample in a cohort of samples with consistent methylation whereas existing methods for the detection of differential methylation are designed for a group-vs-group approach. The proposed method is an adaption of the recently published LeafCutterMD package which applied the same approach to outlier splicing detection. The outlier detection setting for differential methylation does indeed seem of great relevance, specifically in the use case of the detection of rare diseases which the authors also mention. The performance of the proposed method in the shown analyses of true positive rate across sequencing depths and cohort sizes in a synthetic dataset is in itself convincing. The software package is well documented with detailed function descriptions and an extensive vignette. However, there are several weaknesses which should be addressed to improve the overall quality of the manuscript:

Major concerns

1) The statistical model used here does indeed directly address the described outlier detection setting and its suitability for this task seems plausible. However, we suspect existing software packages implementing a group-vs-group approach could also be applied in this setting (with group sizes 1 and l-1). We would deem a comparison of the performance in the outlier detection setting between borealis and existing software packages necessary to support the claim made about the specific suitability of borealis.

2) (a) While the borealis R packages, as obtained from github, worked flawlessly for us, we were not able to run the code for the reproduction of Figure 2. We adjusted ‘Fig2_power_analysis.R’ to only test aveDepth <- c(80, 100), cohSize <- c(90, 100) and with nSamps = 1000. Running the script consistently yielded power = 0 for all parameters. Further inspection revealed an error occurred in line 32: ‘Error in `[.data.frame`(DatA, , ResCha) : undefined columns selected’ that was caught in exception handling. The scripts for Fig.3 and 4 showed similar behavior. We used gamlss version 5.4-12 and gamlss.dist version 6.0-5. We are happy to provide more information (e.g. full R session info) if that is considered helpful or to receive suggestions regarding potential mistakes in our handling of the code.

(b) According to the general description of the proposed method in the methods section (and its implementation in the borealis package), the betabinomial distribution parameters for a specific CpG site are estimated on all samples of the cohort and reused for the computation of all p-values. In particular, this means that the sample to be tested is also included in the estimation of the model parameters. However, in the experiments for power analysis, the parameters are estimated not using the outlier. We feel the two approaches should be aligned.

3) While the results showing the true positive rate for the detection of synthetic outlier detection are convincing, information on the false positive rate should also be provided.

Minor concerns

Since the proposed method contains a novel (at least for the use in methylation analysis) hypothesis test, we find that a slightly more elaborate presentation including naming hypothesis and the (admittedly simple) test statistic and its distribution would provide more clarity. An additional remark on the choice to include the tested outlier in the normal cohort for parameter estimation could also be helpful.
We found it not immediately obvious how the multiple outliers are used in the procedure to generate Figure 3 (they have identical methylation data; only one test is performed; the remaining outliers are part of the cohort that is used to estimate distribution parameters), especially since the exclusion of the tested outlier while estimating distribution parameters is not consistent with the first description of the method in the paper (see Major concern 2b). A brief description of this analysis in the methods would be helpful for an easier understanding the experiment.
DSS is named as an example for a well-known R package that uses a group-vs-group approach for detecting differential methylation. We feel that more support is needed still for the claim made that all existing software packages follow this approach and borealis now fills this gap (e.g. referencing several other well-known software packages for differential methylation detection and explicitly stating their use of a group-vs-group approach), especially since this point seems central to the paper.
The introduction includes the sentence “An ideal method to detect deviant methylation should offer the ability to profile at single CpG sites while enabling flexibility to consolidate calls across regions.”. However, our understanding is that detection of differentially methylated regions is not addressed in this paper. If DMR detection can indeed be achieved with borealis, an explanation in the paper would help to highlight this. Otherwise however, we think that the functionality of borealis is sufficient as is and a deletion or rephrasing of the sentence would resolve the issue.
The performance of the proposed method has only been evaluated on a synthetic dataset. We think an experiment on real-world data would provide important data for judging the performance of the method in the clinical research use-cases mentioned in the paper.
When evaluating the results of the borealis workflow, p-value adjustment is most likely necessary. A comment on this fact or potentially even an inclusion in borealis functionality would prevent the misconception that the p-values can be intrepreted as is.
The results subsection 'BOREALIS package vignette outline' is quite long and its content does not appear suitable for a method article to us, even though the vignette itself is very informative. Maybe briefly mentioning its existence and purpose would be more adequate?

Questions/Interests

Adding to major concern 1), we would find potential theoretical insights into the properties of the group-vs-group approaches when reduced to group sizes 1 and (l-1) and a comparison to the proposed approach on a theoretical level extremely interesting.
While the currently provided functions of the borealis package are very easy to use, several post-processing steps that are not implemented yet are still necessary for likely standard use-cases (see section 4 of the vignette). Some of these steps are not straightforward (programming-wise), others could represent potential pitfalls (pvalue correction, also see minor concern 3). We feel user-friendliness would greatly benefit from also providing functions in the borealis package for these steps, so that a full data import to interpretation pipeline is available for non-programming-savvy users.
The performance of the proposed method was evaluated for different cohort sizes and average sequencing depths. We would find it very interesting to see how varying the difference between methylation probability of the outlier and in the normal cohort affects the results.

Is the rationale for developing the new method (or application) clearly explained?

Partly
Is the description of the method technically sound?

Partly
Are sufficient details provided to allow replication of the method development and its use by others?

Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

Partly

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: DNA methylation, machine learning, bioinformatics

We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above.


VERSION 1 PUBLISHED 20 Dec 2022


Reviewer Report


24 Oct 2023 | for Version 1

Vicente Yepez, Technical University of Munich, Munich, Germany


Approved With Reservations

Oliver et al. developed an R package to detect outlier methylation from bisulfite sequencing data using a beta-binomial model. The package can be applied in rare disease diagnostics.

I think the method is novel and sound, but I have some concerns:

Major:

It is unclear to me how this article differs from the one described at: https://www.biorxiv.org/content/10.1101/2022.05.19.492700v1.
It is unclear where the data came from.
It is unclear whether other methods to detect outlier methylation from bisulfite sequencing data exist, or whether this is the first study to do so.
It should be described how CpG sites are assigned to genes.
The Results section should describe the following:

- the cohort, including the number of samples and whether they were affected. Details should be included in Methods.

- how the methylation data were counted and the number of features detected. Details should be included in Methods.

- the results of applying BOREALIS to the data, including how many outliers per sample were obtained.

- A quantile-quantile plot of the p-values must be included to verify that they are calibrated.

- Was any multiple testing correction performed? It should be, as I guess there are thousands of sites tested.
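
The calibration and multiple-testing checks requested above can be sketched in base R. This is a minimal illustration rather than BOREALIS output: the values in `p_raw` are made-up placeholders, and Benjamini-Hochberg is assumed as the correction because the report does not name one.

```r
# Hypothetical raw p-values for a handful of CpG sites (illustrative only)
p_raw <- c(0.0001, 0.004, 0.03, 0.2, 0.45, 0.6, 0.75, 0.9)

# Benjamini-Hochberg adjustment across all tested sites
# (assumption: BH is an acceptable choice; other methods are available)
p_bh <- p.adjust(p_raw, method = "BH")

# Quantile-quantile data: if the p-values are calibrated under the null,
# the sorted observed values should track the uniform quantiles i / (m + 1)
m <- length(p_raw)
qq <- data.frame(
  expected = -log10(seq_len(m) / (m + 1)),
  observed = -log10(sort(p_raw))
)
```

Calling `plot(qq$expected, qq$observed)` followed by `abline(0, 1)` would draw the plot; points that rise well above the diagonal across the whole range, not just in the tail, would suggest miscalibrated p-values.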

Minor:

The Abstract could state the minimum number of samples that constitutes a "sufficient cohort size".
To strengthen it, consider adding another citation to the sentence: "DNA sequencing results in the diagnosis of up to 40% of genetic disease cases previously unsolved using standard clinical testing (Sawyer, et al., 2016)".
It should be described how exactly RNA-seq was investigated in the sentence: "RNA-Seq has been investigated and has shown benefit in complementing DNA testing (Cummings, et al., 2017)". Also, consider adding citations from other studies to it.
OUTRIDER by Brechtmann et al., AJHG 2018¹, should be cited here, as it was presumably the first method for outliers in RNA-seq data: "A similar rationale was successfully used by us and others in outlier-based RNA analysis (Jenkinson, et al., 2020)".
Other parameters should be tested in the MC simulations besides mean 0.8 and dispersion 0.1.

Is the rationale for developing the new method (or application) clearly explained? Yes

Is the description of the method technically sound? Yes

Are sufficient details provided to allow replication of the method development and its use by others? Yes

If any results are presented, are all the source data underlying the results available to ensure full reproducibility? No

Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Partly

References

1. Brechtmann F, Mertes C, Matusevičiūtė A, Yépez VA, et al.: OUTRIDER: A Statistical Method for Detecting Aberrantly Expressed Genes in RNA Sequencing Data. Am J Hum Genet. 2018; 103(6): 907–917.

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

RNA-seq based diagnostics of rare diseases, bioinformatics, statistical modeling

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.


Reviewer Report


23 Aug 2023 | for Version 1

Philipp Jurmeister, Institute of Pathology, Ludwig Maximilians University Munich, Munich, Bavaria, Germany

Maximilian Leitheiser, Institute of Pathology, Ludwig Maximilians University Munich, Munich, Bavaria, Germany


Approved With Reservations

Oliver et al. present a novel method for the detection of differentially methylated CpG sites in bisulfite sequencing read count data, together with an implementation thereof in the R software package 'borealis'. Its statistical model is specifically designed to detect a single outlier sample in a cohort of samples with consistent methylation, whereas existing methods for the detection of differential methylation are designed for a group-vs-group approach. The proposed method is an adaptation of the recently published LeafCutterMD package, which applied the same approach to outlier splicing detection. The outlier detection setting for differential methylation does indeed seem of great relevance, specifically in the use case of the detection of rare diseases, which the authors also mention. The performance of the proposed method in the shown analyses of true positive rate across sequencing depths and cohort sizes in a synthetic dataset is in itself convincing. The software package is well documented with detailed function descriptions and an extensive vignette. However, there are several weaknesses which should be addressed to improve the overall quality of the manuscript:

Major concerns

1. The statistical model used here does indeed directly address the described outlier detection setting and its suitability for this task seems plausible. However, we suspect existing software packages implementing a group-vs-group approach could also be applied in this setting (with group sizes 1 and l-1). We would deem a comparison of the performance in the outlier detection setting between borealis and existing software packages necessary to support the claim made about the specific suitability of borealis.
2. (a) While the borealis R package, as obtained from GitHub, worked flawlessly for us, we were not able to run the code for the reproduction of Figure 2. We adjusted ‘Fig2_power_analysis.R’ to only test aveDepth <- c(80, 100), cohSize <- c(90, 100) and with nSamps = 1000. Running the script consistently yielded power = 0 for all parameters. Further inspection revealed that an error occurred in line 32: ‘Error in `[.data.frame`(DatA, , ResCha) : undefined columns selected’, which was caught in exception handling. The scripts for Fig. 3 and 4 showed similar behavior. We used gamlss version 5.4-12 and gamlss.dist version 6.0-5. We are happy to provide more information (e.g. full R session info) if that is considered helpful, or to receive suggestions regarding potential mistakes in our handling of the code.

(b) According to the general description of the proposed method in the methods section (and its implementation in the borealis package), the betabinomial distribution parameters for a specific CpG site are estimated on all samples of the cohort and reused for the computation of all p-values. In particular, this means that the sample to be tested is also included in the estimation of the model parameters. However, in the experiments for power analysis, the parameters are estimated without the outlier. We feel the two approaches should be aligned.
3. While the results showing the true positive rate for the detection of synthetic outliers are convincing, information on the false positive rate should also be provided.
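
The estimation question raised in concern (b) can be made concrete with a small base-R sketch. It is not the borealis implementation: the counts are invented, the beta-binomial is fitted by maximum likelihood with `optim` (an assumption made for illustration), and a lower-tail p-value is assumed as the test of interest. It only shows how including or excluding the tested sample changes the fitted parameters and therefore the p-value.

```r
# Log-pmf of a beta-binomial with shape parameters a and b (base R only)
dbb_log <- function(k, n, a, b) {
  lchoose(n, k) + lbeta(k + a, n - k + b) - lbeta(a, b)
}

# Maximum-likelihood fit; optimising on the log scale keeps a and b positive
fit_bb <- function(k, n) {
  nll <- function(par) -sum(dbb_log(k, n, exp(par[1]), exp(par[2])))
  exp(optim(c(0, 0), nll)$par)
}

# Lower-tail p-value P(K <= k) for a single sample at depth n
p_lower <- function(k, n, shape) {
  sum(exp(dbb_log(0:k, n, shape[1], shape[2])))
}

# Hypothetical methylated read counts at one CpG site, depth 30 per sample
depth   <- 30
cohort  <- c(24, 27, 22, 29, 25, 20, 28, 26, 23, 30,
             21, 27, 25, 28, 24, 26, 19, 29, 27, 25)
outlier <- 5

# Parameters estimated with and without the sample under test
shape_incl <- fit_bb(c(cohort, outlier), rep(depth, length(cohort) + 1))
shape_excl <- fit_bb(cohort, rep(depth, length(cohort)))

p_incl <- p_lower(outlier, depth, shape_incl)
p_excl <- p_lower(outlier, depth, shape_excl)
```

In this toy setting the outlier inflates the fitted dispersion when it is included, so `p_incl` is larger than `p_excl`; a power analysis that excludes the outlier would therefore tend to look more favourable than the packaged procedure that includes it.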

Minor concerns

Since the proposed method contains a novel (at least for the use in methylation analysis) hypothesis test, we find that a slightly more elaborate presentation, including naming the hypotheses and the (admittedly simple) test statistic and its distribution, would provide more clarity. An additional remark on the choice to include the tested outlier in the normal cohort for parameter estimation could also be helpful.
We found it not immediately obvious how the multiple outliers are used in the procedure to generate Figure 3 (they have identical methylation data; only one test is performed; the remaining outliers are part of the cohort that is used to estimate distribution parameters), especially since the exclusion of the tested outlier while estimating distribution parameters is not consistent with the first description of the method in the paper (see Major concern 2b). A brief description of this analysis in the methods would be helpful for easier understanding of the experiment.
DSS is named as an example of a well-known R package that uses a group-vs-group approach for detecting differential methylation. We feel that more support is still needed for the claim that all existing software packages follow this approach and that borealis now fills this gap (e.g. referencing several other well-known software packages for differential methylation detection and explicitly stating their use of a group-vs-group approach), especially since this point seems central to the paper.
The introduction includes the sentence “An ideal method to detect deviant methylation should offer the ability to profile at single CpG sites while enabling flexibility to consolidate calls across regions.” However, our understanding is that detection of differentially methylated regions is not addressed in this paper. If DMR detection can indeed be achieved with borealis, an explanation in the paper would help to highlight this. Otherwise, we think that the functionality of borealis is sufficient as is, and deleting or rephrasing the sentence would resolve the issue.
The performance of the proposed method has only been evaluated on a synthetic dataset. We think an experiment on real-world data would provide important data for judging the performance of the method in the clinical research use-cases mentioned in the paper.
When evaluating the results of the borealis workflow, p-value adjustment is most likely necessary. A comment on this fact, or potentially even its inclusion in borealis functionality, would prevent the misconception that the p-values can be interpreted as is.
The results subsection 'BOREALIS package vignette outline' is quite long and its content does not appear suitable for a method article to us, even though the vignette itself is very informative. Maybe briefly mentioning its existence and purpose would be more adequate?

Questions/Interests

Adding to major concern 1), we would find potential theoretical insights into the properties of the group-vs-group approaches when reduced to group sizes 1 and (l-1) and a comparison to the proposed approach on a theoretical level extremely interesting.
While the currently provided functions of the borealis package are very easy to use, several post-processing steps that are not implemented yet are still necessary for likely standard use-cases (see section 4 of the vignette). Some of these steps are not straightforward (programming-wise), others could represent potential pitfalls (p-value correction, also see minor concern 3). We feel user-friendliness would greatly benefit from also providing functions in the borealis package for these steps, so that a full pipeline from data import to interpretation is available for non-programming-savvy users.
The performance of the proposed method was evaluated for different cohort sizes and average sequencing depths. We would find it very interesting to see how varying the difference between methylation probability of the outlier and in the normal cohort affects the results.
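
A rough simulation of the kind suggested in the last point, which also reports a false positive rate, could look like the sketch below. All settings are assumptions chosen for illustration: depth 30, a cohort mean of 0.8 with dispersion 0.1 (mapped to beta shapes as `mu / sigma` and `(1 - mu) / sigma`), a lower-tail test at a nominal level of 0.05, and p-values computed from the true parameters rather than estimated ones. It is not a reproduction of the authors' power analysis.

```r
set.seed(1)

# Beta-binomial sampler and lower-tail CDF using base R only.
# Assumed parameterisation: shape1 = mu / sigma, shape2 = (1 - mu) / sigma.
rbb <- function(m, n, mu, sigma) {
  rbinom(m, n, rbeta(m, mu / sigma, (1 - mu) / sigma))
}
pbb <- function(k, n, mu, sigma) {
  a <- mu / sigma
  b <- (1 - mu) / sigma
  j <- 0:n
  cdf <- cumsum(exp(lchoose(n, j) + lbeta(j + a, n - j + b) - lbeta(a, b)))
  pmin(cdf[k + 1], 1)  # P(K <= k), vectorised over k
}

depth <- 30
mu    <- 0.8
sigma <- 0.1
alpha <- 0.05
n_sim <- 10000

# False positive rate: test samples drawn from the cohort distribution itself
null_p <- pbb(rbb(n_sim, depth, mu, sigma), depth, mu, sigma)
fpr <- mean(null_p <= alpha)

# Power as the outlier's methylation probability moves away from mu
outlier_prob <- c(0.6, 0.4, 0.2)
power <- sapply(outlier_prob, function(p) {
  mean(pbb(rbinom(n_sim, depth, p), depth, mu, sigma) <= alpha)
})
```

Because the counts are discrete, the test is conservative here, so `fpr` should not exceed the nominal level by more than simulation noise, while `power` grows as the outlier's methylation probability departs further from the cohort mean.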

Is the rationale for developing the new method (or application) clearly explained? Partly

Is the description of the method technically sound? Partly

Are sufficient details provided to allow replication of the method development and its use by others? Yes

If any results are presented, are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Partly

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

DNA methylation, machine learning, bioinformatics

We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above.


[1] Bui C, et al.: cAMP response element-binding (CREB) recruitment following a specific CpG demethylation leads to the elevated expression of the matrix metalloproteinase 13 in human articular chondrocytes and osteoarthritis. FASEB J. 2012; 26(7): 3000–3011.

[2] Choi NY, et al.: Novel imprinted single CpG sites found by global DNA methylation analysis in human parthenogenetic induced pluripotent stem cells. Epigenetics. 2018; 13(4): 343–351.

[3] Chuang Y-H, et al.: Parkinson’s disease is associated with DNA methylation levels in human blood and saliva. Genome Med. 2017; 9(1): 76.

[4] Claus R, et al.: Quantitative DNA Methylation Analysis Identifies a Single CpG Dinucleotide Important for ZAP-70 Expression and Predictive of Prognosis in Chronic Lymphocytic Leukemia. J. Clin. Oncol. 2012; 30(20): 2483–2491.

[5] Cummings BB, et al.: Improving genetic diagnosis in Mendelian disease with transcriptome sequencing. Sci. Transl. Med. 2017; 9(386).

[6] Feng H, Wu H: Differential methylation analysis for bisulfite sequencing using DSS. Quant Biol. 2019; 7(4): 327–334.

[7] Fürst RW, et al.: A differentially methylated single CpG-site is correlated with estrogen receptor alpha transcription. J. Steroid Biochem. Mol. Biol. 2012; 130(1-2): 96–104.

[8] Guastafierro T, et al.: Genome-wide DNA methylation analysis in blood cells from patients with Werner syndrome. Clin. Epigenetics. 2017; 9(1): 92.

[9] Hashimoto K, et al.: Regulated transcription of human matrix metalloproteinase 13 (MMP13) and interleukin-1β (IL1B) genes in chondrocytes depends on methylation of specific proximal promoter CpG sites. J. Biol. Chem. 2013; 288(14): 10061–10072.

[10] Jenkinson G, et al.: LeafCutterMD: an algorithm for outlier splicing detection in rare diseases. Bioinformatics. 2020; 36(17): 4609–4615.

[11] Nile CJ, et al.: Methylation status of a single CpG site in the IL6 promoter is related to IL6 messenger RNA levels and rheumatoid arthritis. Arthritis & Rheumatism. 2008; 58(9): 2686–2693.

[12] Oliver GR, Jenkinson WG, Klee EW: BOREALIS: an R/Bioconductor package to detect outlier methylation from bisulfite sequencing data (3.15). Zenodo. 2022a.

[13] Oliver GR, Jenkinson WG, Klee EW: BOREALIS Power Analysis Code and Data (1.0). [Data set]. Zenodo. 2022b.

[14] Park Y, Wu H: Differential methylation analysis for BS-seq data under general experimental design. Bioinformatics. 2016; 32(10): 1446–1453.

[15] Pogribny IP, et al.: Single-site methylation within the p53 promoter region reduces gene expression in a reporter gene construct: possible in vivo relevance during tumorigenesis. Cancer Res. 2000; 60(3): 588–594.

[16] Sawyer SL, et al.: Utility of whole-exome sequencing for those near the end of the diagnostic odyssey: time to address gaps in care. Clin. Genet. 2016; 89(3): 275–284.

[17] Scantamburlo G, et al.: Interleukin-4 Induces CpG Site-Specific Demethylation of the Pendrin Promoter in Primary Human Bronchial Epithelial Cells. Cell. Physiol. Biochem. 2017; 41(4): 1491–1502.

[18] Sharp GC, et al.: Distinct DNA methylation profiles in subtypes of orofacial cleft. Clin. Epigenetics. 2017; 9(1): 63.

[19] Sobreira N, et al.: Patients with a Kabuki syndrome phenotype demonstrate DNA methylation abnormalities. Eur. J. Hum. Genet. 2017; 25(12): 1335–1344.

[20] Sohn BH, et al.: Functional switching of TGF-beta1 signaling in liver cancer via epigenetic modulation of a single CpG site in TTP promoter. Gastroenterology. 2010; 138(5): 1898–1908.e12.

[21] Takahashi A, et al.: DNA methylation of the RUNX2 P1 promoter mediates MMP13 transcription in chondrocytes. Sci. Rep. 2017; 7(1): 7771.

[22] Wreczycka K, et al.: Strategies for analyzing bisulfite sequencing data. J. Biotechnol. 2017; 261: 105–115.

[23] Wu H, et al.: Detection of differentially methylated regions from whole-genome bisulfite sequencing data without replicates. Nucleic Acids Res. 2015; 43(21): e141.
