Keywords
DNA methylation, epigenetics, bisulfite sequencing, WGBS, RRBS, software, DMRs, integrative analysis
DNA methylation, epigenetics, bisulfite sequencing, WGBS, RRBS, software, DMRs, integrative analysis
High-throughput DNA methylation sequencing protocols, such as whole-genome bisulfite sequencing (WGBS) and targeted bisulfite sequencing (e. g., RRBS), have made it possible to precisely and accurately measure this major epigenetic modification on a genome wide scale. The impact of DNA methylation on processes, such as cell differentiation, gene expression, chromatin structure, and cancerogenesis, has raised substantial interest in analyzing DNA methylation in many sectors of life sciences. For example, methylomes of a large number of samples have been sequenced in the context of cancer projects and developmental studies1–5. Also researchers investigating obesity, neurodegenerative diseases, Alzheimer’s, or Parkinson’s disease, have begun to focus on DNA methylation6–9.
A number of time consuming data analysis steps are required in virtually all these projects, i. e., quality control, read alignment, and methylation rate calculation. However, performing each step by hand is highly error prone, takes time, and impacts reproducibility. To ensure a consistent and reproducible processing, we have developed the Bisulfite Analysis Tooklit BAT. The workflow enables a fast and easy analysis of bisulfite converted high-throughput sequencing reads. It is specifically designed to facilitate the analysis for biologists and physicians with little bioinformatic knowledge, as well as for bioinformaticians that already work on sequencing data, but are not familiar with the characteristics of bisulfite sequencing data.
BAT is a modular toolkit allowing to easily generate workflows to analyze bisulfite sequencing data. The toolkit includes modules for read alignment (mapping module), methylation level estimation (calling module), grouping of samples (grouping module) and identification of differentially methylated regions (DMR module) (Figure 1). Further modules allow the integration of gene expression, histone modification data, or transcription factor binding site annotation. These modules facilitate the functional analysis of the effects of differential methylation.
It comprises four modules covering (left to right) read alignment, methylation rate calling, basic group analysis, and DMR calling. The modules consist of a collection of scripts that build up on one another, but easily single steps can be covered by alternative tools.
Each of the modules can be run on its own, and the minimal system requirements depend on the respective module. The computational most expensive module is the mapping module. Here, the aligner segemehl10 in its bisulfite mode is used, which requires about 55 GB physical RAM for the alignment of reads to the human genome hg19.
The toolkit itself is written in Perl and calls software components mainly written in C and R to ensure swift calculations. All software requirements are listed on our website (www.bioinf.uni-leipzig.de/Software/BAT/install/#requirements). The default parameters for the tools included in the BAT pipeline are optimized to process bisulfite sequencing data for most applications. In order to enhance reproducibility and reduce potential errors, the number of parameters that need to be set by the user has been carefully reduced to a minimum. Due to its modularity, however, the toolkit is flexible and can easily be extended or customized to specific needs. To allow for workflow modifications and extensions, standardized formats are used and interfaces to several other tools are provided. Basic steps, e.g., processing from raw reads to a single alignment file from multiple sequencing runs, is split into its pre-, post-, and main processing steps, to allow for the customized extension of the workflow. Error handling is eased by parameter and file checks prior to the analysis, and meaningful error messages allow a quick trouble shooting.
A detailed documentation of all modules, including parameter description, recommended additional tools, analysis reports, and data visualizations produced by the BAT workflow are summarized on www.bioinf.uni-leipzig.de/Software/BAT. Moreover, all automatically created visualizations are shown on the webpage. Data and figures displayed there are derived from a small example data set of two groups with four samples each, adopted from Kretzmer et al11. Our webpage provides raw FASTQ files of one sample as well as the methylation rate files of all eight samples along with expression and annotation data. This example data set and shell scripts covering all modules of BAT can be downloaded and adapted together with the toolkit.
Furthermore, BAT is provided as Docker12 image and can be obtained from https://hub.docker.com/r/christianbioinf/bat/. The Docker images ensure platform independent usage of our toolkit. All programs that are used by BAT are already installed in the Docker image and dependencies are resolved. Existing hard drives are mounted to avoid time consuming translocation or upload of the frequently huge data.
Resembling a common study design, we assembled a small case-control example dataset, adopted from recently published data11. It is a subset of a paired-end human WGBS dataset, comprising 8 samples (control: S1–S4, case: S5–S8). It comprises the raw reads in FASTQ format of one sample and the already called methylation rates of all 8 samples in VCF format. The following modules can now be used to process and analyze bisulfite sequencing data including detection of methylation differences between case and control samples. The use case starts with the alignment of the raw sequencing data using the mapping module. The single components of BAT and their functionality are described in the following:
The read alignment step is taken care of by the module BAT_mapping. It includes a bisulfite-sensitive read alignment using segemehl10, a quality filtering step, and the conversion of the alignments to an indexed and compressed BAM file by samtools13. Using BAT_mapping_stat, the quality of the mapping can be assessed by the number and fraction of mapped pairs or reads, the multiplicity of read alignments, and the alignments’ error rates. In case of large experiments where a sample is sequenced multiplexed on multiple lanes or flow cells, the read alignments of each sample can easily be merged using BAT_merging, including the addition of read group information to allow for tracebacks of lane effects if necessary.
Following mapping, the methylation information needs to be extracted from the read alignments. Prior to this methylation calling it is, however, recommended to exclude potential biases by clipping alignment overlaps of paired-end reads (e.g. using bamutil’s clipoverlap14) or by excluding incompletely converted or artificially introduced cytosines with the M-bias detection method (e.g. using BSeQC15). Subsequently, the methylation information can be extracted using the module BAT_calling, which returns a VCF-style file that includes detailed information for each cytosine. This initial set of positions can be filtered by coverage using BAT_filtering to exclude unreliable methylation information from either lowly covered or very highly covered positions (e.g. in repetitive regions). Moreover, it is also possible to filter by genomic context (e.g., to restrict to CG context only). Apart from a VCF file, BAT_filtering reports the methylation level at positions passing the filter in bedGraph format for easy inspection in IGV16 or upload to the UCSC genome browser17. Additionally, the module automatically produces plots showing the distribution of coverages and methylation rates for the complete and the filtered set of positions (Figure 2A), giving the user the opportunity to check and possibly fine tune the filtering parameters.
Annotation items are ENCODE transcription factor binding sites for GM12878 cell line. A) Distribution of coverage. B) Circos plot showing the genome-wide methylation level of eight samples as heatmap. C) Binned distribution of average methylation rate per CpG for each group. D) Boxplots of genome-wide mean methylation rate per group. E) Hierarchical clustered heatmap of the methylation rates of all samples over all annotation items. F) Boxplots of average methylation rate per annotation item. G) Correlating DMR plot shows methylation and expression of a DMR - gene pair. Note that all figures were produced by BAT itself, but were minorly post-edited to fit the limited space.
The third module now facilitates the transition from single sample analysis to groups of multiple samples. First, methylation information from individual samples is combined to groups and summarized with BAT_summarize. It reports the mean methylation rate per group and position as well as difference of the group’s mean methylation rate per position. The summary module can be parameterized to only report positions where each group has a minimum number of samples with sufficient coverage. For convenience, all files are exported in both bedGraph and bigWig format for inspection in UCSC genome browser or IGV. Moreover, a circos plot containing a genome-wide methylation rate heatmap for each sample is automatically produced (Figure 2B). Based on the summary files, a number of overview statistics and plots can be generated using BAT_overview. This includes a hierarchical clustering of the samples based on their methylation profile, a plot of binned mean methylation rates per group (Figure 2C), boxplots of group-wise mean methylation rates (Figure 2D), a smoothed scatterplot showing the correlation between the groups’ mean methylation rate per position, and a barplot of the distribution of group methylation differences. Subsequently, BAT_annotation can be used to inspect the methylation of the samples in regions of interest or annotations such as transcription factor binding sites (TFBS), CpG islands, shores, or promoter regions. Therefore, a hierarchically clustered heatmap of all samples (Figure 2E), is produced and the per-group and per-sample mean methylation rate is calculated (Figure 2F).
Finally, the fourth module features the identification and analysis of differentially methylated regions (DMRs) between groups (BAT_DMRcalling). It employs the DMR calling tool metilene18 which is based on circular binary segmentation of the group methylation difference signal in conjunction with a two-dimensional non-parametric statistical test. Afterwards, the DMRs reported by metilene can be filtered by several criteria, e.g., length (in nt or number of Cs), significance (i.e., q-value), and minimum mean methylation difference, and then converted to BED/bedGraph format. The BED file contains unique identifiers per DMR and reports regions of hyper/hypo methylation. Additionally, the bedGraph file can be used to display the mean group methylation difference of the DMRs. Moreover, BAT_DMRcalling produces overview statistics of the set of filtered DMRs including a histogram of the length and methylation difference of the filtered DMRs, a correlation plot of the mean methylation rate of DMRs in both groups and a plot of the methylation difference vs. the q-value for each DMR. Last but not least, BAT_correlating allows for integration of the DMRs with expression data. Given the methylation information, an expression value of genes, and an association between DMRs and genes, the correlation between both types of data can be examined in order to find correlating DMRs (cDMRs). For each DMR-gene pair, a linear and non-linear correlation coefficient is calculated and a correlation plot (Figure 2G), showing methylation and expression of each sample, is generated.
BAT has already successfully been applied in the framework of a large cancer genome study, the ICGC MMML-Seq11. The streamlined processing and analysis modules improve and accelerate the analysis by reducing hands on time and user errors. The modularity of BAT, as well as its input and output formats, allow to easily extend or customize the default workflows. For instance, it is possible to easily integrate tools such as BisSNP19 or BS-Snper20 or DMR calling tools.
The custom visualizations of the methylation data facilitate data mining and allow to inspect the data quality at each step of the analysis. This is necessary to increase the chance of an early detection of errors, e.g., in library preparation and data handling. Therefore, quality control statistics and graphics are produced continually throughout the entire pipeline.
Taken together, BAT is a collection of modular steps for analyzing bisulfite sequencing data that (i) can easily be run on various platforms due to the virtualization via Docker, (ii) can be combined with or extended by other tools, (iii) automatically generates publication-ready graphics, and (iv) supports data integration, e.g., annotation or gene expression data.
Software available from: www.bioinf.uni-leipzig.de/Software/BAT/download
Source code available from: https://github.com/helenebioinf/BAT
Archived source code as at time of publication: http://doi.org/10.5281/zenodo.83820021.
License: MIT
Example data available from: www.bioinf.uni-leipzig.de/Software/BAT/download/#example_data
This research was supported by the German BMBF (ICGC MMML-Seq 01KU1002A-J, and ICGC-Data Mining 01KU1505-C and G) the European Union in the framework of the BLUEPRINT Project (HEALTH-F5-2011-282510) and LIFE (Leipzig Research Center for Civilization Diseases), Leipzig University. LIFE is funded by the European Union, by the European Regional Development Fund (ERDF), the European Social Fund (ESF) and by the Free State of Saxony within the excellence initiative. We acknowledge support from the German Research Foundation (DFG) and University of Leipzig within the program of Open Access Publishing.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
We thank Stephan H. Bernhart for helpful discussion and proof reading. We acknowledge support from the German Research Foundation (DFG) and Universität Leipzig within the program of Open Access Publishing.
Views | Downloads | |
---|---|---|
F1000Research | - | - |
PubMed Central
Data from PMC are received and updated monthly.
|
- | - |
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Partly
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Epigenetics, Cancer genomics, Software development
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Yes
References
1. Kretzmer H, Bernhart SH, Wang W, Haake A, et al.: DNA methylome analysis in Burkitt and follicular lymphomas identifies differentially methylated regions linked to somatic mutation and transcriptional control.Nat Genet. 2015; 47 (11): 1316-1325 PubMed Abstract | Publisher Full TextCompeting Interests: No competing interests were disclosed.
Is the rationale for developing the new software tool clearly explained?
Partly
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Partly
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Bioinformatics, evolution and development
Alongside their report, reviewers assign a status to the article:
Invited Reviewers | |||
---|---|---|---|
1 | 2 | 3 | |
Version 1 16 Aug 17 |
read | read | read |
Provide sufficient details of any financial or non-financial competing interests to enable users to assess whether your comments might lead a reasonable person to question your impartiality. Consider the following examples, but note that this is not an exhaustive list:
Sign up for content alerts and receive a weekly or monthly email with all newly published articles
Already registered? Sign in
The email address should be the one you originally registered with F1000.
You registered with F1000 via Google, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Google account password, please click here.
You registered with F1000 via Facebook, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Facebook account password, please click here.
If your email address is registered with us, we will email you instructions to reset your password.
If you think you should have received this email but it has not arrived, please check your spam filters and/or contact for further assistance.
Comments on this article Comments (0)