A cross-package Bioconductor workflow for analysing methylation array data

Methylation in the human genome is known to be associated with development and disease. The Illumina Infinium methylation arrays are by far the most common way to interrogate methylation across the human genome. This paper provides a Bioconductor workflow using multiple packages for the analysis of methylation array data. Specifically, we demonstrate the steps involved in a typical differential methylation analysis pipeline including: quality control, filtering, normalization, data exploration and statistical testing for probe-wise differential methylation. We further outline other analyses such as differential methylation of regions, differential variability analysis, estimating cell type composition and gene ontology testing. Finally, we provide some examples of how to visualise methylation array data.

This article is included in the Bioconductor channel. v3 Introduction DNA methylation, the addition of a methyl group to a CG dinucleotide of the DNA, is the most extensively studied epigenetic mark due to its role in both development and disease (Bird, 2002;Laird, 2003). Although DNA methylation can be measured in several ways, the epigenetics community has enthusiastically embraced the Illumina HumanMeth-ylation450 (450k) array (Bibikova et al., 2011) as a cost-effective way to assay methylation across the human genome. More recently, Illumina has increased the genomic coverage of the platform to >850,000 sites with the release of their MethylationEPIC (850k) array. As methylation arrays are likely to remain popular for measuring methylation for the foreseeable future, it is necessary to provide robust workflows for methylation array analysis.
Measurement of DNA methylation by Infinium technology (Infinium I) was first employed by Illumina on the HumanMethylation27 (27k) array (Bibikova et al., 2009), which measured methylation at approximately 27,000 CpGs, primarily in gene promoters. Like bisulfite sequencing, the Infinium assay detects methylation status at single base resolution. However, due to its relatively limited coverage the array platform was not truly considered "genome-wide" until the arrival of the 450k array. The 450k array increased the genomic coverage of the platform to over 450,000 gene-centric sites by combining the original Infinium I assay with the novel Infinium II probes. Both assay types employ 50bp probes that query a [C/T] polymorphism created by bisulfite conversion of unmethylated cytosines in the genome, however, the Infinium I and II assays differ in the number of beads required to detect methylation at a single locus. Infinium I uses two bead types per CpG, one for each of the methylated and unmethylated states ( Figure 1a). In contrast, the Infinium II design uses one bead type and the methylated state is determined at the single base extension step after hybridization ( Figure 1b). The 850k array also uses a combination of the Infinium I and II assays but achieves additional coverage by increasing the size of each array; a 450k slide contains 12 arrays whilst the 850k has only 8.
Regardless of the Illumina array version, for each CpG, there are two measurements: a methylated intensity (denoted by M) and an unmethylated intensity (denoted by U). These intensity values can be used to determine the proportion of methylation at each CpG locus. Methylation levels are commonly reported as either beta values (β = M/(M + U)) or M-values (M value = log2(M/U)). For practical purposes, a small offset, α, can be added to the denominator of the β value equation to avoid dividing by small values, which is the default behaviour of the getBeta function in minfi. The default value for α is 100. It may also be desirable to add a small offset to the numerator and denominator when calculating M-values to avoid dividing by zero in rare cases, however the default getM function in minfi does not do this. Beta values and M-values are related through a logit transformation. Beta values are generally preferable for describing the level of methylation at a locus or for graphical presentation because percentage methylation is easily interpretable. However, due to their distributional properties, M-values are more appropriate for statistical testing (Du et al., 2010).
In this workflow, we will provide examples of the steps involved in analysing methylation array data using R (R Core Team, 2014) and Bioconductor (Huber et al., 2015), including: quality control, filtering, normalisation, data exploration and probe-wise differential methylation analysis. We will also cover other approaches such as differential methylation analysis of regions, differential variability analysis, gene ontology analysis and estimating cell type composition. Finally, we will provide some examples of useful ways to visualise methylation array data.
# the URL for the data download url <-"https://ndownloader.figshare.com/files/7896205" # download the data if(!file.exists("methylAnalysisDataV3.tar.gz")){ download.file(url, destfile="methylAnalysisDataV3.tar.gz", method="auto") } # extract the data if(!file.exists("./data")){ untar("methylAnalysisDataV3.tar.gz", exdir=".", compressed="gzip") } Once the data has been downloaded and extracted, there should be a folder called data that contains all the files necessary to execute the workflow. Each individual CpG is interrogated using two bead types: methylated (M) and unmethylated (U). Both bead types will incorporate the same labeled nucleotide for the same target CpG, thereby producing the same color fluorescence. The nucleotide that is added is determined by the base downstream of the "C" of the target CpG. The proportion of methylation can be calculated by comparing the intensities from the two different probes in the same color. (b) Infinium II assay. Each target CpG is interrogated using a single bead type. Methylation state is detected by single base extension at the position of the "C" of the target CpG, which always results in the addition of a labeled "G" or "A" nucleotide, complementary to either the "methylated" C or "unmethylated" T, respectively. Each locus is detected in two colors, and methylation status is determined by comparing the two colors from the one position.
To demonstrate the various aspects of analysing methylation data, we will be using a small, publicly available 450k methylation dataset (GSE49667) (Zhang et al., 2013). The dataset contains 10 samples in total: there are 4 different sorted T-cell types (naive, rTreg, act_naive, act_rTreg, collected from 3 different individuals (M28, M29, M30). For details describing sample collection and preparation, see Zhang et al. (2013). An additional birth sample (individual VICS-72098-18-B) is included from another study (GSE51180) (Cruickshank et al., 2013) to illustrate approaches for identifying and excluding poor quality samples.
There are several R Bioconductor packages available that have been developed for analysing methylation array data, including minfi (Aryee et al., 2014), missMethyl (Phipson et al., 2016), wateRmelon (Pidsley et al., 2013), methylumi (Davis et al., 2015), ChAMP (Morris et al., 2014) and charm (Aryee et al., 2011). Some of the packages, such as minfi and methylumi include a framework for reading in the raw data from IDAT files and various specialised objects for storing and manipulating the data throughout the course of an analysis. Other packages provide specialised analysis methods for normalisation and statistical testing that rely on either minfi or methylumi objects. It is possible to convert between minfi and methylumi data types, however, this is not always trivial. Thus, it is advisable to consider the methods that you are interested in using and the data types that are most appropriate before you begin your analysis. Another popular method for analysing methylation array data is limma (Ritchie et al., 2015), which was originally developed for gene expression microarray analysis. As limma operates on a matrix of values, it is easily applied to any data that can be converted to a matrix in R. For a complete list of Bioconductor packages for analysing DNA methylation data, one can search for "DNAMethylation" in BiocViews (https://www.bioconductor.org/packages/ release/BiocViews.html#___DNAMethylation) on the Bioconductor website.
We will begin with an example of a probe-wise differential methylation analysis using minfi and limma. By probe-wise analysis we mean each individual CpG probe will be tested for differential methylation for the comparisons of interest and p-values and moderated t-statistics (Smyth, 2004) will be generated for each CpG probe.
Loading the data It is useful to begin an analysis in R by loading all the packages that are likely to be required.
# load packages required for analysis library(limma) library(minfi) library(IlluminaHumanMethylation450kanno.ilmn12.hg19) library(IlluminaHumanMethylation450kmanifest) library(RColorBrewer) library(missMethyl) library(matrixStats) library(minfiData) library(Gviz) library(DMRcate) library(stringr) The minfi, IlluminaHumanMethylation450kanno.ilmn12.hg19, IlluminaHumanMethylation450kmanifest, missMethyl, minfiData and DMRcate are methylation specific packages, while RColorBrewer and Gviz are visualisation packages. We use limma for testing differential methylation, and matrixStats and stringr have functions used in the workflow. The IlluminaHumanMethylation450kmanifest package provides the Illumina manifest as an R object which can easily be loaded into the environment. The manifest contains all of the annotation information for each of the CpG probes on the 450k array. This is useful for determining where any differentially methylated probes are located in a genomic context. As for their many other BeadArray platforms, Illumina methylation data is usually obtained in the form of Intensity Data (IDAT) Files. This is a proprietary format that is output by the scanner and stores summary intensities for each probe on the array. However, there are Bioconductor packages available that facilitate the import of data from IDAT files into R (Smith et al., 2013). Typically, each IDAT file is approximately 8MB in size. The simplest way to import the raw methylation data into R is using the minfi function read.metharray.sheet, along with the path to the IDAT files and a sample sheet. The sample sheet is a CSV (comma-separated) file containing one line per sample, with a number of columns describing each sample. The format expected by the read.metharray.sheet function is based on the sample sheet file that usually accompanies Illumina methylation array data. It is also very similar to the targets file described by the limma package. Importing the sample sheet into R creates a data.frame with one row for each sample and several columns. The read.metharray.sheet function uses the specified path and other information from the sample sheet to create a column called Basename which specifies the location of each individual IDAT file in the experiment.

Quality control
Once the data has been imported into R, we can evaluate its quality. Firstly, we need to calculate detection p-values. We can generate a detection p-value for every CpG in every sample, which is indicative of the quality of the signal. The method used by minfi to calculate detection p-values compares the total signal (M + U) for each probe to the background signal level, which is estimated from the negative control probes. Very small p-values are indicative of a reliable signal whilst large p-values, for example >0.01, generally indicate a poor quality signal.
Plotting the mean detection p-value for each sample allows us to gauge the general quality of the samples in terms of the overall signal reliability ( Figure 2). Samples that have many failed probes will have relatively large mean detection p-values.
qcReport(rgSet, sampNames=targets$ID, sampGroups=targets$Sample_Group, pdf="qcReport.pdf") Poor quality samples can be easily excluded from the analysis using a detection p-value cutoff, for example >0.05. For this particular dataset, the birth sample shows a very high mean detection p-value, and hence it is excluded from subsequent analysis (Figure 2

Normalisation
To minimise the unwanted variation within and between samples, various data normalisations can be applied. Many different types of normalisation have been developed for methylation arrays and it is beyond the scope of this workflow to compare and contrast all of them (Fortin et al., 2014;Maksimovic et al., 2012;Mancuso et al., 2011;Pidsley et al., 2013;Sun et al., 2011;Teschendorff et al., 2013;Touleimat & Tost, 2012;Triche et al., 2013;Wang et al., 2012;Wu et al., 2014). Several methods have been built into minfi and can be directly applied within its framework (Fortin et al., 2014;Maksimovic et al., 2012;Triche et al., 2013;Touleimat & Tost, 2012), whilst others are methylumi-specific or require custom data types (Mancuso et al., 2011;Pidsley et al., 2013;Sun et al., 2011;Teschendorff et al., 2013;Wang et al., 2012;Wu et al., 2014). Although there is no single normalisation method that is universally considered best, a recent study by Fortin et al. (2014) has suggested that a good rule of thumb within the minfi framework is that the preprocessFunnorm (Fortin et al., 2014) function is most appropriate for datasets with global methylation differences such as cancer/normal or vastly different tissue types, whilst the preprocessQuantile function (Touleimat & Tost, 2012) is more suited for datasets where you do not expect global differences between your samples, for example a single tissue. Further discussion on appropriate choice of normalisation can be found in (Hicks & Irizarry, 2015), and the accompanying quantro package includes data-driven tests for the assumptions of quantile normalisation. As we are comparing different blood cell types, which are globally relatively similar, we will apply the preprocessQuantile method to our data ( Figure 3). This function implements a stratified quantile normalisation procedure which is applied to the methylated and unmethylated signal intensities separately, and takes into account the different probe types. Note that after normalisation, the data is housed in a GenomicRatioSet object. This is a much more compact representation of the data as the colour channel information has been discarded and the M and U intensity information has been converted to M-values and beta values, together with associated genomic coordinates. Note, running the preprocessQuantile function on this dataset produces the warning: 'An inconsistency was encountered while determining sex'; this can be ignored as it is due to all the samples being from male donors. # visualise what the data looks like before and after normalisation par(mfrow=c(1,2)) densityPlot(rgSet, sampGroups=targets$Sample_Group,main="Raw", legend=FALSE) legend("top", legend = levels(factor(targets$Sample_Group)), text.col=brewer.pal(8,"Dark2")) densityPlot(getBeta(mSetSq), sampGroups=targets$Sample_Group, main="Normalized", legend=FALSE) legend("top", legend = levels(factor(targets$Sample_Group)), text.col=brewer.pal(8,"Dark2"))

Data exploration
Multi-dimensional scaling (MDS) plots are excellent for visualising data, and are usually some of the first plots that should be made when exploring the data. MDS plots are based on principal components analysis and are an unsupervised method for looking at the similarities and differences between the various samples. Samples that are more similar to each other should cluster together, and samples that are very different should be further apart on the plot. Dimension one (or principal component one) captures the greatest source of variation in the data, dimension two captures the second greatest source of variation in the data and so on. Colouring the data points or labels by known factors of interest can often highlight exactly what the greatest sources of variation are in the data. It is also possible to use MDS plots to decipher sample mix-ups.

Filtering
Poor performing probes are generally filtered out prior to differential methylation analysis. As the signal from these probes is unreliable, by removing them we perform fewer statistical tests and thus incur a reduced multiple testing penalty. We filter out probes that have failed in one or more samples based on detection p-value.
# ensure probes are in the same order in the mSetSq and detP objects detP <-detP[match(featureNames(mSetSq),rownames(detP)),] # remove any probes that have failed in one or more samples keep <-rowSums(detP < 0.01) == ncol(mSetSq)   Depending on the nature of your samples and your biological question you may also choose to filter out the probes from the X and Y chromosomes or probes that are known to have common SNPs at the CpG site. As the samples in this dataset were all derived from male donors, we will not be removing the sex chromosome probes as part of this analysis, however example code is provided below. A different dataset, which contains both male and female samples, is used to demonstrate a Differential Variability analysis and provides an example of when sex chromosome removal is necessary ( Figure 13).

# if your data includes males and females, remove probes on the sex chromosomes
There is a function in minfi that provides a simple interface for the removal of probes where common SNPs may affect the CpG. You can either remove all probes affected by SNPs (default), or only those with minor allele frequencies greater than a specified value. We will also filter out probes that have shown to be cross-reactive, that is, probes that have been demonstrated to map to multiple places in the genome. This list was originally published by Chen et al. (2013) and can be obtained from the authors' website.
We are interested in pairwise comparisons between the four cell types, taking into account individual to individual variation. We perform this analysis on the matrix of M-values in limma, obtaining moderated t-statistics and associated p-values for each CpG site. A convenient way to set up the model when the user has many comparisons of interest that they would like to test is to use a contrasts matrix in conjunction with the design matrix. A contrasts matrix will take linear combinations of the columns of the design matrix corresponding to the comparisons of interest.
Since we are performing hundreds of thousands of hypothesis tests, we need to adjust the p-values for multiple testing. A common procedure for assessing how statistically significant a change in mean levels is between two groups when a very large number of tests is being performed is to assign a cut-off on the false discovery rate (Benjamini & Hochberg, 1995), rather than on the unadjusted p-value. Typically 5% FDR is used, and this is interpreted as the researcher willing to accept that from the list of significant differentially methylated CpG sites, 5% will be false discoveries. If the p-values are not adjusted for multiple testing, the number of false discoveries will be unacceptably high. For this dataset, assuming a Type I error rate of 5%, we would expect to see 0.05*439918=21996 statistical significant results for a given comparison, even if there were truly no differentially methylated CpG sites. Based on a false discovery rate of 5%, there are 3021 significantly differentially methylated CpGs in the naïve vs rTreg comparison, while rTreg vs act_rTreg doesn't show any significant differential methylation.
# this is the factor of interest cellType <-factor(targets$Sample_Group) # this is the individual effect that we need to account for individual <-factor(targets$Sample_Source) # use the above to create a design matrix design <-model.matrix(~0+cellType+individual, data=targets) colnames(design) <-c(levels(cellType),levels (individual) We can extract the tables of differentially expressed CpGs for each comparison, ordered by B-statistic by default, using the topTable function in limma. The B-statistic is the log-odds of differential methylation, first published by Lonnstedt and Speed (Lonnstedt & Speed, 2002). To order by p-value, the user can specify sort.by="p"; and in most cases, the ordering based on the p-value and ordering based on the B-statistic will be identical. The results of the analysis for the first comparison, naive vs. rTreg, can be saved as a data.frame by setting coef=1. The coef parameter explicitly refers to the column in the contrasts matrix which corresponds to the comparison of interest. The resulting data.frame can easily be written to a CSV file, which can be opened in Excel.
write.table(DMPs, file="DMPs.csv", sep=",", row.names=FALSE) It is always useful to plot sample-wise methylation levels for the top differentially methylated CpG sites to quickly ensure the results make sense (Figure 9). If the plots do not look as expected, it is usually an indication of an error in the code, or in setting up the design matrix. It is easier to interpret methylation levels on the beta value scale, so although the analysis is performed on the M-value scale, we visualise data on the beta value scale. The plotCpg function in minfi is a convenient way to plot the sample-wise beta values stratified by the grouping variable.

Differential methylation analysis of regions
Although performing a probe-wise analysis is useful and informative, sometimes we are interested in knowing whether several proximal CpGs are concordantly differentially methylated, that is, we want to identify differentially methylated regions. There are several Bioconductor packages that have functions for identifying differentially methylated regions from 450k data. Some of the most popular are the dmrFind function in the charm package, which has been somewhat superseded for 450k arrays by the bumphunter function in minfi (Aryee et al., 2014;Jaffe et al., 2012), and, the recently published dmrcate in the DMRcate package (Peters et al., 2015). They are each based on different statistical methods. In our experience, the bumphunter and dmrFind functions can be somewhat slow to run unless you have the computer infrastructure to parallelise them, as they use permutations to assign significance. In this workflow, we will perform an analysis using the dmrcate. As it is based on limma, we can directly use the design and contMatrix we previously defined.
Firstly, our matrix of M-values is annotated with the relevant information about the probes such as their genomic position, gene annotation, etc. By default, this is done using the ilmn12.hg19 annotation, but this can be substituted for any argument compatible with the interface provided by the minfi package. The limma pipeline is then used for differential methylation analysis to calculate moderated t-statistics. myAnnotation <-cpg.annotate(object = mVals, datatype = "array", what = "M", analysis.type = "differential", design = design, contrasts = TRUE, cont.matrix = contMatrix, coef = "naive -rTreg", arraytype = "450K") ## Your contrast returned 3021 individually significant probes. ## We recommend the default setting of pcutoff in dmrcate().   As for the probe-wise analysis, it is advisable to visualise the results to ensure that they make sense. The regions can easily be viewed using the DMR.plot function provided in the DMRcate package ( Figure 10).

Customising visualisations of methylation data
The Gviz package offers powerful functionality for plotting methylation data in its genomic context. The package vignette is very extensive and covers the various types of plots that can be produced using the Gviz framework. We will plot one of the differentially methylated regions from the DMRcate analysis to demonstrate the type of visualisations that can be created ( Figure 11).
We will first set up the genomic region we would like to plot by extracting the genomic coordinates of the top differentially methylated region.

Additional analyses
Gene ontology testing Once you have performed a differential methylation analysis, there may be a very long list of significant CpG sites to interpret. One question a researcher may have is, "which gene pathways are over-represented for differentially methylated CpGs?" In some cases it is relatively straightforward to link the top differentially methylated CpGs to genes that make biological sense in terms of the cell types or samples being studied, but there may be many thousands of CpGs significantly differentially methylated. In order to gain an understanding of the biological processes that the differentially methylated CpGs may be involved in, we can perform gene ontology or KEGG pathway analysis using the gometh function in the missMethyl package (Phipson et al., 2016).
Let us consider the first comparison, naive vs rTreg, with the results of the analysis in the DMPs table. The gometh function takes as input a character vector of the names (e.g. cg20832020) of the significant CpG sites, and optionally, a character vector of all CpGs tested. This is recommended particularly if extensive filtering of the CpGs has been performed prior to analysis. For gene ontology testing (default), the user can specify collection="GO". For testing KEGG pathways, specify collection="KEGG". In the DMPs table, the Name column corresponds to the CpG name. We will select all CpG sites that have adjusted p-value of less than 0.05.

## [1] 3021
# Get all the CpG sites used in the analysis to form the background all <-DMPs$Name # Total number of CpG sites tested length(all)

## [1] 439918
The gometh function takes into account the varying numbers of CpGs associated with each gene on the Illumina methylation arrays. For the 450k array, the numbers of CpGs mapping to genes can vary from as few as 1 to as many as 1200. The genes that have more CpGs associated with them will have a higher probability of being identified as differentially methylated compared to genes with fewer CpGs. We can look at this bias in the data by specifying plot=TRUE in the call to gometh (Figure 12). par(mfrow=c(1,1)) gst <-gometh(sig.cpg=sigCpGs, all.cpg=all, plot.bias=TRUE) ## Warning in alias2SymbolTable(flat$symbol): Multiple symbols ignored for one ## or more aliases The gst object is a data.frame with each row corresponding to the GO category being tested. Note that the warning regarding multiple symbols will always be displayed as there are genes that have more than one alias, however it is not a cause for concern.
The top 20 gene ontology categories can be displayed using the topGO function. For KEGG pathway analysis, the topKEGG function can be called to display the top 20 enriched pathways. From the output we can see many of the top GO categories correspond to immune system and T cell processes, which is unsurprising as the cell types being studied form part of the immune system. Typically, we consider GO categories that have associated false discovery rates of less than 5% to be statistically significant. If there aren't any categories that achieve this significance it may be useful to scan the top 5 or 10 highly ranked GO categories to gain some insight into the biological system.
The gometh function only tests GO and KEGG pathways. For a more generalised version of gene set testing for methylation data where the user can specify the gene set to be tested, the gsameth function can be used. To display the top 20 pathways, topGSA can be called. gsameth accepts a single gene set, or a list of gene sets. The gene identifiers in the gene set must be Entrez Gene IDs. To demonstrate gsameth, we are using the curated genesets (C2) from the Broad Institute Molecular signatures database. These can be downloaded as an RData object from the WEHI Bioinformatics website.
# load Broad human curated (C2) gene sets load(paste(dataDirectory,"human_c2_v5.rdata",sep="/")) # perform the gene set test(s) gsa <-gsameth(sig.cpg=sigCpGs, all.cpg=all, collection=Hs.c2) ## Warning in alias2SymbolTable(flat$symbol): Multiple symbols ignored for one ## or more aliases While gene set testing is useful for providing some biological insight in terms of what pathways might be affected by abberant methylation, care should be taken not to over-interpret the results. Gene set testing should be used for the purpose of providing some biological insight that ideally would be tested and validated in further laboratory experiments. It is important to keep in mind that we are not observing gene level activity such as in RNA-Seq experiments, and that we have had to take an extra step to associate CpGs with genes.

Differential variability
Rather than testing for differences in mean methylation, we may be interested in testing for differences between group variances. For example, it has been hypothesised that highly variable CpGs in cancer may contribute to tumour heterogeneity (Hansen et al., 2011). Hence we may be interested in CpG sites that are consistently methylated in one group, but variably methylated in another group.
Sample size is an important consideration when testing for differentially variable CpG sites. In order to get an accurate estimate of the group variances, larger sample sizes are required than for estimating group means. A good rule of thumb is to have at least ten samples in each group (Phipson & Oshlack, 2014). To demonstrate testing for differentially variable CpG sites, we will use a publicly available dataset on ageing GSE30870, where whole blood samples were collected from 18 centenarians and 18 newborns and profiled for methylation on the 450k array (Heyn et al., 2012). The data (age.rgSet) and sample information (age.targets) have been included as an R data object in the data archive you previously downloaded from figshare. We can load the data using the load command, after which it needs to be normalised and filtered as previously described.
# load data load(paste(dataDirectory,"ageData.RData",sep="/")) # remove probes with SNPs at CpG or single base extension (SBE) site age.mSetSqFlt <-dropLociWithSnps(age.mSetSqFlt, snps = c("CpG", "SBE")) # remove cross-reactive probes keep <-!(featureNames(age.mSetSqFlt) %in% xReactiveProbes$TargetID) age.mSetSqFlt <-age.mSetSqFlt [keep,] As this dataset contains samples from both males and females, we can use it to demonstrate the effect of removing sex chromosome probes on the data. The MDS plots below show the relationship between the samples in the ageing dataset before and after sex chromosome probe removal ( Figure 13). It is apparent that before the removal of sex chromosome probes, the sample cluster based on sex in the second principal component. When the sex chromosome probes are removed, age is the largest source of variation present and the male and female samples no longer form separate clusters. # tag sex chromosome probes for removal keep <-!(featureNames(age.mSetSqFlt) %in% ann450k$Name[ann450k$chr %in% c("chrX","chrY")]) age.pal <-brewer.pal(8,"Set1") par(mfrow=c(1,2)) plotMDS(getM(age.mSetSqFlt), top=1000, gene.selection="common", col=age.pal[factor(age.targets$Sample_Group)], labels=age.targets$Sex, main="With Sex CHR Probes") legend("topleft", legend=levels(factor(age.targets$Sample_Group)), text.col=age.pal) plotMDS(getM(age.mSetSqFlt[keep,]), top=1000, gene.selection="common", col=age.pal[factor(age.targets$Sample_Group)], labels=age.targets$Sex, main="Without Sex CHR Probes") legend("top", legend=levels(factor(age.targets$Sample_Group)), text.col=age.pal) # remove sex chromosome probes from data age.mSetSqFlt <-age.mSetSqFlt [keep,] We can test for differentially variable CpGs using the varFit function in the missMethyl package. The syntax for specifying which groups we are interested in testing is slightly different to the standard way a model is specified in limma, particularly for designs where an intercept is fitted (see missMethyl vignette for further details). For the ageing data, the design matrix includes an intercept term, and a term for age. The coef argument in the varFit function indicates which columns of the design matrix correspond to the intercept and grouping factor. Thus, for the ageing dataset we set coef=c(1,2). Note that design matrices without intercept terms are permitted, with specific contrasts tested using the contrasts.varFit function. Similarly to the differential methylation analysis, is it useful to plot sample-wise beta values for the differentially variable CpGs to ensure the significant results are not driven by artifacts or outliers (Figure 14). # get beta values for ageing data age.bVals <-getBeta(age.mSetSqFlt) par(mfrow=c(2,2)) sapply(rownames(topDV)[1:4], function(cpg){ plotCpg(age.bVals, cpg=cpg, pheno=age.targets$Sample_Group, ylab = "Beta values") }) An example of testing for differential variability when the design matrix does not have an intercept term is detailed in the missMethyl vignette.

Cell type composition
As methylation is cell type specific and methylation arrays provide CpG methylation values for a population of cells, biological findings from samples that are comprised of a mixture of cell types, such as blood, can be confounded with cell type composition (Jaffe & Irizarry, 2014). The minfi function estimateCellCounts facilitates the estimation of the level of confounding between phenotype and cell type composition in a set of samples. The function uses a modified version of the method published by Houseman et al. (2012) and the package FlowSorted.Blood.450k, which contains 450k methylation data from sorted blood cells, to estimate the cell type composition of blood samples.

Discussion
Here we present a commonly used workflow for methylation array analysis based on a series of Bio-conductor packages. While we have not included all the possible functions or analysis options that are available for detecting differential methylation, we have demonstrated a common and well used workflow that we regularly use in our own analysis. Specifically, we have not demonstrated more complex types of analyses such as removing unwanted variation in a differential methylation study (Leek et al., 2012;Maksimovic et al., 2015;Teschendorff et al., 2011), block finding (Aryee et al., 2014Hansen et al., 2011) or A/B compartment prediction (Fortin & Hansen, 2015. Our differential methylation workflow presented here demonstrates how to read in data, perform quality control and filtering, normalisation and differential methylation testing. In addition we demonstrate analysis for differential variability, gene set testing and estimating cell type composition. One important aspect of exploring results of an analysis is visualisation and we also provide an example of generating region-level views of the data.

Software versions
The R markdown file and/or R script for this version workflow can be downloaded from: https://github.com/ Oshlack/MethylationAnalysisWorkflow/tree/version3/scripts. A live version of this workflow that is regularly built using updated Bioconductor packages can be accessed on the Bioconductor website: https://www.bioconductor.org/ help/workflows/methylationArrayAnalysis/. The version of the workflow presented here uses the following packages available from Bioconductor (release 3.4):   Author contributions JM and BP designed the content and wrote the paper. AO oversaw the project and contributed to the writing and editing of the paper.

Competing interests
No competing interests were disclosed. As someone who has experience with R/Bioconductor and with genomics data, but not direct experience analyzing methylation array data, I found the workflow very useful and I would suggest it to anyone wanting to start analyzing this type of data.

Grant information
I do agree with the other reviewers that the value of the workflow will be greatly increased if the dataset used was available as an R object. The authors should consider submitting an experiment data package to Bioconductor to accompany the workflow. Alternatively, they could provide the dataset as a supplementary file.
As for the analysis itself, I only have one major question. Note that I do not have direct experience analyzing methylation array, so this is a genuine question rather than a criticism.
In gene expression analysis, we tend to perform filtering prior to normalization, while the authors here first normalize the data by quantile normalization and then filter out probes that are low quality and/or affected by SNPs. Wouldn't it be safer to perform filtering before normalization? I understand that given the few probes affected, the order has likely very little effect in this dataset. But I naively imagine that if there are many problematic probes and, say, the quality of the samples is confounded with the biology, there could many problematic probes and, say, the quality of the samples is confounded with the biology, there could be issues in using low quality probes for normalization.
Other minor points: I agree that the code should be re-run with the latest release of R and Bioconductor.
In the definition of \beta, \alpha should be defined, and its default value in getBeta() should be specified.
Spelling: most of the article uses British English spelling, but the word "normalization" is sometimes (but not always) spelled in American English.
A sentence describing what is the procedure implemented in preprocessQuantile() is needed for people not familiar with normalization.
I agree that it would be useful to provide a brief description of what is a contrasts matrix as this section could be confusing for people unfamiliar with statistical models.
For the same reason, the authors should add a brief explanation of the problem of multiple testing and what is the false discovery rate. Or at least provide references to the appropriate literature.

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
No competing interests were disclosed. While we agree that normalisation post-filtering makes sense, there are some practical aspects with the data objects that minfi uses which makes this difficult. Many (but not all) normalisation procedures in minfi accept an rgSet object, which can be thought of as a raw data object, which cannot easily be subset by CpG site. These normalisation procedures then output a different type of data object, such as MethylSet or GenomicRatioSet, which are much easier to work with in terms of filtering out problematic CpG sites. Due to the sheer number of CpG sites observed per sample (>450,000) we believe it shouldn't make too much difference for most datasets, especially if very poor quality samples are excluded prior to normalisation, although it is possible that there are exceptions to this.

Response to minor points:
We have spent some time modifying the workflow to run with the latest R and Bioconductor. We have added additional details regarding beta values, M-values and the offset in the paper. We have changed "normalization" to "normalisation" throughout the text. A sentence has been added about preprocessQuantile in the normalisation section. We have included some additional explanation of contrast matrices.
An additional paragraph was added explaining about the issues of multiple testing in very what it is this function does, for the benefit of someone who has never performed linear modeling before. Likewise, to explicitly state that coef=1 is referencing the first column of the contrast matrix. It should be stated what is the B-statistic which orders the topTable.
The authors should explain a bit more what is being shown in Figure 10 in the caption.
In the text and code the authors have written DNAseI, but I believe the more common capitalization is DNaseI.
The authors might consider commenting on the top GO categories and the associated FDR values. How far down the list should one look? Can the authors advise the reader how GO results should be reported? Is it fair to pick out the most relevant categories from this list and only report them?
It wasn't clear to me the difference between the and approaches. gometh gsameth It would be good to provide references to literature for "it has been hypothesised that highly variable CpGs in cancer are important for tumour progression". Thank you for reviewing our paper, Michael. In response to your comments/suggestions we have made the following changes:

References
The Smyth 2004 citations have been added the first time "moderated t-statistics" is mentioned A description of IDAT files has been added to the text along with a reference to a Bioconductor package that is specifically for reading IDAT files.
We have added to the legend for Figure 2 to clarify that the plot on the right is the same as the left but zoomed in.
There is now a reference to Hicks and quantro in included in the Normalisation section. Spelling mistakes and typos have been fixed. Function of is described: See response to Davide Risso. makeContrasts We now explicitly state that coef=1 is referencing the first column of the contrast matrix. Included explanation for B-statistic and citation. More detail about the plot has been added to the figure caption for Figure 10. Changed DNAseI to DNasel Typically we would consider GO categories that have associated FDRs less than 5% as significant. Some discussion of these points has been added to the gene set testing section.
The gometh function specifically tests only GO and KEGG pathways, whereas the gsameth No competing interests were disclosed.

Competing Interests:
Referee Response 06 Jul 2016 , Johns Hopkins Bloomberg School of Public Health, USA Peter Hickey p2: beta = M / (M + U + alpha), the alpha parameter should be explained. Also, both the definition of beta and Mvalue differ slightly from that given in the cited Du, P. et al. Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics 11, 587 (2010).
p4: This is *super* pedantic (sorry!): strictly speaking the IlluminaHumanMethylation450kmanifest package provides the Illumina manifest for the 450k array, which can then be accessed by using `minfi::getAnnotation()` Figure 2: Not immediately obvious that righthand plot is zoomed in version of lefthand plot. The caption could better explain this. p10: The code produces a warning. Would be helpful to the reader to comment on whether this is cause for concern in this case. p25: `islandData` apparently contains 0 ranges. This looks like a bug in the code.
p28: "For gene ontology testing (default), the user can specific collection = "GO" for KEGG testing collection = "KEGG""; this sentence seems incomplete or is perhaps missing a word p29 and p30: The code produces a warning. Would be helpful to the reader to comment on whether this is cause for concern in this case.
The workflow uses multiple packages and it's not always clear where each function comes from. This could be clarified e.g., by namespacing functions such as `limma::plotMDS()ì nstead of `plotMDS()` No competing interests were disclosed. Thanks for taking the time to review our workflow, Peter. In response to your suggestion we have made the data available and rerun the workflow using the latest R and Bioconductor. In response to your other comments: p2: beta = M / (M + U + alpha), the alpha parameter should be explained. Also, both the definition of beta and Mvalue differ slightly from that given in the cited Du, P. et al.
Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics 11, 587 (2010). This has been addressed. See response to Davide Risso. p3: Perhaps worth mentioning that a complete list of packages for analysing DNA methylation data can be accessed using BiocViews (https://www.bioconductor.org/packages/release/bioc/html/biocViews.html and https://www.bioconductor.org/packages/release/BiocViews.html#___DNAMethylation) This has been added to the paper. p4: "...loading all the package libraries..." should be "...loading all the packages..." The text has been modified accordingly. p4: Perhaps worth commenting on which of the loaded packages are methylation-focused and/or purpose of other packages, e.g., stringr, Gviz. The text has been modified accordingly. p4: This is *super* pedantic (sorry!): strictly speaking the IlluminaHumanMethylation450kmanifest package provides the Illumina manifest for the 450k array, which can then be accessed by using `minfi::getAnnotation()T he text has been modified accordingly. Figure 2: Not immediately obvious that righthand plot is zoomed in version of lefthand plot. The caption could better explain this. This has been clarified in the figure caption. p10: The code produces a warning. Would be helpful to the reader to comment on whether this is cause for concern in this case. A sentence has been included that explains the reason for the waring. Figure 9: Wondering whether helpful to have each panel with y-axis = [0, 1] As we are trying to highlight the differences between the groups tested for CpGs individual and not comparing between CpGs, we feel that the axes are appropriate for the purposes of "sanity checking" the results of the statistical analysis. p25: `islandData` apparently contains 0 ranges. This looks like a bug in the code. This was due to the fact that there were not any CpG islands present in the region being plotted; we have selected another region to plot that does have a CpG island so that islandData is no longer empty. p28: "For gene ontology testing (default), the user can specific collection = "GO" for KEGG testing collection = "KEGG""; this sentence seems incomplete or is perhaps missing a word This sentence has been modified. p29 and p30: The code produces a warning. Would be helpful to the reader to comment on whether this is cause for concern in this case. Added a sentence to the text to explain the warning. The workflow uses multiple packages and it's not always clear where each function comes from. This could be clarified e.g., by namespacing functions such as `limma::plotMDS()ì nstead of `plotMDS()Ẁ e don't feel it is a particularly useful exercise to change every function to include the package name. Searching the help for any of the functions will inform users which package