An end to end workflow for differential gene expression using Affymetrix microarrays

In this article, we walk through an end–to–end Affymetrix microarray differential expression workflow using Bioconductor packages. This workflow is directly applicable to current “Gene” type arrays, e.g. the HuGene or MoGene arrays but can easily adapted to similar platforms. The data re–analyzed is a typical clinical microarray data set that compares inflammed and non–inflammed colon tissue in two disease subtypes. We will start from the raw data CEL files, show how to import them into a Bioconductor ExpressionSet, perform quality control and normalization and finally differential gene expression (DE) analysis, followed by some enrichment analysis. As experimental designs can be complex, a self contained introduction to linear models is also part of the workflow.


Introduction
In this article we introduce a complete workflow for a typical (Affymetrix) microarray analysis. Data import, preprocessing, differential expression and enrichment analysis are discussed. We also introduce some necessary mathematical background on linear models along the way.
The data set used 1 is from a paper studying the differences between patients suffering from Ulcerative colitis (UC) or Crohn's disease (CD). This is a typical clinical data set consisting of 14 UC and 15 CD patients from which inflamed and non-inflamed colonic mucosa tissue was obtained via a biopsy. Our aim is to analyze differential expression (DE) between the tissues in the two diseases.

Download the raw data from from ArrayExpress
The first step of the analysis is to download the raw data CEL files. These files are produced by the array scanner software and contain the probe intensities measured. The data have been deposited at ArrayExpress and have the accession code E-MTAB-2967.
Each ArrayExpress data set has a landing page summarizing the data set, and we use the ArrayExpress Bioconductor package to obtain the ftp links to the raw data files (Data from Palmieri et al. on ArrayEpress).

Information stored in ArrayExpress
Each dataset at ArrayExpress is stored according to the MAGE-TAB (MicroArray Gene Expression Tabular) specifications as a collection of tables bundled with the raw data. The MAGE-TAB format specifies up to five different types of files, namely the Investigation Description Format (IDF), the Array Design Format (ADF), the Sample and Data Relationship Format (SDRF), the raw data files and the processed data files.
For use, the IDF and the SDRF file are important. The IDF file contains top level information about the experiment including title, description, submitter contact details and protocols. The SDRF file contains essential information on the experimental samples, e.g. the experimental group(s) they belong to.

Download the raw data and the annotation data
With the code below, we download the raw data from ArrayExpress 2 . It is saved in the directory raw_data_dir which defaults to the subdirectory rawDataMAWorkflow of the current working directory. The names of the downloaded files are returned as a list.
The raw data consists of one CEL file per sample (see below) and we use the CEL file names as row names for the imported data. These names are given in a column named Array.Data.File in the SDRF table. We turn the SDRF table into an AnnotatedDataFrame from the Biobase package that we will need later to create an ExpressionSet for our data 3 .

Bioconductor ExpressionSets
Genomic data can be very complex, usually consisting of a number of different bits and pieces, e.g. information on the experimental samples, annotation of genomic features measured as well as the experimental data itself In Bioconductor the approach is taken that these pieces should be stored in a single structure to easily manage the data.
The package Biobase contains standardized data structures to represent genomic data. The ExpressionSet class is designed to combine several different sources of information (i.e. as contained in the various MAGE-TAB files) into a single convenient structure. An ExpressionSet can be manipulated (e.g., subsetted, copied), and is the input to or output of many Bioconductor functions.
The data in an ExpressionSet consist of • assayData: Expression data from microarray experiments.
• metaData: A description of the samples in the experiment (phenoData), metadata about the features on the chip or technology used for the experiment (featureData), and further annotations for the features, for example gene annotations from biomedical databases (annotation).
• experimentData: A flexible structure to describe the experiment.
The ExpressionSet class coordinates all of these data, so that one does not have to worry about the details. However, some constrains have to be met. In particular, the rownames of the phenoData (which holds the content of the SDRF file) have to match the column names of the assay data (as they represent the sample identifiers), while the row names of the expression data have to match the row names of the featureData (as they represent the feature identifiers). This is illustrated in the figure.
You can use the functions pData and fData to extract the sample and feature annotation respectively from an ExpressionSet. The function exprs will return the expression data itself as a matrix.

Import of the raw microarray data
The analysis of Affymetrix arrays starts with CEL files. These are the result of the processing of the raw image files using the Affymetrix software and contain estimated probe intensity values. Each CEL file additionally contains some metadata, such as a chip identifier.
The function read.celfiles from the oligo 4 can be used to import the files. The package automatically uses pd.hugene.1.0.st.v1 as the chip annotation package as the chip-type is also stored in the .CEL files.
We specify our AnnotatedDataFrame created earlier as phenoData. Thus, We have to be sure that we import the CEL files in the order that corresponds to the SDRF table -to enforce this, we use the column Array.Data.
File of the SDRF table as the filenames argument.
Finally, we check whether the object created is valid. (e.g. sample names match between the different tables).
We collect the information about the CEL files and import and them into the variable raw_data: raw_data <-read.celfiles(filenames = file.path(raw_data_dir, SDRF$Array.Data.File), verbose = FALSE, phenoData = SDRF) validObject(raw_data) We now inspect the raw data a bit and retain only those columns that are related to the experimental factors of interest (identifiers of the individuals, disease of the individual and the mucosa type).
A wide range of quality control plots can be created using the package arrayQualityMetrics 5 . The package produces an html report, containing the quality control plots together with a description of their aims and an identification of possible outliers. We don't discuss this tool in detail here, but the code below can be used to create a report for our raw data.

Background adjustment, calibration, summarization and annotation
Background adjustment After the initial import and quality assessment, the next step in processing of microarray data is background adjustment. This is essential because a part of the measured probe intensities are due to non-specific hybridization and the noise in the optical detection system. Therefore, observed intensities need to be adjusted to give accurate measurements of specific hybridization.

Across-array normalization (calibration)
Without proper normalization across arrays, it is impossible to compare measurements from different array hybridizations due to many obscuring sources of variation. These include different efficiencies of reverse transcription, labeling or hybridization reactions, physical problems with the arrays, reagent batch effects, and laboratory conditions.

Summarization
After normalization, summarization is needed because on the Affymetrix platform transcripts are represented by multiple probes. For each gene, the background adjusted and normalized intensities need to be summarized into one quantity that estimates an amount proportional to the amount of RNA transcript.
After the summarization step, the summarized data can be annotated with various information, e.g. gene symbols and EMSEMBL gene identifiers. There is an annotation database available from Bioconductor for our platform, namely the package hugene10sttranscriptcluster.db.
You can view its content like this head(ls("package:hugene10sttranscriptcluster.db")) Additional information is available from the reference manual of the package. Essentially, the package provides a mapping from the transcript cluster identifiers to the various annotation data.
Old and new "probesets" of Affymetrix microarrays Traditionally, Affymetrix arrays (the so-called 3' IVT arrays) were probeset based: a certain fixed group of probes were part of a probeset which represented a certain gene or transcript (note however, that a gene can be represented by multiple probesets).
The more recent "Gene" and "Exon" Affymetrix arrays are exon based and hence there are two levels of summarization. The exon level summarization leads to "probeset" summary. However, these probesets are not the same as the probesets of the previous chips, which usually represented a gene/transcript. Furthermore, there are also no longer designated match/mismatch probes present on "Gene" type chips.
For the newer Affymetrix chips a gene/transcript level summary is given by "transcriptct clusters". Hence the appropriate annotation package is called hugene10sttranscriptcluster.db.
To complicate things even a bit more, note that the "Gene" arrays were created as affordable versions of the "Exon" arrays by taking the "good" probes from the Exon array. So the notion of a probeset is based on the original construction of the probesets on the Exon array, which contains usually at least four probes.
But since Affymetrix selected only a the subset of "good" probes for the Gene arrays, a lot of the probesets on the "Gene" arrays are made up of three or fewer probes. Thus, a summarization on the probeset/exon level is not recommended for "Gene" arrays but nonetheless possible by using the hugene10stprobeset.db annotation package.

One-go preprocessing in oligo
The package oligo allows us to perform background correction, normalization and summarization in one single step using a deconvolution method for background correction, quantile normalization and the RMA (robust multichip average) algorithm for summarization.
This series of steps as a whole is commonly referred to as RMA algorithm, although strictly speaking RMA is merely a summarization method 6-8 .

Background correcting Normalizing Calculating Expression
The parameter target defines the degree of summarization, the default option of which is "core", using transcript clusters containing "safely" annotated genes. Other options for target include "extended" and "full". For summarization on the exon level (not recommended for Gene arrays), one can use "probeset" as the target option.
Although other methods for background correction and normalization exist, RMA is usually a good default choice. RMA shares information across arrays and uses the versatile quantile normalization method that will make the array intensity distributions match. However, it is preferable to apply it only after outliers have been removed. The quantile normalization algorithm used by RMA works by replacing values by the average of identically ranked (with a single chip) values across arrays. A more detailed description can be found on the Wikipedia page about it.
An alternative to quantile normalization is the vsn algorithm, that performs background correction and normalization by robustly shifting and scaling log-scale intensity values within arrays 9 . This is less "severe" than quantile normalization.
Some mathematical background on normalization (calibration) and background correction A generic model for the value of the intensity Y of a single probe on a microarray is given by where B is a random quantity due to background noise, usually composed of optical effects and non-specific binding, α is a gain factor, and S is the amount of measured specific binding. The signal S is considered a random variable as well and accounts for measurement error and probe effects. The measurement error is typically assumed to be multiplicative so we can write: Here θ represents the logarithm of the true abundance, ϕ is a probe-specific effect, and ε accounts for the nonspecific error. This is the additive-multiplicative-error model for microarray data used by RMA and also the vsn algorithm 9 . The algorithms differ in the way B is removed and an estimate of θ is obtained. Quality assessment of the calibrated data We now produce a clustering and another PCA plot using the calibrated data. In order to display a heatmap of the sample-to-sample distances, we first compute the distances using the dist function. We need to transpose the expression values since the function computes the distances between the rows (i.e. genes in our case) by default. The default distance is the Euclidean one. However this can be changed and we choose the manhatten distance here (it uses Figure 5. Heatmap of the sample to sample distances for the calibrated data. absolute instead of squared distances). We set the diagonal of the distance matrix to NA in order to increase the contrast of the color coding. Those diagonal entries do not contain information since the distance of a sample to itself is always equal to zero. (qplot(PC1, PC2, data = dataGG, color = Disease, shape = Phenotype, main = "PCA plot of the calibrated data", size = I(2), asp = 1.0) + scale_colour_brewer(palette = "Set2")) dists <-as.matrix(dist(t(exp_palmieri), method = "manhattan")) colnames(dists) <-NULL diag(dists) <-NA rownames(dists) <-pData(palmieri_eset)$Factor.Value.phenotype. hmcol <-colorRampPalette(rev(brewer.pal(9, "PuOr")))(255) pheatmap(dists, col = rev(hmcol), clustering_distance_rows = "manhattan", clustering_distance_cols = "manhattan") The second PC roughly separates Crohn's disease from ulcerative colitis, while the first separates the tissues. This is what we expect: the samples cluster by their experimental conditions. On the heatmap plot we also see that the samples do not cluster strongly by tissue, confirming the impression from the PCA plot that the separation between the tissues is not perfect. The stripe in the heatmap might correspond to outlier that could potentially remove. The arrayQuality-Metrics package produces reports that compute several metrics that can be used for outlier removal.

Filtering based on intensity
We now filter out lowly expressed genes. Microarray data commonly show a large number of probes in the background intensity range. They also do not change much across arrays. Hence they combine a low variance with a low intensity. Thus, they could end up being detected as differentially expressed although they are barely above the "detection" limit and are not very informative in general. We will perform a "soft" intensity based filtering here, since this is recommended by limma's 10,11 user guide (a package we will use below for the differential expression analysis). However, note that a variance based filter might exclude a similar set of probes in practice. In the histogram of the gene-wise medians, we can clearly see an enrichment of low medians on the left hand side. These represent the genes we want to filter.
In order to infer a cutoff from the data, we inspect the histogram of the median-intensities. We visually fit a central normal distribution given by 0.5 · N(5.1, 1.18) to the probe-wise medians, which represents their typical behavior in the data set at hand.
Then we use the 5% quantile of this distribution as a threshold, We keep only those genes that show an expression higher than the threshold in at least as many arrays as in the smallest experimental group.

Annotation of the transcript clusters
Before we continue with the linear models for microarrays and differential expression we describe how to add "feature Data", i.e. annotation information to the transcript cluster identifiers stored in the featureData of our ExpressionSet. We use the function select from AnnotationDbi to query the gene symbols and associated short descriptions for the transcript clusters. For each cluster, we add the gene symbol and a short description of the gene the cluster represents.
Joining by: "PROBEID" # restore rownames after left_join rownames(fData(palmieri_final)) <-fData(palmieri_final)$PROBEID validObject(palmieri_final) [1] TRUE Alternatively, one can re-map the probes of the array to a current annotation, a workflow to do this for Illumina arrays is given in 12. Essentially, the individual probe sequences are re-aligned to an in-silico "exome" that consists of all annotated transcript exons.
In any case, the package pdInfoBuilder can be used to build custom annotation packages for use with oligo. In order to do this, PGF/CLF files (called "library files" on the Affymetrix website) as well as the probeset annotations are required. The probesets typically represent a small stretches of the genome (such as a single exon) and multiple probesets are then used to form a transcript cluster.
The CLF file contains information about the location of individual probes on the array. The PGF file then contains the individual probe sequences and shows the probeset they belong to. Finally, The probeset annotation .csv then contains information about which probesets are used in which transcript cluster. Commonly, multiple probesets are used in one transcript cluster and some probesets are contained in multiple transcript clusters.

A short overview of linear models
I am afraid this section is rather technical. However general experience shows that most questions on the Bioconductor support site about packages using using linear models like limma 10 , DESeq2 13 and edgeR 14 are actually not so much about the packages themselves but rather about the underlying linear models. It might also be helpful to learn a bit of linear algebra to understand the concepts better. The Khan Academy offers nice (and free) online courses. Mike Love's and Michael Irizzary's genomics class is also a very good resource, especially its section on interactions and contrasts.

Regression models
In regression models we use one variable to explain or predict the other. It is customary to plot the predictor variable on the x-axis and the predicted variable on the y-axis. The predictor is also called the independent variable, the explanatory variable, the covariate, or simply x. The predicted variable is called the dependent variable, or simply y.
In a regression problem the data are pairs (x i , y i ) for i = 1, . . . , n. For each i, y i is a random variable whose distribution depends on x i . We write ( ) .
The above expresses y i as a systematic or explainable part g(x i ) and an unexplained part ε i . Or more informally: response = signal + noise. g is called the regression function. Once we have an estimate ĝ of g, we can compute The r i 's are called residuals. The ε i 's themselves are called errors.
Residuals are used to evaluate and assess the fit of models for g. Usually one makes distributional assumption about them, e.g. that they are independent and identically normally distributed with identical variance σ 2 and mean zero: This allows to derive statistical tests for model coefficients.

Linear regression models
Linear regression is a special case of the general regression model. Here, we combine the predictors linearly to produce a prediction. If we have only single predictor x, the simple linear regression model is: We can of course always add more predictors, let their total number be denoted by p. Then we get a multiple linear regression: The equation for a multiple linear regression model can also be written in matrix form (we will denote matrices and vectors in bold font).
With X being the so called design matrix: n is a column vector of ones called the intercept and X p = (x p1 , . . . , x pn ) T is a column vector of measurements for covariate p. The regression coefficients are commonly estimated by the method of "ordinary least squares" (OLS): leading to the estimate of the coefficient vector Creating design matrices in R To get an idea of what design matrices look like, we consider several examples. It is important to know some fundamentals about design matrices in order to be able to correctly transfer a design of a particular study to an appropriate linear model.
We will use the base R functions: . . . in order to produce design matrices for a variety of linear models. R uses the formula interface to create design matrices automatically. In the first example, we have two groups of two samples each. Using formula and model.matrix we will create a model matrix with so called treatment contrast parameterization (the default setting in R). This means that an intercept is included in the model, i.e. X 0 = n and X 1 is equal to 1 if the samples belongs to group two and zero otherwise as we can see in the following code chunk.
two_groups <-factor(c(1, 1 ,2, 2)) f <-formula(~ two_groups) model.matrix(f) [1] 0 1 attr(,"contrasts") attr(,"contrasts")$two_groups [1] "contr.treatment" This design is called treatment contrast parameterization for an obvious reason: the first column of the design matrix represents a "base level", i.e the mean β 0 for group one and the second column, corresponding to β 1 , represents the difference between the group means since all group two samples have means represented by β 0 + β 1 . As β 0 is the mean of group 1, β 1 corresponds to the difference of the means of group two and group one and thus shows the effect of a "treatment".
However, this design is not orthogonal, i.e. the columns of the design matrix are not independent. We can construct an equivalent orthogonal design as follows: Here, we loose the nice direct interpretability of the coefficients. Now β 1 is simply the mean of the second group. We will discuss the extraction of interesting contrasts (i.e. linear combinations of coefficients) from a model like this below.
We explicitly excluded the intercept by specifying it as zero. Commonly it makes sense to include an intercept in the model, especially in more complex models. We can specify a more complex design pretty easily: if we have two independent factors, the base mean now corresponds to the first levels of the two factors.
x <-factor(c(1, 1, 1, 1, 2, 2, 2, 2)) y <-factor(c("a", "a", "b", "b", "a", "a", "b", "b")) two_factors <-model.matrix(~ x + y) two_factors (Intercept) x2 yb 1 1 0 0 2 1 0 0 3 1 0 1 4 1 0 1 5 1 1 0 6 1 1 0 7 1 1 1 8 1 1 1 attr(,"assign") [1] 0 1 2 attr(,"contrasts") attr(,"contrasts")$x [1] "contr.treatment" attr(,"contrasts")$y [1] "contr.treatment" The "drop one" strategy is the default method for creating regression coefficients from factors in R. If a factor has d levels, adding it to the model will give you d − 1 regression coefficients corresponding to d − 1 columns in a design matrix. Apart from excluding the intercept, you can also use the I function to treat a covariate as it is without using the formula syntax. The code below includes z 2 as a covariate. 1 4 16 attr(,"assign") [1] 0 1 2 Singular model matrices Whatever your model matrix looks like, you should make sure that it is non-singular. Singularity means that the measured variables are linearly dependent and leads to regression coefficients that are not uniquely defined. In linear algebra terms, we say that the matrix does not have full rank, which for design matrices means that the actual dimension of the space spanned by the column vectors is in fact lower than the apparent one. This leads to a redundancy in the model matrix, since some columns can be represent by linear combinations of other columns.
For design matrices, which contain factors, this happens if two conditions are confounded, e.g. in one experimental group there are only females and in the other group there are only males. Then the effect of sex and experimental group cannot be disentangled.
Let's look at an example. We set three factors, of which the third one is nested with the first two. We can check the singularity of the model matrix by computing its so called singular value decomposition and check it's minimal singular value. If this is zero, the matrix is singular. As we can see, this is indeed the case here.

Using contrasts to test hypotheses in linear regression
In differential expression analysis, our most important covariates will be factors that differentiate between two or more experimental groups, e.g. the covariate X p = (x p1 , . . . , x pn ) is either zero or one depending on which group the sample belongs to.
We will illustrate this concept using a small data set called toycars from the DAAG package. The data set toycars gives the distance traveled by one of three different toy cars on a smooth surface, starting from rest at the top of a 16 inch long ramp tilted at varying angles. We have the variables: • angle: Angle We transform car into a factor so that R performs the necessary parameterization of the contrasts automatically.
toycars$car <-as.factor(toycars$car) qplot(car, distance, data = toycars, geom="boxplot") + geom_point() By looking at the box plots of distance by car, we can clearly see differences between the three types of cars. We can now fit a linear model with distance as the dependent variable and car and angle as the predictors. As we can see from the linear model output, the treatment contrast parameterization was used, with car 1 being the base level. Testing general linear hypotheses The estimated coefficients now give us the difference between the distance traveled between car 1 and car 2 (0.11) and car 1 and car 3 (-0.08), and the associated t-tests of these coefficients. However we cannot see a test of car 2 vs. car 3. This contrast test would correspond to testing the difference between the car 2 and car 3 regression coefficients. Thus, contrasts of interest to us may not readily correspond to coefficients in a fitted linear model. However, one can easily test general linear hypotheses of the coefficients of the form:

0
( 1) ( 1) 1 1 Where C is the contrast matrix containing the between group differences of interest, q is the total number of comparisons to be performed and α contains the difference to be tested, this is usually a vector of zeros. If one tests multiple coefficients at once (e.g. β 1 = 0 and β 2 = 0 ) the corresponding test statistic is F-distributed. If one just tests linear combinations of coefficients, e.g. β 1 − β 2 = 0, β 1 − β 2 − 2β 3 = 0 or something similar the test statistic has t-distribution. The function summary for lm reports the results of β 1 = β 2 = . . . = β p = 0 (F-test) and the t-tests of β j = 0 each of the coefficients.
Note that the model does not actually have to be refitted in order to test the contrasts. This makes contrast matrix based testing more efficient and convenient than reformulating the model using a new paramerization of the factors to obtain the desired tests. We can use the function glht from the multcomp package to test these general linear hypotheses 15 . Let us assume we want all pairwise comparisons between the cars, this can be achieved by defining a specific contrast matrix for our current model as given below.
There are three such comparisons and we can print the results by using the summary function. As we can see the estimates for the contrasts already contained in the original model agree with the obtained results from contrast fit.
We can also fit a linear model without an intercept to the toycars data set. Now, the coefficients derived from the "car" factor represent car-wise means. Thus, the contrasts we have to form change, however, the results for the group comparisons do not.

Linear models for microarrays
We now apply linear models to microarrays. Specifically, we discuss how to use the limma for differential expression analysis. The package is designed to analyze complex experiments involving comparisons between many experimental groups simultaneously while remaining reasonably easy to use for simple experiments. The main idea is to fit a linear model to the expression data for each gene. Empirical Bayes and other shrinkage methods are used to borrow information across genes for the residual variance estimation leading to a "moderated" t-statistics, and stabilizing the analysis for experiments with just a small number of arrays 11 .
In the following, we use appropriate design and contrast matrices for our linear models and fit a linear model to each gene separately.
A linear model for the data The original paper is interested in changes in transcription that occur between inflamed and adjacent non-inflamed mucosal areas of the colon. This is studied in both inflammatory bowel disease types.
Since we have two arrays per individual, the first factor we need is a blocking factor for the individuals that will absorb differences between them. Then we create a factors that give us the grouping for the diseases and the tissue types. We furthermore simplify the names of the diseases to UC and DC, respectively. Then, we create two design matrices, one for each of the two diseases as we will analyze them separately in order to follow the analysis strategy of the original paper closely (one could also fit a joint model to the complete data set, however, the two diseases might behave very differently so that a joint fit might not be appropriate).

Contrasts and hypotheses tests
We now fit the linear models and define appropriate contrasts to test hypotheses of interest. We want to compare the inflamed to the the non-inflamed tissue. Thus, we create a contrast matrix consisting of one row. limma 's function makeContrasts creates this matrix from a synbolic description of the contrast of interest. We can fit the linear model, compute the moderated t-statistics by calling the eBayes function and finally extract the number of differentially expressed genes while controlling the FDR by requiring BH-corrected p-value below a certain threshold.

Extracting results
Results can be extracted by use of the topTable function. We extract the comparisons for both Crohn's disease as well as ulcerative colitis and sort the results by their absolute t-statistics. As a diagnostic check, we also plot the p-value histogram: We expect a uniform distribution for the p-values that correspond to true null hypotheses, while the a peak near zero shows a enrichment for low p-values corresponding to differentially expressed (DE) genes. A p-value less than 0.001 was used in the original paper as a significance cutoff leading to 298 (CD) and 520 (UC) DE-genes for the two diseases.
We call around 500/1000 genes in the two conditions at the same cutoff, this higher number of DE genes identified is probably due to the increased power from the blocking according to the individuals and the moderated variance estimation that limma performs. hist(table_UC$P.Value, col = brewer.pal(3, name = "Set2")[2], main = "inflammed vs non-imflamed -Ulcerative colitis", xlab = "p-values")

Comparison to the paper results
We now compare our list of differentially expressed genes to the results obtained in the paper. The paper results can be downloaded as excel files from http://links.lww.com/IBD/A795. We save it in an .xlsx file named palmieri_DE_ res.xlsx. The paper results are given as identified differentially expressed genes with a p-value less than 0.001, which corresponds to an FDR of 0.05 in Crohn's disease and 0.02 in ulcerative colitis. There are four tables in total giving the list of up and downregulated genes in CD and UC respectively. In the code below, we extract the gene symbols from the excel table and then compare them to the differentially expressed genes we identify at a p-value of 0.001.

Gene ontology (GO) based enrichment analysis
We can now try characterize the identified differentially expressed genes a bit better by performing an GO enrichment analysis. Essentially the gene ontology (http://www.geneontology.org/) is a hierarchically organized collection of functional gene sets [16][17][18] .

Matching the background set of genes
The function genefinder from the genefilter 19 will be used to find a background set of genes that are similar in expression to the differentially expressed genes. We then check whether the background has roughly the same distribution of average expression strength as the foreground.
We do this in order not to select a biased background since the gene set testing is performed by a simple Fisher test on a 2×2 table. Note that this approach is very similar to commonly used web tools like GOrilla 20 . Here we focus on the CD subset of the data.
For every differentially expressed gene, we try to find genes with similar expression. multidensity(list( all= table_CD[,"AveExpr"] , fore= table_CD[DE_genes_CD , "AveExpr"], back= table_CD[rownames(table_CD) %in% back_genes, "AveExpr"]), col = c("#e46981", "#ae7ee2", "#a7ad4a"), xlab="mean expression", main = "DE genes for CD -background -matching") We can see that the matching returned a sensible result and can now perform the actual testing. For this purpose we use the topGO which implements a nice interface to Fisher testing and also has additional algorithms taking the GO structure into account, by e.g. only reporting the most specific gene set in the hierarchy 21 .
The GO has three top ontologies, cellular component (CC), biological processes (BP), and molecular function (MF). For illustrative purposes we limit ourselves to the BP category here.

Running topGO
We first create a factor all_genes which indicates for every gene in our background/universe, whether it is differentially expressed or not.
We run two common tests: an ordinary Fisher test for every GO category, and the "elim" algorithm, which tries to incorporate the hierarchical structure of the GO and tries "decorrelate" it in order to report the most specific significant term in the hierarchy.
The algorithm starts processing the nodes/GO categories from the highest (bottommost) level and then iteratively moves to nodes from a lower level. If a node is scored as significant, all of its genes are marked as removed in all ancestor nodes. This way, the "elim" algorithm aims at finding the most specific node for every gene.
The tests uses a 0.01 p-value cutoff by default.

Visualization of the GO-analysis results
A graph of the results can also be produced. Here we visualize the three most significant nodes according to the Fisher elim algorithm in the context of the GO hierarchy.
Gene set enrichment analysis has been a field of very extensive research in bioinformatics. For additional approaches see the topGO vignette and the references therein and also in the GeneSetEnrichment view.

A pathway enrichment analysis using reactome
The package ReactomePA offers the possibility to test enrichment of specific pathways using the free, open-source, curated and peer reviewed pathway Reactome pathway database 22,23 . The package requires entrez identifiers, so we convert our PROBEIDs (trancript cluster identifiers) to entrez identifiers using the function mapIDs from the package AnnotationDbi. This will create a named vector that maps the PROBEIDs to the entrez ones.
entrez_ids <-mapIds(hugene10sttranscriptcluster.db, keys = rownames(table_CD), keytype="PROBEID", column = "ENTREZID") We can now run the enrichment analysis that performs a statistical test based on the hypergeoemtric distribution that is the same as a one sided Fisher-test, which topGO calls "Fisher-classic". Details can be found in the vignette of the DOSE package 24 .  Note that we trimmed pathway names to 20 characters.

Visualizing the reactome based analysis results
The reactomePA package offers nice visualization capabilities. The top pathways can be displayed as a bar char that displays all categories with a p-value below the specified cutoff.

barplot(reactome_enrich)
The "enrichment map" displays the results of the enrichment analysis as a graph, where the color represents the p-value of the pathway and the edge-thickness is proportional to the number of overlapping genes between two pathways.
The package clusterProfiler 25 can also perform these analyses using downloaded KEGG data. Furthermore, the package EnrichmentBrowser 26 additionally offers network-based enrichment analysis of individual pathways. This allows the mapping of the expression data at hand to known regulatory interactions.

Session information
As the last part of this document, we call the function sessionInfo, which reports the version numbers of R and all the packages used in this session. It is good practice to always keep such a record of this as it will help to track down what has happened in case an R script ceases to work or gives different results because the functions have been changed in a newer version of one of your packages. By including it at the bottom of a script, your reports will become more reproducible.  The session information should also always be included in any emails to the Bioconductor support site along with all code used in the analysis. Klaus illustrates an analysis workflow of Affymetrix microarrays for an experimental design with paired samples from two tissues in two separate diseases. The workflow covers steps including downloading and loading raw CEL files into R/Bioconductor, pre-processing and normalization, differential analysis via limma, and functional/pathway enrichment analysis, with detailed R scripts throughout. The workflow is a nice complement to resources that are already available on similar topics (notably the Limma user's guide) as it unites into a single document topics that have been discussed elsewhere in diverse forums. :

Major remarks
Although the normalization/linear model/contrast sections are technically correct, I find them to be overly disruptive to the presentation of the analysis. In particular, I would suggest moving the normalization section to an appendix. For the linear models section, presentation of ordinary least squares estimators, matrix notation, distributional assumptions of residuals, etc. seems outside of the scope of this work (or could potentially be presented in an appendix as well). I think that it would be more helpful if, rather than using a separate illustrative example based on the toycars data, the author described in detail: 1) the design matrix for the Palmieri example; 2) a linear model for a single gene from the Palmieri data; 3) a linear model fit independently for each gene, sharing information across all genes (limma), with a brief discussion about the advantage of such an approach; 4) a simplified discussion of various contrasts relevant for this particular study.
I was surprised to see that differential analysis conclusions are based on the raw p-values rather than those adjusted for multiple testing.
I have two additional suggestions that could be of great practical interest to many readers: I think it would be nice to have a discussion (or at least some references) discussing what can be done if hypotheses are NOT met (e.g., the raw p-values are not uniformly distributed between 0 and 1 for genes under H0). I have in mind the author's recent discussion about using fdrtool to estimate the variance of the null model in the context of differential analyses via DESeq2 (http://www-huber.embl.de/users/klaus/Teaching/DESeq2Predoc2014.html).
In addition to the fixed patient effect model presented here, I think it would be interesting to present an alternative strategy possible with for such a design: estimating correlation within patient limma blocks using duplicateCorrelation(), see for example section 17.3.6 in the User's Guide. limma Note that I am not suggesting a detailed comparison or presentation of mixed models (that is of course well beyond the scope of this work!), but it may be useful to discuss the code/results for 2. 1.

4.
Note that I am not suggesting a detailed comparison or presentation of mixed models (that is of course well beyond the scope of this work!), but it may be useful to discuss the code/results for such an approach.
: Minor remarks I would suggest specifying in the abstract that the study "...compares paired inflamed and non-inflamed colon tissue...", as this was not immediately clear otherwise. In addition, on page 2/paragraph 1, the author writes that the original paper studied "differences between patients suffering UC or CD", which makes it sound like the disease comparison was of interest, when in fact it was the intra-patient tissue differences that were studied. It may in fact be helpful to present the experimental design in greater detail at the start of the paper; it was not clear to me until page 23 that the design actually involved paired samples from individuals.
Page 12, the author says "In order to infer a cutoff from the data, we inspect the histogram of the median-intensities. We visually fit a central normal distribution given by 0.5 · N(5.1, 1.18) to the probe-wise medians". From the R code, it appears that the distribution fit was N(emp_mu = 5.3, emp_sd = 1.19), where emp_mu and emp_sd were estimated from the median expression values --this is unclear from the expression "visually fit" in the text description. Also unclear to me is why prop_cental was set to be 0.5?
The phrase "from the XXXX" is often used when referencing R packages, rather than "from XXXX" or "from the XXX package".
The comparison to the original paper results (page 26) seems unnecessary.
I found it a bit distracting that none of the figures are explicitly referred to by number in the text (e.g., "In Figure 1, ...").
There are occasionally very verbose outputs included in the text (e.g. head(pData(raw_data))) that produce multiple pages of output that are not particularly useful.
Since a very large number of R packages are loaded for the analysis on page 2, it would likely be helpful to either reduce the number of packages needed to be the strict minimum necessary, or to group packages by theme/functionality via comments, for example: Klaus illustrates an analysis workflow of Affymetrix microarrays for an experimental design with paired samples from two tissues in two separate diseases. The workflow covers steps including downloading and loading raw CEL files into R/Bioconductor, pre-processing and normalization, differential analysis via limma, and functional/pathway enrichment analysis, with detailed R scripts throughout. The workflow is a nice complement to resources that are already available on similar topics (notably the Limma user's guide) as it unites into a single document topics that have been discussed elsewhere in diverse forums. #Major remarks: # remark 1 Although the normalization/linear model/contrast sections are technically correct, I find them to be overly disruptive to the presentation of the analysis. In particular, I would suggest moving the normalization section to an appendix. For the linear models section, presentation of ordinary least squares estimators, matrix notation, distributional assumptions of residuals, etc. seems outside of the scope of this work (or could potentially be presented in an appendix as well). I think that it would be more helpful if, rather than using a separate illustrative example based on the toycars data, the author described in detail: 1) the design matrix for the Palmieri example; 2) a linear model for a single gene from the Palmieri data; 3) a linear model fit independently for each gene, sharing information across all genes (limma), with a brief discussion about the advantage of such an approach; 4) a simplified discussion of various contrasts relevant for this particular study. Answer: The linear model sections not directly related to the dataset at hand were removed from the workflow. In general, we have tried to tailor the revised workflow to beginners in R and Bioconductor. Therefore, rather than discussing elaborate This is helpful for the reader who needs to deviate from our workflow for his or her own data and maintains a straightforward workflow design.
Concerning the single points mentioned in the comment above: 1) The design matrix for the Palmieri example is now explained in detail.
2) In the subsequent part of the workflow, the advise of the reviewer to implement a linear model for a single gene from the Palmieri data was adopted by fitting a linear model to the "CRAT" gene, and testing it for differential expression.
3) The advantage of shared information across all genes is discussed briefly when introducing the "eBayes"-method in the "Contrasts and Hypotheses tests" part.

4)
In this workflow, we restrict the contrast analysis to the contrast between non-inflamed and inflamed tissue, as it is the one that is also analyzed in the original paper, and do not include additional contrasts in order to keep the workflow concise.
# remark 2 I was surprised to see that differential analysis conclusions are based on the raw p-values rather than those adjusted for multiple testing.
Answer: This is indeed a possible cause of confusion. In the workflow, parts of the analysis are based on raw p-values in order to make the results comparable to the original paper. However, we now mention the caveats of this approach more, introduce FDRs in a "hands-on" manner explicitly and caution against the use of raw p--values in practice.
For the subsequent enrichment analyses, DE genes are identified using an FDR cutoff of 10%.
I have two additional suggestions that could be of great practical interest to many readers: # remark 3 I think it would be nice to have a discussion (or at least some references) discussing what can be done if hypotheses are NOT met (e.g., the raw p-values are not uniformly distributed between 0 and 1 for genes under H0). I have in mind the author's recent discussion about using fdrtool to estimate the variance of the null model in the context of differential analyses via DESeq2 (http://www-huber.embl.de/users/klaus/Teaching/DESeq2Predoc2014.html).
Answer: As mentioned above, and in concordance with the requests of Jim, we tried to avoid introducing any additional (advanced) methods in order we tried to avoid introducing any additional (advanced) methods in order to not confuse the novice R/Bioconductor user. Nevertheless, for workflow users encountering the problem of unusual p-value distributions, we have included references to this phenomenon and its implications by referring to the article on false discovery rate estimation by Korbininan Strimmer and chapter 1-6 of Efron's book on Large-Scale Inference.

# remark 4
In addition to the fixed patient effect model presented here, I think it would be interesting to present an alternative strategy possible with limma for such a design: estimating correlation within patient blocks using duplicateCorrelation(), see for example section 17.3.6 in the limma User's Guide. Note that I am not suggesting a detailed comparison or presentation of mixed models (that is of course well beyond the scope of this work!), but it may be useful to discuss the code/results for such an approach.
Answer: We acknowledge the advantages in certain applications of a mixed model using duplicateCorrelation() compared to a fixed patient effect model used in this workflow.
Given that the workflow is tailored towards unexperienced users, we consider the fixed patient effect model more intuitively understandable. However, we now note explicitly that such an analysis could be performed.
Minor remarks: # remark 5 I would suggest specifying in the abstract that the study "...compares paired inflamed and non-inflamed colon tissue...", as this was not immediately clear otherwise. In addition, on page 2/paragraph 1, the author writes that the original paper studied "differences between patients suffering UC or CD", which makes it sound like the disease comparison was of interest, when in fact it was the intra-patient tissue differences that were studied. It may in fact be helpful to present the experimental design in greater detail at the start of the paper; it was not clear to me until page 23 that the design actually involved paired samples from individuals.
Answer: The experimental design is now described in greater detail in abstract and introduction. Also, the sentence that the original paper studied "differences between patients suffering from UC or CD" was removed.

# remark 6
Page 12, the author says "In order to infer a cutoff from the data, we inspect the histogram of the median-intensities. We visually fit a central normal distribution given by 0.5 · N(5.1, 1.18) to the probe-wise medians". From the R code, it appears that the distribution fit was N(emp_mu = 5.3, emp_sd = 1.19), where emp_mu and emp_sd were estimated from the median expression values --this is unclear from the expression "visually fit" in the text description. Also unclear to me is why prop_cental was set to be 0.5? Answer: In an attempt to make low-intensity-filtering more intuitive, we removed the fitting a normal distribution the the histogram of the probe-wise medians. Instead, we now visually set a vertical cutoff line to the histogram and filter genes with a lower intensity than the cutoff in at least as many samples as the smallest experimental group has.
# additional minor remarks The phrase "from the XXXX" is often used when referencing R packages, rather than "from XXXX" or "from the XXX package".
Answer: Thank you, the respective sentences were corrected accordingly.
The comparison to the original paper results (page 26) seems unnecessary.
Answer: The comparison to the original paper serves as a "proof of principle" for the implemented workflow (to show that what we are doing makes sense). Additionally, we wanted to show that it is relatively straightforward to re--analyze publicly available microarray data using R and Bioconductor.
I found it a bit distracting that none of the figures are explicitly referred to by number in the text (e.g., "In Figure 1, ...").
Answer: Thank you, the respective parts were changed accordingly.
There are occasionally very verbose outputs included in the text (e.g. head(pData(raw_data))) that produce multiple pages of output that are not particularly useful.
Answer: The output was shortened where appropriate. The pdata call in particular was retained, as we feel that it is interesting to the readers to see what kind of information comes with a dataset from Array Express.
Answer: Thanks, we tried to perform thorough spell-checking in the revised version.
Since a very large number of R packages are loaded for the analysis on page 2, it would likely be helpful to either reduce the number of packages needed to be the strict minimum necessary, or to group packages by theme/functionality via comments, for example: This manuscript is intended to take the reader through a complete analysis of Affymetrix Gene ST arrays, based on a set of arrays downloaded from ArrayExpress. The author covers each step from quality control of the raw data all the way to making comparisons using linear models and testing and visualizing pathways or gene sets.

Major comments
While this manuscript is technically correct (e.g., the code does what the author claims, the explanatory text is valid), I am not sure it is as useful as it could be for its intended audience. In other words, this manuscript is intended to provide an inexperienced reader (inexperienced in either R/Bioconductor or statistics or both) with a road map they can follow to learn how to analyze microarray data. However, both the code and the statistical explanations are far too complex to be useful for such a person.
As an example, the author explains in mathematical terms what the background correction and As an example, the author explains in mathematical terms what the background correction and summarization steps are intended to accomplish. While this is an important step in the analysis, this could instead be explained in heuristic terms that would be far more approachable for a less statistically savvy audience, while still conveying the general idea.
The section on linear modeling and design matrices are similarly impenetrable for non-statisticians. The has dozens of examples of model matrices, with clear interpretations of the model limma User's Guide coefficients. Yet as the author notes, this is probably the number one question on the Bioconductor support site. Rather than re-explaining something that most people clearly don't understand, it would be much more helpful to focus on a single model matrix, and provide a clear, heuristic explanation. The easiest to understand is the most basic orthogonal model that computes the mean of each group, followed by a contrast matrix to make comparisons of interest.
The overview on linear models is well beyond the scope of this manuscript, and should be excised. The same is true for the section on testing general linear hypotheses. Ideally, an analyst using these tools would understand what they are doing from a statistical perspective, but it's difficult enough for a novice user to comprehend what the code is doing without trying to also understand the statistics.
Similarly, the code is more complex than necessary. If the goal is to teach novice users about Bioconductor packages, then the code should be restricted as much as possible to those packages and base R. While ggplot2 style graphics, magrittr style function piping and dplyr two-table verbs may be useful for more advanced R users, in this context they are an added distraction.
An example of overly-complex code is the filtering step. There are any number of ways to filter out genes that are arguably not expressed. The example is a very sophisticated way to perform this task, but a novice who is just learning doesn't require sophistication, they require something they can understand. Choosing a cutoff based on the distribution of probesets across each array, and then filtering all genes where fewer than 14 samples exceed this cutoff is not very sophisticated, but it is easy to understand, and would only require a few lines of code.
The learning curve for R and Bioconductor is steep, and making this manuscript both simpler in terms of the code, and more focused by excluding most of the statistics and any example analyses that do not involve the microarray data would make it more approachable for inexperienced users.

Minor comments
In the download step, getAE() is used to download the data, but then the SDRF file is downloaded directly. This is confusing, as getAE() has already downloaded that file. Is there a particular reason for the extra step?
To test for model matrices that are not full rank, it's easier to use either nonEstimable() or is.fullrank() from the limma package.

Conclusion
As noted above, this manuscript is technically accurate, and the code does exactly what the author claims. But it is too ambitious for the intended audience. Most people who could benefit from this manuscript are not statistically savvy, and the statistical sections will simply confuse them. In addition, it is not likely that many potential readers will be familiar with packages from the 'Hadleyverse', and not likely that many potential readers will be familiar with packages from the 'Hadleyverse', and incorporating those packages in the manuscript makes the code more difficult to understand.
Paring the manuscript down to very simple heuristic statistical explanations, and limiting the code to functions from base R and the various Bioconductor packages being illustrated would make the manuscript more useful for its intended audience.
No competing interests were disclosed. Competing Interests: I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above. While this manuscript is technically correct (e.g., the code does what the author claims, the explanatory text is valid), I am not sure it is as useful as it could be for its intended audience. In other words, this manuscript is intended to provide an inexperienced reader (inexperienced in either R/Bioconductor or statistics or both) with a road map they can follow to learn how to analyze microarray data. However, both the code and the statistical explanations are far too complex to be useful for such a person.
As an example, the author explains in mathematical terms what the background correction and summarization steps are intended to accomplish. While this is an important step in the analysis, this could instead be explained in heuristic terms that would be far more approachable for a less statistically savvy audience, while still conveying the general idea.
Answer: As suggested, the general style of the workflow was changed a lot in order to guide inexperienced readers through the analysis of microarray data. Code is explained in greater detail, and all technical parts not directly related to the data at hand have either been removed or improved.
However, we have decided to keep the the mathematical explanation of background correction and summarization, as we feel that the (rather simple) formulas are easier to understand than text.

## comment 2
The section on linear modeling and design matrices are similarly impenetrable for non-statisticians. The limma User's Guide has dozens of examples of model matrices, with clear interpretations of the model coefficients. Yet as the author notes, this is probably the number one question on the Bioconductor support site. Rather than re-explaining something that most people clearly don't understand, it would be much more helpful to focus on a single model matrix, and provide a clear, heuristic explanation. The easiest to understand is the most basic orthogonal provide a clear, heuristic explanation. The easiest to understand is the most basic orthogonal model that computes the mean of each group, followed by a contrast matrix to make comparisons of interest.
The overview on linear models is well beyond the scope of this manuscript, and should be excised. The same is true for the section on testing general linear hypotheses. Ideally, an analyst using these tools would understand what they are doing from a statistical perspective, but it's difficult enough for a novice user to comprehend what the code is doing without trying to also understand the statistics.
Answer: We gladly adopted this point of criticism and eliminated the rather technical part in which the theory behind design matrices and linear modelling is explained. Instead, we focused on explaining the design matrix we haved used in the dataset at hand in easy-to-follow terms and present its use in the context of single linear model for a single gene.

## comment 3
Similarly, the code is more complex than necessary. If the goal is to teach novice users about Bioconductor packages, then the code should be restricted as much as possible to those packages and base R. While ggplot2 style graphics, magrittr style function piping and dplyr two-table verbs may be useful for more advanced R users, in this context they are an added distraction.
An example of overly-complex code is the filtering step. There are any number of ways to filter out genes that are arguably not expressed. The example is a very sophisticated way to perform this task, but a novice who is just learning doesn't require sophistication, they require something they can understand. Choosing a cutoff based on the distribution of probesets across each array, and then filtering all genes where fewer than 14 samples exceed this cutoff is not very sophisticated, but it is easy to understand, and would only require a few lines of code.
Answer: While not all of the code from the packages like ggplot2, magrittr and dplyr was removed, an effort was made to explain the code in more detail in order to give the inexperienced user the possibility to understand what it does. In particular, we think that ggplot2 is by now very commonly used and probably becoming a defacto standard plotting package.
In the low-intensity filtering step, the fitting of a normal distribution to the histogram was eliminated for the sake of clarity. Instead, we now visually set a vertical cutoff line to the histogram and filter genes with a lower intensity than the cutoff in at least as many samples as the smallest experimental group has.

## comment 4
The learning curve for R and Bioconductor is steep, and making this manuscript both simpler in terms of the code, and more focused by excluding most of the statistics and any example analyses that do not involve the microarray data would make it more approachable for inexperienced users.
Answer: Thank you for this useful comment. We have made some effort to tailor